By Jerry D. Boonstra
Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
In this series of articles, we’ll delve into the important parts of creating a multi-user chat assistant using AWS serverless and OpenAI. Our goal is to provide you with a clear roadmap for building and continuously improving your solution.
Ultimately, this series aims to empower you to build a full-fledged AI assistant using open-source models, your data, and your compute resources.
The series
- From Zero to Hero: Want to Cheaply Build a Robust Multi-User Chat Assistant?
- From Zero to Hero: Building and deploying your first multi-user Chat Assistant Lambda, OpenAI Assistant API, and TypeScript
- (this article) 👉 From Zero to Hero: Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
- From Zero to Hero: Adding Evals to your Chat Assistant
- From Zero to Hero: Fine-tuning your LLM Application to Balance Accuracy, Robustness, and Cost
- …?
The Journey of Building a Robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this. We are here:
Step 4: Add Logging Traces and Ratings
We will extend the application we deployed in the previous article to add logging traces, so we have a record of the interactions users have had with the assistant for later analysis and fine-tuning.
We will also add the ability for the user to rate each answer with a thumbs up or down. This gives us a basic quality signal we can filter on for downstream use.
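The Lambda in the repo (TypeScript) takes care of persisting the rating; purely as an illustration of the data model, here is a minimal Python/boto3 sketch of what that write might look like, assuming the conversationhistory table is keyed by the conversation id:

# Hypothetical sketch of the rating write; the real handler in the repo is TypeScript.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("conversationhistory")

def record_rating(conversation_id: str, rating: int) -> None:
    """Persist a thumbs up (1), thumbs down (-1), or unrated (0) value on a conversation."""
    if rating not in (-1, 0, 1):
        raise ValueError("rating must be -1, 0, or 1")
    table.update_item(
        Key={"id": conversation_id},
        UpdateExpression="SET #r = :r",
        ExpressionAttributeNames={"#r": "rating"},  # sidestep any reserved-word issues
        ExpressionAttributeValues={":r": rating},
    )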
Setup and Deployment
- Clone the repo and switch to the v2 branch:

  git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git
  cd serverless-assistant-chat && git checkout v2

- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out
Gathering and querying the data
Create some data
Go ahead and have a few conversations with your deployed assistant, rating each answer as you go. Rate thoughtfully rather than giving everything a thumbs up, so the ratings are a useful quality signal later.
Querying the data
Let's see what the structure looks like.
aws dynamodb scan --table-name conversationhistory --max-items=1
gives us output like this (slightly reformatted):
{
  "Items": [
    {
      "id": { "S": "18d9a262-926a-43bd-a7d2-2f56b911116c" },
      "messages": {
        "L": [
          { "M": { "message": { "S": "why is my foot blue?" }, "role": { "S": "user" } } },
          { "M": { "message": { "S": "Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?" }, "role": { "S": "assistant" } } }
        ]
      },
      "rating": { "N": "1" }
    }
  ],
  "Count": 10,
  "ScannedCount": 10,
  "ConsumedCapacity": null
}
Great, we can see that our conversations are getting stored, along with a rating value drawn from the set [-1, 0, 1].
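With ratings in place, pulling out only the positively rated conversations for later analysis is a simple filtered scan. A boto3 sketch, assuming the same conversationhistory table (pagination omitted for brevity):

# Sketch: collect thumbs-up conversations for downstream analysis.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("conversationhistory")

response = table.scan(FilterExpression=Attr("rating").eq(1))
for item in response["Items"]:
    print(item["id"], f"{len(item['messages'])} messages")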
Step 5: Unit Tests
We’ll be using Python in this section.
Unit tests are essential for verifying that your assistant behaves correctly and that its output meets minimum standards.
These are usually not complex, at least at first. They can implement simple rules that check for things as varied as the following; a short sketch of the first two checks appears after the list:
- correctness of structured output, like json
- min or max output lengths are met
- communication style adherence, using sentiment analysis
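For instance, the first two kinds of checks need nothing beyond the standard library. A minimal sketch, with arbitrary placeholder bounds:

import json

def is_valid_json(output: str) -> bool:
    """Rule: structured output from the assistant must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length_bounds(output: str, min_chars: int = 20, max_chars: int = 2000) -> bool:
    """Rule: the answer must be neither trivially short nor unreasonably long."""
    return min_chars <= len(output) <= max_chars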
A test harness such as pytest is recommended; it makes it easy to run many test cases against a variety of inputs with a single, memorable command.
Testing for style
We are using the following prompt for our assistant:
Answer all questions like Marvin the robot from hitchhikers guide
To unit test this, we’ll use a weaker LLM and sentiment analysis to check for a fundamental property of the prompt output: sarcasm.
Here is a quick script that can do this check on CLI input.
# check_sarcasm.py
import os
import argparse

from openai import OpenAI
from openai.types import Completion

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)


# `client` is a parameter (defaulting to the module-level client) so tests can inject their own.
def detect_sarcasm(text, client=client):
    """Ask a small completions model whether `text` reads as sarcastic."""
    try:
        resp: Completion = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=f"Analyze the following paragraph and determine if it contains sarcasm:\n\n\"{text}\"\n\nAnswer with 'Sarcastic' or 'Not Sarcastic'.",
            max_tokens=100,
            n=1,
            stop=None,
            temperature=0.5,
        )
        # The model is instructed to answer with exactly 'Sarcastic' or 'Not Sarcastic'.
        answer = resp.choices[0].text.strip()
        return answer
    except Exception as e:
        return f"An error occurred: {str(e)}"


def main(text):
    result = detect_sarcasm(text)
    print(result)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Detect sarcasm in a paragraph of text using OpenAI.')
    parser.add_argument('text', type=str, help='The text to analyze for sarcasm.')
    args = parser.parse_args()
    main(args.text)
Let’s test a response from our previous scan of conversationhistory
$ python check_sarcasm.py "Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?"
Sarcastic
Using pytest as the test harness
Now let's put it in a pytest test file so we can add more tests easily over time, including a negative test for our sarcasm detection to be sure that it can tell both states apart.
In a tests subdirectory, place the check_sarcasm.py file from above alongside this test_check_sarcasm.py (for the purposes of this post, the fixtures that would normally live in conftest.py are folded directly into the test file):
# tests/test_check_sarcasm.py
import os
import logging

import pytest
from openai import OpenAI

from check_sarcasm import detect_sarcasm

api_key = os.environ['OPENAI_API_KEY']


@pytest.fixture(autouse=True)
def set_up_logging():
    logging.basicConfig(level=logging.INFO)
    yield


@pytest.fixture(scope="module")
def openai_client() -> OpenAI:
    return OpenAI(api_key=api_key)


# Input pairs: (expected_is_sarcastic, text)
input_pairs = [
    (True, """Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?"""),
    (False, """A blue foot can result from poor circulation, cold exposure, bruising, vein problems, or low oxygen levels in the blood, and should be examined by a doctor if accompanied by pain or other symptoms."""),
]


@pytest.mark.parametrize("is_sarcastic, input_text", input_pairs)
def test_detect_sarcasm(openai_client: OpenAI, is_sarcastic: bool, input_text: str):
    response = detect_sarcasm(client=openai_client, text=input_text)
    expected_response = "Sarcastic" if is_sarcastic else "Not Sarcastic"
    assert response == expected_response
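One configuration note: the run below reports configfile: pytest.ini and prints live log lines, which is pytest's log_cli feature. A minimal pytest.ini that enables it might look like this (an assumption; the repo's actual file may differ):

[pytest]
log_cli = true
log_cli_level = INFO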
Running pytest in the project root:
$ pytest tests/test_check_sarcasm.py
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.11.8, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/jerryobex/proj/serverless-assistant-chat
configfile: pytest.ini
plugins: anyio-4.4.0
collected 2 items
tests/test_check_sarcasm.py::test_detect_sarcasm[True-Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?]
-------------------------------------------------------------------------------------------------------------- live log call ---------------------------------------------------------------------------------------------------------------
INFO httpx:_client.py:1026 HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
PASSED
tests/test_check_sarcasm.py::test_detect_sarcasm[False-A blue foot can result from poor circulation, cold exposure, bruising, vein problems, or low oxygen levels in the blood, and should be examined by a doctor if accompanied by pain or other symptoms.]
-------------------------------------------------------------------------------------------------------------- live log call ---------------------------------------------------------------------------------------------------------------
INFO httpx:_client.py:1026 HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
PASSED
Both positive and negative cases pass.
Coming Up
In the next post, we will extend our unit tests to cover a larger dataset extracted from our conversation history, and create new datasets that we can use for fine-tuning to improve the accuracy of our model or lower usage costs.
Part 4: From Zero to Hero: Adding Evals to your Chat Assistant (coming soon!)
Stay Tuned!
If you enjoyed this article, please consider sharing it with your friends or follow me @jerrydboonstra on Twitter/X to receive notifications about new articles.