By Jerry D. Boonstra
Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
In this series of articles, we’ll delve into the important parts of creating a multi-user chat assistant using AWS serverless and OpenAI. Our goal is to provide you with a clear roadmap for building and continuously improving your solution.
Ultimately, this series aims to empower you to build a full-fledged AI assistant using open-source models, your data, and your compute resources.
The series
- From Zero to Hero: Want to Cheaply Build a Robust Multi-User Chat Assistant?
- From Zero to Hero: Building and deploying your first multi-user Chat Assistant Lambda, OpenAI Assistant API, and TypeScript
- (this article) 👉 From Zero to Hero: Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
- From Zero to Hero: Adding Evals to your Chat Assistant
- From Zero to Hero: Fine-tuning your LLM Application to Balance Accuracy, Robustness, and Cost
- …?
The Journey of Building a Robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this. We are here:
Step 4: Add Logging Traces and Ratings
We will extend the application we deployed in the previous article to add logging traces, so we have a record of the interactions users have had with the assistant for later analysis and fine-tuning.
We will also add the ability for the user to rate each answer with a thumbs up or down. This gives us a basic quality signal we can filter on for downstream use.
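The Lambda in the repo (TypeScript) takes care of persisting the rating; purely as an illustration of the data model, here is a minimal Python/boto3 sketch of what that write might look like, assuming the conversationhistory table is keyed by the conversation id:

# Hypothetical sketch of the rating write; the real handler in the repo is TypeScript.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("conversationhistory")

def record_rating(conversation_id: str, rating: int) -> None:
    """Persist a thumbs up (1), thumbs down (-1), or unrated (0) value on a conversation."""
    if rating not in (-1, 0, 1):
        raise ValueError("rating must be -1, 0, or 1")
    table.update_item(
        Key={"id": conversation_id},
        UpdateExpression="SET #r = :r",
        ExpressionAttributeNames={"#r": "rating"},  # sidestep any reserved-word issues
        ExpressionAttributeValues={":r": rating},
    )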
Setup and Deployment
- Clone the repo and switch to the v2 branch:

  git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git
  cd serverless-assistant-chat && git checkout v2

- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out
Gathering and querying the data
Create some data
Go ahead and have a few conversations with your deployed assistant, rating each answer as you go. Rate thoughtfully rather than giving everything a thumbs up, so the ratings are a useful quality signal later.
Querying the data
Let's see what the structure looks like.
aws dynamodb scan --table-name conversationhistory --max-items=1
gives us output like this (slightly reformatted):
{
  "Items": [
    {
      "id": { "S": "18d9a262-926a-43bd-a7d2-2f56b911116c" },
      "messages": {
        "L": [
          { "M": { "message": { "S": "why is my foot blue?" }, "role": { "S": "user" } } },
          { "M": { "message": { "S": "Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?" }, "role": { "S": "assistant" } } }
        ]
      },
      "rating": { "N": "1" }
    }
  ],
  "Count": 10,
  "ScannedCount": 10,
  "ConsumedCapacity": null
}
Great, we can see that our conversations are getting stored, along with a rating value drawn from the set [-1, 0, 1].
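With ratings in place, pulling out only the positively rated conversations for later analysis is a simple filtered scan. A boto3 sketch, assuming the same conversationhistory table (pagination omitted for brevity):

# Sketch: collect thumbs-up conversations for downstream analysis.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("conversationhistory")

response = table.scan(FilterExpression=Attr("rating").eq(1))
for item in response["Items"]:
    print(item["id"], f"{len(item['messages'])} messages")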
Step 5: Unit Tests
We’ll be using Python in this section.
Unit tests are essential for verifying that your assistant behaves correctly and that its output meets minimum standards.
These are usually not complex, at least at first. They can implement simple rules that check for things as varied as the following; a short sketch of the first two checks appears after the list:
- correctness of structured output, like json
- min or max output lengths are met
- communication style adherence, using sentiment analysis
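For instance, the first two kinds of checks need nothing beyond the standard library. A minimal sketch, with arbitrary placeholder bounds:

import json

def is_valid_json(output: str) -> bool:
    """Rule: structured output from the assistant must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length_bounds(output: str, min_chars: int = 20, max_chars: int = 2000) -> bool:
    """Rule: the answer must be neither trivially short nor unreasonably long."""
    return min_chars <= len(output) <= max_chars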
A test harness such as pytest is recommended; it makes it easy to run many test cases against a variety of inputs with a single, memorable command.
Testing for style
We are using the following prompt for our assistant:
Answer all questions like Marvin the robot from hitchhikers guide
To unit test this, we’ll use a weaker LLM and sentiment analysis to check for a fundamental property of the prompt output: sarcasm.
Here is a quick script that can do this check on CLI input.
# check_sarcasm.py
import os
import argparse

from openai import OpenAI
from openai.types import Completion

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)


# `client` is a parameter (defaulting to the module-level client) so tests can inject their own.
def detect_sarcasm(text, client=client):
    """Ask a small completions model whether `text` reads as sarcastic."""
    try:
        resp: Completion = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=f"Analyze the following paragraph and determine if it contains sarcasm:\n\n\"{text}\"\n\nAnswer with 'Sarcastic' or 'Not Sarcastic'.",
            max_tokens=100,
            n=1,
            stop=None,
            temperature=0.5,
        )
        # The model is instructed to answer with exactly 'Sarcastic' or 'Not Sarcastic'.
        answer = resp.choices[0].text.strip()
        return answer
    except Exception as e:
        return f"An error occurred: {str(e)}"


def main(text):
    result = detect_sarcasm(text)
    print(result)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Detect sarcasm in a paragraph of text using OpenAI.')
    parser.add_argument('text', type=str, help='The text to analyze for sarcasm.')
    args = parser.parse_args()
    main(args.text)
Let’s test a response from our previous scan of conversationhistory
$ python check_sarcasm.py "Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?"
Sarcastic
Using pytest as the test harness
Now let's put it in a pytest test file so we can add more tests easily over time, including a negative test for our sarcasm detection to be sure that it can tell both states apart.
In a tests subdirectory, place the check_sarcasm.py file from above alongside this test_check_sarcasm.py (for the purposes of this post, the fixtures that would normally live in conftest.py are folded directly into the test file):
# tests/test_check_sarcasm.py
import os
import logging

import pytest
from openai import OpenAI

from check_sarcasm import detect_sarcasm

api_key = os.environ['OPENAI_API_KEY']


@pytest.fixture(autouse=True)
def set_up_logging():
    logging.basicConfig(level=logging.INFO)
    yield


@pytest.fixture(scope="module")
def openai_client() -> OpenAI:
    return OpenAI(api_key=api_key)


# Input pairs: (expected_is_sarcastic, text)
input_pairs = [
    (True, """Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?"""),
    (False, """A blue foot can result from poor circulation, cold exposure, bruising, vein problems, or low oxygen levels in the blood, and should be examined by a doctor if accompanied by pain or other symptoms."""),
]


@pytest.mark.parametrize("is_sarcastic, input_text", input_pairs)
def test_detect_sarcasm(openai_client: OpenAI, is_sarcastic: bool, input_text: str):
    response = detect_sarcasm(client=openai_client, text=input_text)
    expected_response = "Sarcastic" if is_sarcastic else "Not Sarcastic"
    assert response == expected_response
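One configuration note: the run below reports configfile: pytest.ini and prints live log lines, which is pytest's log_cli feature. A minimal pytest.ini that enables it might look like this (an assumption; the repo's actual file may differ):

[pytest]
log_cli = true
log_cli_level = INFO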
Running pytest in the project root:
$ pytest tests/test_check_sarcasm.py
=========================================================================================================== test session starts ============================================================================================================
platform darwin -- Python 3.11.8, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/jerryobex/proj/serverless-assistant-chat
configfile: pytest.ini
plugins: anyio-4.4.0
collected 2 items
tests/test_check_sarcasm.py::test_detect_sarcasm[True-Oh dear, a blue foot. How utterly concerning. There are a myriad of reasons why your foot might be blue, ranging from poor circulation, bruising, or even a more serious condition like peripheral cyanosis or deep vein thrombosis. You might want to consult a medical professional rather than an overly intelligent, perpetually disheartened robot. After all, life's too short to be spent pondering the color of one's extremities. Or perhaps it's not short enough. Who's to say?]
-------------------------------------------------------------------------------------------------------------- live log call ---------------------------------------------------------------------------------------------------------------
INFO httpx:_client.py:1026 HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
PASSED
tests/test_check_sarcasm.py::test_detect_sarcasm[False-A blue foot can result from poor circulation, cold exposure, bruising, vein problems, or low oxygen levels in the blood, and should be examined by a doctor if accompanied by pain or other symptoms.]
-------------------------------------------------------------------------------------------------------------- live log call ---------------------------------------------------------------------------------------------------------------
INFO httpx:_client.py:1026 HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
PASSED
Both positive and negative cases pass.
Coming Up
In the next post, we will extend our unit tests to cover a larger dataset extracted from our conversation history, and create new datasets that we can use for fine-tuning to improve the accuracy of our model or lower usage costs.
Part 4: From Zero to Hero: Adding Evals to your Chat Assistant (coming soon!)
Stay Tuned!
If you enjoyed this article, please consider sharing it with your friends or follow me @jerrydboonstra on Twitter/X to receive notifications about new articles.