By Jerry D. Boonstra
If you want to make an apple pie from scratch, you must first create the universe. — Carl Sagan
In this series of articles, we’ll delve into the important parts of creating a multi-user chat assistant using AWS serverless and OpenAI. Our goal is to provide you with a clear roadmap for building and continuously improving your solution.
Ultimately, this series aims to empower you to build a full-fledged AI assistant using open-source models, your data, and your compute resources.
The series
- From Zero to Hero: Want to Cheaply Build a Robust Multi-User Chat Assistant?
- (this article) 👉 From Zero to Hero: Building and deploying your first multi-user Chat Assistant with Lambda, the OpenAI Assistant API, and TypeScript
- From Zero to Hero: Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
- From Zero to Hero: Adding Evals to your Chat Assistant
- From Zero to Hero: Fine-tuning your LLM Application to Balance Accuracy, Robustness, and Cost
- …?
The Journey of Building a Robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this. We are here:
We created an Assistant instance at OpenAI and used the playground to do prompt engineering to get something that works most of the time. We completed Steps #0, 1, and 2 in a multi-step process.
Step 3: Building our prototype application
We want multiple users to be able to interact with our assistant as we continue our journey. It’s time to architect this multi-user prototype.
Why OpenAI?
OpenAI’s Assistant API is a robust tool for creating sophisticated chatbots. With the gpt-4o model, it is state-of-the-art.
Since its release, it has introduced features like:
- State Management: Maintain conversation context across multiple interactions.
- Knowledge Augmentation: Access up to 10,000 reference documents to enrich your chatbot’s responses.
- Code Execution: Execute code snippets generated during conversations, adding a layer of dynamic interaction.
- Function Calling: Call external functions or APIs in real time.
- Streaming Output: Stream partial results for immediate feedback.
- Fine-Tuned Models: Tailor responses by running the assistant on your own fine-tuned model.
These features enable us to build a powerful, flexible chat assistant with a wide range of applications.
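To make these features concrete, here is a minimal sketch of a single conversational turn using the OpenAI Node.js SDK, assuming an Assistant already exists (ours was created in the Playground in the previous article) and that its ID is available in an ASSISTANT_ID environment variable:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function chatOnce() {
  // Each conversation lives in its own thread; OpenAI stores the message history.
  const thread = await openai.beta.threads.create();

  // Append the user's message to the thread.
  await openai.beta.threads.messages.create(thread.id, {
    role: "user",
    content: "Tell me a joke about serverless computing.",
  });

  // Run the assistant on the thread and stream partial output as it arrives.
  const stream = openai.beta.threads.runs.stream(thread.id, {
    assistant_id: process.env.ASSISTANT_ID!, // the Assistant created earlier
  });
  stream.on("textDelta", (delta) => process.stdout.write(delta.value ?? ""));
  await stream.finalRun();
}

chatOnce();
```

The thread carries the conversation state, so the application code only ever appends new messages and starts runs; this is the property we lean on later to keep our Lambda functions stateless.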
Eventually you might want to move away from OpenAI due to cost or privacy concerns. We plan to provide a roadmap to do that, but either way it is good to have a comparison baseline.
Why Serverless?
Choosing a serverless architecture offers several advantages for our chat assistant:
- Cost Efficiency: You only pay for what you use, making it cost-effective for applications with variable traffic.
- Scalability: Serverless platforms automatically scale to handle demand, ensuring your application remains responsive under load.
- Maintenance-Free: You don’t have to worry about managing servers, allowing you to focus on development.
However, it’s essential to be aware of limitations such as execution timeouts and cold start delays, which we’ll address in detail.
Can we build it?
When building our assistant using serverless, we need to address and work around certain limitations, including:
| Aspect | Limitation | Observation/Workaround |
|---|---|---|
| Execution Timeout | API Gateway imposes a maximum integration timeout of roughly 30 seconds on synchronous connections. | The vast majority of single-turn responses complete in under 30 seconds. For the rest, we should be able to rely on WebSocket auto-reconnect plus the stateful, idempotent nature of the Assistant API. This is an area of uncertainty that we will have to test. |
| Execution Context Persistence | State and data are not preserved across invocations, except for the /tmp directory and in-memory variables within a single container. | Leverage the stateful nature of the Assistant API. Use DynamoDB to track the Assistant thread used per user across stateless Lambda invocations. |
| Library and Dependency Management | Some libraries, especially those requiring native dependencies, can be difficult to install and use in Lambda, and a function can use at most 5 layers. | The OpenAI Node.js SDK has no native dependencies. We use a Lambda layer to provide this library and its dependencies to our application backend. |
| Memory and CPU Limitations | Lambda functions can allocate at most 10 GB of memory, and CPU performance scales with the memory allocated. | A Lambda function for our purposes is unlikely to need more than 1 GB of memory, and in practice uses less. Inference can be memory- and GPU-intensive, but it runs on OpenAI's infrastructure. |
| Deployment Package Size | Deployment packages are limited to 50 MB (zipped) for direct uploads, and the unzipped package, including layers, cannot exceed 250 MB. | The OpenAI library and its dependencies weigh in at ~17 MB uncompressed, which leaves roughly 233 MB of headroom for other libraries in your Lambda application. |
| Cold Start Delays | A cold start is the extra latency incurred when a serverless function (such as AWS Lambda) is invoked for the first time or after sitting idle. | Usually not a deal breaker for low-traffic applications. Cold starts can be mitigated by keeping functions warm, increasing function memory allocation, or using Provisioned Concurrency. |
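The Execution Context Persistence row deserves a concrete example. Below is a minimal sketch of the per-user thread lookup, assuming a hypothetical DynamoDB table keyed on userId; the table and attribute names in the actual repo may differ:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";
import OpenAI from "openai";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const openai = new OpenAI();
const TABLE = process.env.THREADS_TABLE ?? "assistant-threads"; // hypothetical table name

// Look up the user's existing Assistant thread, or create one and persist the mapping.
export async function getOrCreateThreadId(userId: string): Promise<string> {
  const { Item } = await ddb.send(new GetCommand({ TableName: TABLE, Key: { userId } }));
  if (Item?.threadId) return Item.threadId as string;

  const thread = await openai.beta.threads.create();
  await ddb.send(new PutCommand({ TableName: TABLE, Item: { userId, threadId: thread.id } }));
  return thread.id;
}
```

Because OpenAI stores the conversation itself, the only state we persist on the AWS side is this small userId-to-threadId mapping.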
Costs
OpenAI
The Assistants API can be used with a selection of models that vary in cost and quality. By default, this project uses the gpt-4o model to demonstrate its humorous example prompt; gpt-3.5-turbo offers roughly a 10x cost savings. Ultimately, whether you can use the ultra-low-cost option will depend on your use case.
- You can get great results using gpt-4o, but as of June 2024 inference costs $5 per 1M input tokens and $15 per 1M output tokens.
- For an ultra-low-cost solution you can choose gpt-3.5-turbo, which isn't free but is getting close: inference costs $0.50 per 1M input tokens and $1.50 per 1M output tokens.
Since an assistant instance is easy to spin up or modify using our codebase, you can easily change models and compare results to quickly determine which direction to go in.
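To put the per-token prices in perspective, here is a back-of-the-envelope estimate; the traffic figures below are illustrative assumptions, not measurements:

```typescript
// Rough monthly cost at June 2024 gpt-4o prices ($5 / 1M input tokens, $15 / 1M output tokens).
const messagesPerMonth = 10_000;
const inputTokensPerMessage = 1_000;  // system prompt + history + user message
const outputTokensPerMessage = 300;   // assistant reply

const inputCost = (messagesPerMonth * inputTokensPerMessage / 1_000_000) * 5;    // $50
const outputCost = (messagesPerMonth * outputTokensPerMessage / 1_000_000) * 15; // $45
console.log(`~$${(inputCost + outputCost).toFixed(2)} per month`); // ~$95.00 per month
```

At gpt-3.5-turbo prices, the same assumed traffic would cost roughly a tenth of that.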
AWS
For lower-traffic applications, it’s unlikely you will exceed the free tier.
Architecture
Here is our architecture diagram.
We are using:
- Amazon CloudFront
- Amazon S3
- Amazon API Gateway with Websockets
- Amazon Cognito
- AWS Lambda
- Amazon DynamoDB
- AWS Secrets Manager
- OpenAI Assistant API v2
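To show how these pieces fit together, here is a simplified sketch of a WebSocket "send message" Lambda handler: it appends the user's message to their thread, streams a run of the assistant, and pushes each text delta back through the API Gateway Management API. This is an illustration rather than the repo's actual handler; the message shape and environment variable names are assumptions.

```typescript
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from "@aws-sdk/client-apigatewaymanagementapi";
import type { APIGatewayProxyWebsocketEventV2 } from "aws-lambda";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const handler = async (event: APIGatewayProxyWebsocketEventV2) => {
  const { domainName, stage, connectionId } = event.requestContext;
  // Client used to push data back down the caller's WebSocket connection.
  const apigw = new ApiGatewayManagementApiClient({
    endpoint: `https://${domainName}/${stage}`,
  });

  // Assumed message shape sent by the frontend: { threadId, text }.
  const { threadId, text } = JSON.parse(event.body ?? "{}");

  // Append the user's message to their thread, then stream a run of the assistant.
  await openai.beta.threads.messages.create(threadId, { role: "user", content: text });
  const stream = await openai.beta.threads.runs.create(threadId, {
    assistant_id: process.env.ASSISTANT_ID!,
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.event === "thread.message.delta") {
      const part = chunk.data.delta.content?.[0];
      if (part?.type === "text" && part.text?.value) {
        await apigw.send(
          new PostToConnectionCommand({
            ConnectionId: connectionId,
            Data: Buffer.from(part.text.value),
          })
        );
      }
    }
  }
  return { statusCode: 200 };
};
```

Because the run is resumable on OpenAI's side, a client that reconnects after the ~30-second API Gateway timeout can fetch the completed message from the same thread.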
Requirements
You will need:
- an AWS account configured for CLI use, with permissions broad enough to deploy entire CloudFormation stacks containing serverless components.
- an OpenAI account and API key, for storing your Assistant instance and for billing.
Code can be found at https://github.com/jerrydboonstra/serverless-assistant-chat/tree/v1
After cloning the codebase, we’ll need to
- create our local environment
- create our Assistant instance (see the sketch below)
- create our backend deployment bucket
before running our CloudFormation template to create our stack.
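For the Assistant-creation step, a minimal sketch using the OpenAI Node.js SDK looks like this; the repo provides its own script for it, so the name and instructions below are placeholders:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function createAssistant() {
  const assistant = await openai.beta.assistants.create({
    model: "gpt-4o", // or "gpt-3.5-turbo" for the ultra-low-cost option
    name: "serverless-assistant-chat",                              // placeholder name
    instructions: "You are a helpful, mildly humorous assistant.",  // placeholder prompt
  });
  // Keep this ID: the backend stack needs it (for example, passed in as a parameter or secret).
  console.log(`Created assistant: ${assistant.id}`);
}

createAssistant();
```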
Setup and Deployment
- Clone the repo and switch to the v1 branch:
  `git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git`
  `cd serverless-assistant-chat && git checkout v1`
- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out
Making changes after stack deployment
A process is provided to make changes after deployment, allowing for iterative development.
Step 4: From Zero to Hero: Adding Logging Traces and Ratings to your Chat Assistant
Now that your appetite is whetted, let's make the stack production-ready by adding logging traces and enabling multiple users.
Part 3: From Zero to Hero: Adding Logging Traces and Ratings to your Chat Assistant
Stay Tuned!
If you enjoyed this article, please consider sharing it with your friends or follow me @jerrydboonstra on Twitter/X to receive notifications about new articles.