By Jerry D. Boonstra
If you want to make an apple pie from scratch, you must first create the universe. — Carl Sagan
In this series of articles, we’ll delve into the important parts of creating a multi-user chat assistant using AWS serverless and OpenAI. Our goal is to provide you with a clear roadmap for building and continuously improving your solution.
Ultimately, this series aims to empower you to build a full-fledged AI assistant using open-source models, your data, and your compute resources.
The series
- From Zero to Hero: Want to Cheaply Build a Robust Multi-User Chat Assistant?
- (this article) 👉 From Zero to Hero: Building and deploying your first multi-user Chat Assistant with Lambda, the OpenAI Assistant API, and TypeScript
- From Zero to Hero: Adding Logging Traces, Ratings and Unit Tests to your Chat Assistant
- From Zero to Hero: Adding Evals to your Chat Assistant
- From Zero to Hero: Fine-tuning your LLM Application to Balance Accuracy, Robustness, and Cost
- …?
The Journey of Building a Robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this. We are here:
We created an Assistant instance at OpenAI and used the playground to do prompt engineering to get something that works most of the time. We completed Steps #0, 1, and 2 in a multi-step process.
Step 3: Building our prototype application
We want multiple users to be able to interact with our assistant as we continue our journey. It’s time to architect this multi-user prototype.
Why OpenAI?
OpenAI’s Assistant API is a robust tool for creating sophisticated chatbots. With the gpt-4o model, it is state-of-the-art.
Since its release, it has introduced features like:
- State Management: Maintain conversation context across multiple interactions.
- Knowledge Augmentation: Access up to 10,000 reference documents to enrich your chatbot’s responses.
- Code Execution: Execute code snippets generated during conversations, adding a layer of dynamic interaction.
- Function Calling: Call external functions or APIs in real time.
- Streaming Output: Stream partial results for immediate feedback.
- Fine-Tuned Models: Tailor responses by running the assistant on your own fine-tuned model.
These features enable us to build a powerful, flexible chat assistant with a wide range of applications.
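To make these features concrete, here is a minimal sketch of a single conversational turn using the OpenAI Node.js SDK, assuming an Assistant already exists (ours was created in the Playground in the previous article) and that its ID is available in an ASSISTANT_ID environment variable:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function chatOnce() {
  // Each conversation lives in its own thread; OpenAI stores the message history.
  const thread = await openai.beta.threads.create();

  // Append the user's message to the thread.
  await openai.beta.threads.messages.create(thread.id, {
    role: "user",
    content: "Tell me a joke about serverless computing.",
  });

  // Run the assistant on the thread and stream partial output as it arrives.
  const stream = openai.beta.threads.runs.stream(thread.id, {
    assistant_id: process.env.ASSISTANT_ID!, // the Assistant created earlier
  });
  stream.on("textDelta", (delta) => process.stdout.write(delta.value ?? ""));
  await stream.finalRun();
}

chatOnce();
```

The thread carries the conversation state, so the application code only ever appends new messages and starts runs; this is the property we lean on later to keep our Lambda functions stateless.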
Eventually you might want to move away from OpenAI due to cost or privacy concerns. We plan to provide a roadmap to do that, but either way it is good to have a comparison baseline.
Why Serverless?
Choosing a serverless architecture offers several advantages for our chat assistant:
- Cost Efficiency: You only pay for what you use, making it cost-effective for applications with variable traffic.
- Scalability: Serverless platforms automatically scale to handle demand, ensuring your application remains responsive under load.
- Maintenance-Free: You don’t have to worry about managing servers, allowing you to focus on development.
However, it’s essential to be aware of limitations such as execution timeouts and cold start delays, which we’ll address in detail.
Can we build it?
When building our assistant using serverless, we need to address and work around certain limitations, including:
| Aspect | Limitation | Observation/Workaround |
|---|---|---|
| Execution Timeout | API Gateway imposes a maximum integration timeout of roughly 30 seconds on synchronous connections. | The vast majority of single-turn responses complete in under 30 seconds. For the rest, we should be able to rely on WebSocket auto-reconnect plus the stateful, idempotent nature of the Assistant API. This is an area of uncertainty that we will have to test. |
| Execution Context Persistence | State and data are not preserved across invocations, except for the /tmp directory and in-memory variables within a single container. | Leverage the stateful nature of the Assistant API. Use DynamoDB to track the Assistant thread used per user across stateless Lambda invocations. |
| Library and Dependency Management | Some libraries, especially those requiring native dependencies, can be difficult to install and use in Lambda, and a function can use at most 5 layers. | The OpenAI Node.js SDK has no native dependencies. We use a Lambda layer to provide this library and its dependencies to our application backend. |
| Memory and CPU Limitations | Lambda functions can allocate at most 10 GB of memory, and CPU performance scales with the memory allocated. | A Lambda function for our purposes is unlikely to need more than 1 GB of memory, and in practice uses less. Inference can be memory- and GPU-intensive, but it runs on OpenAI's infrastructure. |
| Deployment Package Size | Deployment packages are limited to 50 MB (zipped) for direct uploads, and the unzipped package, including layers, cannot exceed 250 MB. | The OpenAI library and its dependencies weigh in at ~17 MB uncompressed, which leaves roughly 233 MB of headroom for other libraries in your Lambda application. |
| Cold Start Delays | A cold start is the extra latency incurred when a serverless function (such as AWS Lambda) is invoked for the first time or after sitting idle. | Usually not a deal breaker for low-traffic applications. Cold starts can be mitigated by keeping functions warm, increasing function memory allocation, or using Provisioned Concurrency. |
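The Execution Context Persistence row deserves a concrete example. Below is a minimal sketch of the per-user thread lookup, assuming a hypothetical DynamoDB table keyed on userId; the table and attribute names in the actual repo may differ:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";
import OpenAI from "openai";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const openai = new OpenAI();
const TABLE = process.env.THREADS_TABLE ?? "assistant-threads"; // hypothetical table name

// Look up the user's existing Assistant thread, or create one and persist the mapping.
export async function getOrCreateThreadId(userId: string): Promise<string> {
  const { Item } = await ddb.send(new GetCommand({ TableName: TABLE, Key: { userId } }));
  if (Item?.threadId) return Item.threadId as string;

  const thread = await openai.beta.threads.create();
  await ddb.send(new PutCommand({ TableName: TABLE, Item: { userId, threadId: thread.id } }));
  return thread.id;
}
```

Because OpenAI stores the conversation itself, the only state we persist on the AWS side is this small userId-to-threadId mapping.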
Costs
OpenAI
The Assistants API can be used with a selection of models that vary in cost and quality. By default, this project uses the gpt-4o model to demonstrate its humorous example prompt; gpt-3.5-turbo offers roughly a 10x cost savings. Ultimately, whether you can use the ultra-low-cost option will depend on your use case.
- You can get great results using gpt-4o, but as of June 2024 inference costs $5 per 1M input tokens and $15 per 1M output tokens.
- For an ultra-low-cost solution you can choose gpt-3.5-turbo, which isn't free but is getting close: inference costs $0.50 per 1M input tokens and $1.50 per 1M output tokens.
Since an assistant instance is easy to spin up or modify using our codebase, you can easily change models and compare results to quickly determine which direction to go in.
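To put the per-token prices in perspective, here is a back-of-the-envelope estimate; the traffic figures below are illustrative assumptions, not measurements:

```typescript
// Rough monthly cost at June 2024 gpt-4o prices ($5 / 1M input tokens, $15 / 1M output tokens).
const messagesPerMonth = 10_000;
const inputTokensPerMessage = 1_000;  // system prompt + history + user message
const outputTokensPerMessage = 300;   // assistant reply

const inputCost = (messagesPerMonth * inputTokensPerMessage / 1_000_000) * 5;    // $50
const outputCost = (messagesPerMonth * outputTokensPerMessage / 1_000_000) * 15; // $45
console.log(`~$${(inputCost + outputCost).toFixed(2)} per month`); // ~$95.00 per month
```

At gpt-3.5-turbo prices, the same assumed traffic would cost roughly a tenth of that.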
AWS
For lower-traffic applications, it’s unlikely you will exceed the free tier.
Architecture
Here is our architecture diagram.
We are using:
- Amazon CloudFront
- Amazon S3
- Amazon API Gateway with Websockets
- Amazon Cognito
- AWS Lambda
- Amazon DynamoDB
- AWS Secrets Manager
- OpenAI Assistant API v2
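To show how these pieces fit together, here is a simplified sketch of a WebSocket "send message" Lambda handler: it appends the user's message to their thread, streams a run of the assistant, and pushes each text delta back through the API Gateway Management API. This is an illustration rather than the repo's actual handler; the message shape and environment variable names are assumptions.

```typescript
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from "@aws-sdk/client-apigatewaymanagementapi";
import type { APIGatewayProxyWebsocketEventV2 } from "aws-lambda";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const handler = async (event: APIGatewayProxyWebsocketEventV2) => {
  const { domainName, stage, connectionId } = event.requestContext;
  // Client used to push data back down the caller's WebSocket connection.
  const apigw = new ApiGatewayManagementApiClient({
    endpoint: `https://${domainName}/${stage}`,
  });

  // Assumed message shape sent by the frontend: { threadId, text }.
  const { threadId, text } = JSON.parse(event.body ?? "{}");

  // Append the user's message to their thread, then stream a run of the assistant.
  await openai.beta.threads.messages.create(threadId, { role: "user", content: text });
  const stream = await openai.beta.threads.runs.create(threadId, {
    assistant_id: process.env.ASSISTANT_ID!,
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.event === "thread.message.delta") {
      const part = chunk.data.delta.content?.[0];
      if (part?.type === "text" && part.text?.value) {
        await apigw.send(
          new PostToConnectionCommand({
            ConnectionId: connectionId,
            Data: Buffer.from(part.text.value),
          })
        );
      }
    }
  }
  return { statusCode: 200 };
};
```

Because the run is resumable on OpenAI's side, a client that reconnects after the ~30-second API Gateway timeout can fetch the completed message from the same thread.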
Requirements
You will need:
- an AWS account configured for CLI use, with permissions broad enough to deploy entire CloudFormation stacks containing serverless components.
- an OpenAI account and API key, for storing your Assistant instance and for billing.
Code can be found at https://github.com/jerrydboonstra/serverless-assistant-chat/tree/v1
After cloning the codebase, we’ll need to
- create our local environment
- create our Assistant instance (see the sketch below)
- create our backend deployment bucket
before running our CloudFormation template to create our stack.
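For the Assistant-creation step, a minimal sketch using the OpenAI Node.js SDK looks like this; the repo provides its own script for it, so the name and instructions below are placeholders:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function createAssistant() {
  const assistant = await openai.beta.assistants.create({
    model: "gpt-4o", // or "gpt-3.5-turbo" for the ultra-low-cost option
    name: "serverless-assistant-chat",                              // placeholder name
    instructions: "You are a helpful, mildly humorous assistant.",  // placeholder prompt
  });
  // Keep this ID: the backend stack needs it (for example, passed in as a parameter or secret).
  console.log(`Created assistant: ${assistant.id}`);
}

createAssistant();
```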
Setup and Deployment
- Clone the repo and switch to the v1 branch:
  `git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git`
  `cd serverless-assistant-chat && git checkout v1`
- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out
Making changes after stack deployment
A process is provided to make changes after deployment, allowing for iterative development.
Step 4: From Zero to Hero: Adding Logging Traces and Ratings to your Chat Assistant
Now that your appetite is whetted, let's make the stack production-ready by adding logging traces and enabling multiple users.
Part 3: From Zero to Hero: Adding Logging Traces and Ratings to your Chat Assistant
Stay Tuned!
If you enjoyed this article, please consider sharing it with your friends or follow me @jerrydboonstra on Twitter/X to receive notifications about new articles.