Building a Real-time AI Chatbot with Vision and Voice Capabilities using OpenAI, LiveKit, and Deepgram on GPU Droplets

Jan 08, 2025

Introduction

In this tutorial, you will learn how to build a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit, and Deepgram, deployed on DigitalOcean GPU Droplets. This chatbot will be able to engage in real-time conversations with users, analyze images captured from your camera, and provide accurate and timely responses.

Enhancing Chatbot Capabilities with Advanced Technologies

In this tutorial, you will leverage three powerful technologies to build your real-time AI chatbot, each serving a specific purpose that enhances the chatbot’s capabilities, all while using the robust infrastructure provided by DigitalOcean’s GPU Droplets:

  1. OpenAI API: The OpenAI API will generate human-like text responses based on user input. By employing advanced models like GPT-4o, our chatbot will be able to understand context, engage in meaningful conversations, and provide accurate answers to user queries. This is crucial for creating an interactive experience where users feel understood and valued.

  2. LiveKit: LiveKit will facilitate real-time audio and video communication between users and the chatbot. It allows us to create a seamless interaction experience, enabling users to speak to the chatbot and receive voice responses. This is essential for building a voice-enabled chatbot that can naturally engage users, making the interaction feel more personal and intuitive.

  3. Deepgram: Deepgram will be used for speech recognition, converting spoken language into text. This allows the chatbot to process user voice inputs effectively. By integrating Deepgram’s capabilities, you can ensure that the chatbot accurately understands user commands and queries, enhancing the overall interaction quality. This is particularly important in a real-time setting, where quick and accurate responses are essential for maintaining user engagement.

Why GPU Droplets?: Using DigitalOcean’s GPU Droplets is particularly beneficial for this setup, as they provide the computational and GPU infrastructure needed to power and handle the intensive processing required by these AI models and real-time communication. The GPUs are optimized for running AI/ML workloads, significantly speeding up model inference and video processing tasks. This ensures the chatbot can deliver responses quickly and efficiently, even under heavy load, improving user experience and engagement.

Prerequisites

Before you begin, ensure you have:

  • A DigitalOcean Cloud account.
  • A GPU Droplet deployed and running.
  • Basic knowledge of Python programming.
  • An OpenAI API key set up for using the GPT-4o model.
  • A LiveKit server up and running on your GPU Droplet.
  • A Deepgram API Key.

Step 1 - Set Up the GPU Droplet

1. Create a New Project - You will need to create a new project from the cloud control panel and tie it to a GPU Droplet.

2. Create a GPU Droplet - Log into your DigitalOcean account, create a new GPU Droplet, and choose AI/ML Ready as the OS. This OS image installs all the necessary NVIDIA GPU drivers. You can refer to our official documentation on how to create a GPU Droplet.

Create a GPU Droplet which is AI/ML Ready

3. Add an SSH Key for authentication - An SSH key is required to authenticate with the GPU Droplet, and by adding the SSH key, you can log in to the GPU Droplet from your terminal.

Add an SSH key for authentication

4. Finalize and Create the GPU Droplet - Once all of the above steps are completed, finalize and create a new GPU Droplet.

Create a GPU Droplet
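Once the Droplet is running and you have logged in over SSH, you can optionally verify that the NVIDIA drivers shipped with the AI/ML Ready image are working; nvidia-smi should list the GPU attached to your Droplet:

nvidia-smi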

Step 2 - Set up a LiveKit account and install the CLI on the GPU Droplet

First, you will need to create an account or sign in to your LiveKit Cloud account and create a LiveKit Project. Please note down the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET environment variables from the Project Settings page, as you will need them later in the tutorial.

Install the LiveKit CLI

The command below will install the LiveKit CLI on your GPU Droplet.

curl -sSL https://get.livekit.io/cli | bash
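To confirm that the CLI installed correctly, you can print its version (this assumes the install script placed the lk binary on your PATH):

lk --version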

For LiveKit Cloud users, you can authenticate the CLI with your Cloud project to create an API key and secret. This allows you to use the CLI without manually providing credentials each time.

lk cloud auth

Then, follow the instructions and log in from a browser.

You will be asked to add the device and authorize access to the LiveKit Project you created earlier in this step.

Authorize the app

Access granted

Step 3 - Bootstrap an agent from an existing LiveKit template

The template provides a working voice assistant to build on. The template includes:

  • Basic voice interaction
  • Audio-only track subscription
  • Voice activity detection (VAD)
  • Speech-to-text (STT)
  • Language model (LLM)
  • Text-to-speech (TTS)

Note: By default, the example agent uses Deepgram for STT and OpenAI for TTS and LLM. However, you aren’t required to use these providers.

Clone the starter template for a simple Python voice agent:

lk app create

This will show you multiple existing LiveKit templates that you can use to deploy an app.

Output

voice-assistant-frontend
transcription-frontend
token-server
multimodal-agent-python
multimodal-agent-node
voice-pipeline-agent-python
voice-pipeline-agent-node
android-voice-assistant
voice-assistant-swift
outbound-caller-python

You will use the voice-pipeline-agent-python template.

lk app create --template voice-pipeline-agent-python

Now, enter your application name, OpenAI API key, and Deepgram API key when prompted. If you aren’t using Deepgram and OpenAI, you can check out the other supported plugins.

Output

Cloning template...
Instantiating environment...
Cleaning up...
To setup and run the agent:

  cd /root/do-voice-vision-bot
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  python3 agent.py dev

Step 4 - Install dependencies and create a Virtual Environment

First, switch to your application’s directory, which was created in the last step.

cd <app_name>

You can list the files that were created from the template.

ls

Output

LICENSE README.md agent.py requirements.txt

Here, agent.py is the main application file, which contains the logic and source code for the AI chatbot.

Now, you will create and activate a Python virtual environment using the commands below:

apt install python3.10-venv
python3 -m venv venv

Add the following API keys to your environment:

export LIVEKIT_URL=<>
export LIVEKIT_API_KEY=<>
export LIVEKIT_API_SECRET=<>
export DEEPGRAM_API_KEY=<>
export OPENAI_API_KEY=<>

You can find the LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET on the LiveKit Project Settings page.
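As an alternative to exporting these variables in every new shell session, you can keep them in a .env file inside the project directory. The template already imports python-dotenv; check whether load_dotenv() is actually called near the top of agent.py and add the call if it is not. A minimal sketch with placeholder values you would replace with your own keys:

LIVEKIT_URL=wss://<your-project>.livekit.cloud
LIVEKIT_API_KEY=<your-livekit-api-key>
LIVEKIT_API_SECRET=<your-livekit-api-secret>
DEEPGRAM_API_KEY=<your-deepgram-api-key>
OPENAI_API_KEY=<your-openai-api-key>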

Activate the virtual environment:

source venv/bin/activate

Note: On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.

apt install python3.10-venv

Now, let’s install the dependencies required for the app to work.

python3 -m pip install -r requirements.txt

Step 5 - Add Vision Capabilities to your AI agent

To add vision capabilities to your agent, you will need to modify the agent.py file with the imports and functions below.

First, let’s start by adding these imports alongside the existing ones. Open your agent.py file using a text editor like vi or nano.

vi agent.py

Copy the imports below alongside the existing ones:

agent.py

from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage

These new imports include:

  • rtc: Access to LiveKit’s video functionality
  • ChatMessage and ChatImage: Classes you’ll use to send images to the LLM

Enable video subscription

Find the ctx.connect() line in the entrypoint function. Change AutoSubscribe.AUDIO_ONLY to AutoSubscribe.SUBSCRIBE_ALL:

agent.py

await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

Note: If you find it difficult to edit the agent.py file with the vi or nano text editor on the GPU Droplet, you can copy the agent.py file content to your local system, make the required edits in a code editor like VS Code, and then copy the updated code back.
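For example, assuming the app was created at /root/<app_name> on the Droplet, you could pull the file down, edit it locally, and push it back with scp (a hypothetical example; adjust the user, IP address, and paths to match your setup):

scp root@<your-droplet-ip>:/root/<app_name>/agent.py .
# edit agent.py locally in your code editor, then upload it again
scp agent.py root@<your-droplet-ip>:/root/<app_name>/agent.py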

This will enable the assistant to receive video tracks as well as audio.

Add video frame handling

Add these two helper functions after your imports but before the prewarm function:

agent.py

async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    raise ValueError("No remote video track found in the room")

This function searches through all participants to find an available video track. It is used to locate the video feed to process.

Now, add the frame capture function:

agent.py

async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.debug("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()

The purpose of this function is to capture a single frame from the video track while ensuring proper cleanup of resources. Calling aclose() releases system resources like memory buffers and video decoder instances, which helps prevent memory leaks.
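Returning from inside an async for loop while still running cleanup can look surprising at first. The short, self-contained sketch below (plain Python, no LiveKit types; DummyStream is a made-up stand-in for rtc.VideoStream) illustrates the same pattern: consume exactly one item from an asynchronous stream, then always close the stream in the finally block.

import asyncio


class DummyStream:
    """A made-up stand-in for an async stream such as rtc.VideoStream."""

    def __init__(self):
        self._count = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        await asyncio.sleep(0)  # pretend to wait for the next frame
        self._count += 1
        return f"frame-{self._count}"

    async def aclose(self):
        print("stream closed, resources released")


async def first_item(stream):
    try:
        async for item in stream:
            return item  # returning inside the loop keeps only the first item
    finally:
        await stream.aclose()  # cleanup runs even though we returned early


print(asyncio.run(first_item(DummyStream())))  # prints "frame-1" after the close message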

Add the LLM Callback

Now, inside the entrypoint function, add the callback function below, which will inject the latest video frame just before the LLM generates a response. Search for the entrypoint function inside the agent.py file:

agent.py

async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
    """
    Callback that runs right before the LLM generates a response.
    Captures the current video frame and adds it to the conversation context.
    """
    try:
        if not hasattr(assistant, '_room'):
            logger.warning("Room not available in assistant")
            return

        latest_image = await get_latest_image(assistant._room)
        if latest_image:
            image_content = [ChatImage(image=latest_image)]
            chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
            logger.debug("Added latest frame to conversation context")
        else:
            logger.warning("No image captured from video stream")
    except Exception as e:
        logger.error(f"Error in before_llm_cb: {e}")

This callback is the key to efficient context management: it only adds visual information when the assistant is about to respond. If visual information were added to every message, it would quickly fill up the LLM’s context window, which would be extremely inefficient and costly.
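If you want to be even stricter about context growth, you can drop frames injected on earlier turns before appending the new one, so the context never carries more than one image at a time. A minimal sketch of such a helper (not part of the template; it assumes, as in the callback above, that injected image messages are the ones whose content is a list containing a ChatImage, and it would live in agent.py next to before_llm_cb):

def prune_old_frames(chat_ctx: llm.ChatContext) -> None:
    """Remove previously injected image messages, keeping the text history."""
    chat_ctx.messages[:] = [
        msg
        for msg in chat_ctx.messages
        if not (
            isinstance(msg.content, list)
            and any(isinstance(part, ChatImage) for part in msg.content)
        )
    ]

Calling prune_old_frames(chat_ctx) at the top of before_llm_cb, before the new frame is appended, keeps at most one image in the context.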

Update the system prompt

Find the initial_ctx creation inside the entrypoint function and update it to include vision capabilities:

agent.py

initial_ctx = llm.ChatContext().append(
    role="system",
    text=(
        "You are a voice assistant created by LiveKit that can both see and hear. "
        "You should use short and concise responses, avoiding unpronounceable punctuation. "
        "When you see an image in our conversation, naturally incorporate what you see "
        "into your response. Keep visual descriptions brief but informative."
    ),
)

Update the assistant configuration

Find the VoicePipelineAgent creation inside the entrypoint function and add the callback:

agent.py

assistant = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=deepgram.STT(),
    llm=openai.LLM(),
    tts=openai.TTS(),
    chat_ctx=initial_ctx,
    before_llm_cb=before_llm_cb
)

The key update here is the before_llm_cb parameter, which uses the callback created earlier to inject the latest video frame into the conversation context.

Final agent.py file with voice & vision capabilities

This is how the agent.py file should look after adding all the necessary functions and imports:

agent.py

from asyncio.log import logger
from livekit import rtc
from livekit.agents.llm import ChatMessage, ChatImage
import logging
from dotenv import load_dotenv
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, deepgram, silero


async def get_video_track(room: rtc.Room):
    """Find and return the first available remote video track in the room."""
    for participant_id, participant in room.remote_participants.items():
        for track_id, track_publication in participant.track_publications.items():
            if track_publication.track and isinstance(
                track_publication.track, rtc.RemoteVideoTrack
            ):
                logger.info(
                    f"Found video track {track_publication.track.sid} "
                    f"from participant {participant_id}"
                )
                return track_publication.track
    raise ValueError("No remote video track found in the room")


async def get_latest_image(room: rtc.Room):
    """Capture and return a single frame from the video track."""
    video_stream = None
    try:
        video_track = await get_video_track(room)
        video_stream = rtc.VideoStream(video_track)
        async for event in video_stream:
            logger.debug("Captured latest video frame")
            return event.frame
    except Exception as e:
        logger.error(f"Failed to get latest image: {e}")
        return None
    finally:
        if video_stream:
            await video_stream.aclose()


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    async def before_llm_cb(assistant: VoicePipelineAgent, chat_ctx: llm.ChatContext):
        """
        Callback that runs right before the LLM generates a response.
        Captures the current video frame and adds it to the conversation context.
        """
        try:
            if not hasattr(assistant, '_room'):
                logger.warning("Room not available in assistant")
                return

            latest_image = await get_latest_image(assistant._room)
            if latest_image:
                image_content = [ChatImage(image=latest_image)]
                chat_ctx.messages.append(ChatMessage(role="user", content=image_content))
                logger.debug("Added latest frame to conversation context")
            else:
                logger.warning("No image captured from video stream")
        except Exception as e:
            logger.error(f"Error in before_llm_cb: {e}")

    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant created by LiveKit that can both see and hear. "
            "You should use short and concise responses, avoiding unpronounceable punctuation. "
            "When you see an image in our conversation, naturally incorporate what you see "
            "into your response. Keep visual descriptions brief but informative."
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.SUBSCRIBE_ALL)

    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
        chat_ctx=initial_ctx,
        before_llm_cb=before_llm_cb
    )

    agent.start(ctx.room, participant)

    await agent.say("Hey, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )

Testing your agent

Start your agent and test the following:

python3 agent.py dev

  • Test Voice Interaction: Speak into your microphone and see the chatbot respond.

  • Test Vision Capability: Ask the chatbot to identify objects through your video camera stream.

You should observe the following logs in your console:

Output

2024-12-30 08:32:56,167 - DEBUG asyncio - Using selector: EpollSelector
2024-12-30 08:32:56,168 - DEV livekit.agents - Watching /root/do-voice-vision-bot
2024-12-30 08:32:56,774 - DEBUG asyncio - Using selector: EpollSelector
2024-12-30 08:32:56,778 - INFO livekit.agents - starting worker {"version": "0.12.5", "rtc-version": "0.18.3"}
2024-12-30 08:32:56,819 - INFO livekit.agents - registered worker {"id": "AW_cjS8QXCEnFxy", "region": "US East", "protocol": 15, "node_id": "NC_OASHBURN1A_BvkfVkdYVEWo"}

Now, you will need to connect the app to the LiveKit room with a client that publishes both audio and video. The easiest way to do this is by using the hosted agent playground.

Connect your Project to Hosted Playground

Since this agent requires a frontend application to communicate with, you can use one of the example frontends in livekit-examples, create your own following one of the client quickstarts, or test immediately against one of the hosted Sandbox frontends.

In this example, you will use an existing hosted agent playground. Simply open https://agents-playground.livekit.io/ in your browser and connect your LiveKit Project. It should auto-populate with your Project.

Hosted AI agent deployed on GPU Droplet

How it works

With the above changes to your agent.py file, your assistant now:

  1. Connects to both audio and video streams.

  2. Listens for user speech as before.

  3. Just before generating each response:

  • Captures the current video frame.
  • Adds it to the conversation context.
  • Uses it to inform the response.

  4. Keeps the context clean by only adding frames when needed.

Conclusion

Congratulations! You have successfully built a real-time AI chatbot with vision and voice capabilities using OpenAI, LiveKit, and Deepgram on DigitalOcean GPU Droplets. This powerful combination enables efficient, scalable, and real-time interactions for your applications.

You can refer to LiveKit’s official documentation and its API reference for more details on building AI agents.
