RecipeAgent - Real-time Voice AI (Gordon Ramsay)

Project Overview

RecipeAgent is a real-time voice AI system that speaks in the persona of Gordon Ramsay while providing cooking assistance. It combines speech-to-text, a large language model, text-to-speech, and Retrieval-Augmented Generation (RAG) into an interactive cooking companion that answers questions from a cookbook PDF and performs measurement conversions.

Key Features

Real-time Voice Interaction: Live audio processing with speech-to-text and text-to-speech capabilities
Gordon Ramsay Persona: AI agent programmed with Gordon Ramsay's distinctive personality and communication style
RAG-powered Cookbook Assistant: Answers cooking questions by retrieving relevant information from a local PDF cookbook
Measurement Conversion Tool: Converts between different cooking measurements and units
LiveKit Cloud Integration: Real-time media processing and job dispatch
React Frontend: Modern web interface with live transcription display

Technical Architecture

The system is built with a microservices architecture featuring:

Python Agent (LiveKit Agents): Core AI logic, persona management, tools, RAG, STT/TTS, and Voice Activity Detection
Token Server (FastAPI): Secure JWT token issuance for frontend authentication
React Frontend: User interface with real-time audio and transcription display
LiveKit Cloud: Media SFU and job dispatch system
LlamaIndex RAG: Document processing and vector search for cookbook queries

Technologies Used

Python, LiveKit Agents, React, FastAPI, LlamaIndex, OpenAI GPT-4, Deepgram STT, Cartesia TTS, Vector Search, RAG

RAG Implementation

The system uses LlamaIndex for document processing and retrieval:

Processes PDF cookbooks using SimpleDirectoryReader
Uses OpenAI text-embedding-3-small for vector embeddings
Implements SentenceSplitter with chunk_size=1024 and chunk_overlap=240
Persists VectorStoreIndex locally with versioning to prevent data mismatches
Provides context-aware responses with page citations when available
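
One way the versioning of the persisted index could work is to store a fingerprint next to it and rebuild whenever the fingerprint changes. The sketch below is an assumption about the approach, not the project's actual code: the manifest file name, the function names, and the choice to hash the PDF bytes together with the settings are all hypothetical. Only the embedding model name and the chunk settings come from the list above.

```python
import hashlib
import json
from pathlib import Path

# Values taken from the list above; everything else here is hypothetical.
INDEX_SETTINGS = {
    "embed_model": "text-embedding-3-small",
    "chunk_size": 1024,
    "chunk_overlap": 240,
}
MANIFEST_NAME = "index_manifest.json"  # hypothetical file name


def index_fingerprint(pdf_bytes: bytes, settings: dict) -> str:
    """Hash the cookbook contents together with the indexing settings."""
    digest = hashlib.sha256()
    digest.update(pdf_bytes)
    # sort_keys keeps the hash independent of dict ordering.
    digest.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def index_is_current(persist_dir: Path, fingerprint: str) -> bool:
    """Return True if a persisted index exists and matches the fingerprint."""
    manifest = persist_dir / MANIFEST_NAME
    if not manifest.is_file():
        return False
    try:
        stored = json.loads(manifest.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable manifest: treat the index as stale
    return isinstance(stored, dict) and stored.get("fingerprint") == fingerprint


def record_fingerprint(persist_dir: Path, fingerprint: str) -> None:
    """Write the manifest after the index has been rebuilt and persisted."""
    persist_dir.mkdir(parents=True, exist_ok=True)
    (persist_dir / MANIFEST_NAME).write_text(
        json.dumps({"fingerprint": fingerprint}), encoding="utf-8"
    )
```

At startup the agent would compute the fingerprint, load the persisted index if index_is_current returns True, and otherwise rebuild it and call record_fingerprint. Changing the chunk size, the overlap, the embedding model, or the PDF itself then forces a rebuild instead of silently reusing vectors that no longer match.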

Workflow

1. User clicks "Start Call" in the frontend → token server issues JWT → frontend joins LiveKit room
2. LiveKit Cloud dispatches job to Python agent worker
3. Agent performs STT (Deepgram) → LLM reasoning (OpenAI) → optional tools → TTS (Cartesia/OpenAI)
4. Agent publishes audio and synchronized transcriptions to frontend
5. Frontend displays live transcripts and plays audio responses
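
Step 1 relies on the token server handing the frontend a signed JWT. The sketch below shows what such a token looks like using only the standard library. It assumes an HS256 signature over the API secret and a "video" grant claim with "room" and "roomJoin" fields, which is how LiveKit access tokens are commonly described. The function name and parameters are hypothetical, and a real token server would more likely call LiveKit's server SDK than hand-roll the encoding.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def issue_room_token(api_key, api_secret, identity, room, ttl_seconds=600, now=None):
    """Build an HS256 JWT that lets `identity` join `room` (illustrative only)."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,            # which API key issued the token
        "sub": identity,           # participant identity
        "nbf": now,                # not valid before
        "exp": now + ttl_seconds,  # short expiry limits the damage of a leak
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        api_secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

Keeping this logic on the server means the API secret never reaches the browser; the frontend only ever sees the short-lived token.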

Tools & Capabilities

query_cookbook(question): RAG-powered cookbook question answering with context retrieval
convert_measurements: Unit conversion tool; conversions between volume and weight assume a density of 1 g/ml (that of water)
Voice Activity Detection: Silero VAD for turn detection and audio processing
Live Transcription: Real-time text streams with topic 'lk.transcription'
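
As an illustration of how a conversion tool with a fixed 1 g/ml density might behave, here is a minimal sketch. The function name, the unit tables, and the set of supported units are assumptions made for the example (US customary values are used for the volume units); the project's actual tool may differ.

```python
# Millilitres per unit for common volume measures (US customary values).
VOLUME_ML = {
    "ml": 1.0, "l": 1000.0, "tsp": 4.92892,
    "tbsp": 14.7868, "cup": 236.588, "fl_oz": 29.5735,
}
# Grams per unit for common weight measures.
MASS_G = {"g": 1.0, "kg": 1000.0, "oz": 28.3495, "lb": 453.592}

DENSITY_G_PER_ML = 1.0  # the flat density assumption: everything weighs like water


def convert_measurement(amount: float, from_unit: str, to_unit: str) -> float:
    """Convert between volume and weight units via millilitres."""
    src, dst = from_unit.lower(), to_unit.lower()

    # Normalise the input to millilitres first.
    if src in VOLUME_ML:
        millilitres = amount * VOLUME_ML[src]
    elif src in MASS_G:
        millilitres = amount * MASS_G[src] / DENSITY_G_PER_ML
    else:
        raise ValueError(f"unknown unit: {from_unit}")

    # Then express the millilitres in the target unit.
    if dst in VOLUME_ML:
        return millilitres / VOLUME_ML[dst]
    if dst in MASS_G:
        return millilitres * DENSITY_G_PER_ML / MASS_G[dst]
    raise ValueError(f"unknown unit: {to_unit}")
```

For example, convert_measurement(1, "kg", "l") returns 1.0 under this assumption. The flat density is accurate for water and close for milk or stock, but it overstates the weight of lighter ingredients such as flour, so a per-ingredient density table would be the natural refinement.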