Salesforce AI Unveils VoiceAgentRAG: Boosting Voice RAG Retrieval Speed
In the realm of voice AI, the distinction between a helpful assistant and an awkward interaction is measured in milliseconds. While text-based Retrieval-Augmented Generation (RAG) systems can afford a few seconds of 'thinking' time, voice agents must respond within a 200ms budget to maintain a natural conversational flow. Standard production vector database queries typically add 50-300ms of network latency, effectively consuming the entire budget before a large language model (LLM) even begins generating a response.
The Salesforce AI research team has introduced VoiceAgentRAG, an open-source dual-agent architecture designed to bypass this retrieval bottleneck by decoupling document fetching from response generation. VoiceAgentRAG operates as a memory router that orchestrates two concurrent agents via an asynchronous event bus.
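The dual-agent, event-bus design described above can be sketched with Python's `asyncio`. This is a minimal illustration, not the VoiceAgentRAG implementation: the agent names, the use of `asyncio.Queue` as the event bus, and the placeholder payloads are all assumptions for demonstration.

```python
import asyncio

async def foreground_agent(bus: asyncio.Queue) -> str:
    """Fast Talker: answers the current query and publishes the turn to the bus."""
    await bus.put("user asked about pricing")   # hypothetical conversation event
    return "answer"

async def background_agent(bus: asyncio.Queue, results: list) -> None:
    """Slow Thinker: consumes conversation events and pre-fetches context."""
    event = await bus.get()
    results.append(f"prefetched context for: {event}")

async def main() -> list:
    bus: asyncio.Queue = asyncio.Queue()        # the asynchronous event bus
    results: list = []
    answer, _ = await asyncio.gather(           # both agents run concurrently
        foreground_agent(bus),
        background_agent(bus, results),
    )
    results.append(answer)
    return results

out = asyncio.run(main())
```

The key property is that the foreground agent never awaits the background agent; the two only share state through the bus and a common cache.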
The Fast Talker (Foreground Agent) handles the critical latency path. For every user query, it first checks a local, in-memory Semantic Cache. If the required context is present, the lookup takes approximately 0.35ms. On a cache miss, it falls back to the remote vector database and immediately caches the results for future queries.
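The Fast Talker's cache-first control flow amounts to a lookup with a write-back fallback. The sketch below uses an exact-match dictionary for brevity, whereas the real system performs a semantic lookup; `remote_vector_search` and all names here are illustrative stand-ins.

```python
local_cache: dict[str, list[str]] = {}

def remote_vector_search(query: str) -> list[str]:
    """Stand-in for a remote vector-DB query (50-300 ms of network latency)."""
    return [f"chunk about {query}"]

def retrieve(query: str) -> list[str]:
    """Cache-first retrieval: sub-millisecond local hit, remote fallback with write-back."""
    if query in local_cache:                  # hit: stays within the 200 ms budget
        return local_cache[query]
    chunks = remote_vector_search(query)      # miss: pay the network cost once
    local_cache[query] = chunks               # write back for future queries
    return chunks

first = retrieve("pricing tiers")    # miss: fetched remotely, then cached
second = retrieve("pricing tiers")   # hit: served from the local cache
```

The write-back step is what turns a one-time network cost into a reusable local asset.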
The Slow Thinker (Background Agent) runs as a background task, continuously monitoring the conversation stream. It uses a sliding window of the last six conversation turns to predict 3-5 likely follow-up topics. It then pre-fetches relevant document chunks from the remote vector store into the local cache before the user even speaks their next question.
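The sliding window and topic prediction can be sketched as follows. The keyword-matching predictor here is a toy stand-in for whatever model the real system uses; only the six-turn window size comes from the article.

```python
from collections import deque

WINDOW = 6  # sliding window of the last six conversation turns

history = deque(maxlen=WINDOW)   # old turns fall off automatically

def predict_topics(turns) -> list[str]:
    """Toy stand-in for the real predictor, which proposes 3-5 follow-up topics."""
    text = " ".join(turns).lower()
    candidates = {
        "pricing": ("price", "pricing", "cost"),
        "feature comparison": ("feature", "compare", "versus"),
        "support plans": ("support", "sla"),
    }
    return [topic for topic, keywords in candidates.items()
            if any(kw in text for kw in keywords)]

history.append("What does the Enterprise plan cost?")
history.append("How do its features compare to Professional?")
predicted = predict_topics(history)   # topics to pre-fetch into the local cache
```

Each predicted topic would then drive a background query against the remote vector store, with the resulting chunks written into the local cache.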
The system’s efficiency hinges on a specialized semantic cache implemented with an in-memory FAISS IndexFlatIP index. Unlike passive caches that key entries on the query that produced them, VoiceAgentRAG indexes entries by their own document embeddings. This allows the cache to perform a proper semantic search over its contents, ensuring relevance even when the user’s phrasing differs from the system’s predictions.
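The behavior of such a cache can be illustrated with plain NumPy: FAISS `IndexFlatIP` performs the same exact inner-product search in optimized native code. The class name, threshold value, and example embeddings below are assumptions for illustration.

```python
from typing import Optional
import numpy as np

class SemanticCache:
    """Exact inner-product search over cached document embeddings
    (what FAISS IndexFlatIP provides, here emulated in NumPy)."""

    def __init__(self, dim: int, threshold: float = 0.8):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.documents: list = []
        self.threshold = threshold   # minimum similarity to count as a hit

    def add(self, embedding: np.ndarray, document: str) -> None:
        # Normalize so the inner product equals cosine similarity.
        v = embedding / np.linalg.norm(embedding)
        self.vectors = np.vstack([self.vectors, v.astype(np.float32)])
        self.documents.append(document)

    def search(self, query_embedding: np.ndarray) -> Optional[str]:
        if not self.documents:
            return None
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.vectors @ q            # inner products against all entries
        best = int(np.argmax(scores))
        return self.documents[best] if scores[best] >= self.threshold else None

cache = SemanticCache(dim=3)
cache.add(np.array([1.0, 0.0, 0.0]), "doc about pricing")
hit = cache.search(np.array([0.9, 0.1, 0.0]))   # nearby embedding -> cache hit
```

Because entries are indexed by document embeddings rather than the literal query text, a differently worded follow-up question can still land on pre-fetched content.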
The research team evaluated the system using Qdrant Cloud as a remote vector database across 200 queries and 10 conversation scenarios, achieving an overall cache hit rate of 75% and a retrieval speedup of 316x. The architecture is most effective in topically coherent scenarios, such as feature comparisons, where a 95% hit rate was achieved.