Salesforce AI Unveils VoiceAgentRAG: Boosting Voice RAG Retrieval Speed
In the realm of voice AI, the distinction between a helpful assistant and an awkward interaction is measured in milliseconds. While text-based Retrieval-Augmented Generation (RAG) systems can afford a few seconds of 'thinking' time, voice agents must respond within a 200ms budget to maintain a natural conversational flow. Standard production vector database queries typically add 50-300ms of network latency, effectively consuming the entire budget before a large language model (LLM) even begins generating a response.
The Salesforce AI research team has introduced VoiceAgentRAG, an open-source dual-agent architecture designed to bypass this retrieval bottleneck by decoupling document fetching from response generation. VoiceAgentRAG operates as a memory router that orchestrates two concurrent agents via an asynchronous event bus.
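The dual-agent layout can be pictured as a router that publishes each user turn onto an asynchronous event bus shared by both agents. The sketch below is illustrative only, assuming `asyncio` as the event bus and stub logic for both agents; none of the names reflect the actual VoiceAgentRAG API.

```python
import asyncio

# Hypothetical sketch: a memory router publishes each user turn onto an
# asyncio.Queue (the "event bus"). The foreground agent answers
# immediately from a shared cache; the background agent consumes the
# same events to pre-warm that cache for future turns.

class MemoryRouter:
    def __init__(self):
        self.bus = asyncio.Queue()        # asynchronous event bus
        self.cache = {}                   # shared semantic cache (stub)

    async def handle_turn(self, query: str) -> str:
        await self.bus.put(query)         # notify the background agent
        # Fast path: serve from cache if possible, else fall back.
        context = self.cache.get(query, "remote-fetch:" + query)
        return f"answer({context})"

    async def background_agent(self):
        # Slow path: watch the conversation stream and pre-warm the cache.
        while True:
            turn = await self.bus.get()
            # A real implementation would predict follow-up topics here.
            self.cache[f"{turn} follow-up"] = f"prefetched:{turn}"
            self.bus.task_done()

async def demo():
    router = MemoryRouter()
    task = asyncio.create_task(router.background_agent())
    first = await router.handle_turn("pricing tiers")
    await router.bus.join()               # let the prefetch land
    second = await router.handle_turn("pricing tiers follow-up")
    task.cancel()
    return first, second
```

Running `demo()` shows the pattern: the first turn misses and falls back, while the predicted follow-up is served from the pre-warmed cache.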
The Fast Talker (Foreground Agent) handles the critical latency path. For every user query, it first checks a local, in-memory Semantic Cache. If the required context is present, the lookup takes approximately 0.35ms. On a cache miss, it falls back to the remote vector database and immediately caches the results for future queries.
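The fast path described above reduces to a cache-first lookup with a write-back on miss. This is a minimal sketch under simplifying assumptions: `remote_search` is a placeholder for a real vector-database client, and an exact-match dictionary stands in for the semantic cache covered later.

```python
# Hedged sketch of the foreground agent's fast path: check the local
# in-memory cache first, fall back to the remote vector store on a
# miss, and write the fetched chunks back for the next query.

def remote_search(query: str) -> list[str]:
    # Placeholder for the 50-300 ms network round trip to the vector DB.
    return [f"chunk about {query}"]

class FastTalker:
    def __init__(self):
        self.cache: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    def retrieve(self, query: str) -> list[str]:
        if query in self.cache:           # sub-millisecond local lookup
            self.hits += 1
            return self.cache[query]
        self.misses += 1                  # cache miss: remote fallback
        chunks = remote_search(query)
        self.cache[query] = chunks        # warm the cache for next time
        return chunks
```

Asking the same question twice yields one miss (remote fetch) followed by one hit, which is exactly the behavior the latency budget depends on.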
The Slow Thinker (Background Agent) continuously monitors the conversation stream. It uses a sliding window of the last six conversation turns to predict 3-5 likely follow-up topics, then pre-fetches relevant document chunks from the remote vector store into the local cache before the user even speaks their next question.
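The prefetch loop can be sketched as follows. This is an assumption-laden illustration: `predict_followups` is a stub where the real system would call an LLM, and `fetch` stands in for the remote vector-store lookup.

```python
from collections import deque

# Illustrative sketch of the background agent: a six-turn sliding
# window feeds a (stubbed) topic predictor, and predicted topics are
# pre-fetched into the local cache before they are asked.

WINDOW = 6

def predict_followups(turns: list[str], k: int = 3) -> list[str]:
    # Stub: derive up to k candidate topics from the most recent turns.
    return [f"{t} details" for t in turns[-k:]]

class SlowThinker:
    def __init__(self, fetch):
        self.window = deque(maxlen=WINDOW)   # last six conversation turns
        self.fetch = fetch                   # remote vector-store lookup
        self.cache = {}

    def observe(self, turn: str):
        self.window.append(turn)
        for topic in predict_followups(list(self.window)):
            if topic not in self.cache:      # pre-fetch before it's asked
                self.cache[topic] = self.fetch(topic)
```

The `deque(maxlen=6)` gives the sliding window for free: older turns fall out automatically as new ones arrive.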
The system’s efficiency hinges on a specialized semantic cache implemented with an in-memory FAISS IndexFlatIP. Unlike passive caches keyed by the literal query, VoiceAgentRAG indexes entries by their document embeddings. This allows the cache to perform a proper semantic search over its contents, ensuring relevance even if the user’s phrasing differs from the system’s predictions.
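The cache's behavior can be demonstrated with a small sketch. The article specifies FAISS IndexFlatIP over document embeddings; here plain-Python dot products stand in for FAISS so the example stays self-contained, and the toy `embed` function and similarity threshold are assumptions, not the system's actual choices.

```python
# Minimal semantic-cache sketch: entries are indexed by (normalised)
# document embeddings, and lookups run an inner-product search over
# them, mimicking FAISS IndexFlatIP on unit vectors.

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding, L2-normalised so that the inner
    # product behaves like cosine similarity.
    counts = [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]
    norm = sum(v * v for v in counts) ** 0.5 or 1.0
    return [v / norm for v in counts]

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.vectors: list[list[float]] = []   # document embeddings
        self.chunks: list[str] = []
        self.threshold = threshold

    def add(self, chunk: str):
        self.vectors.append(embed(chunk))
        self.chunks.append(chunk)

    def search(self, query: str):
        # Inner-product search over cached *document* embeddings, so a
        # differently phrased query can still hit a relevant entry.
        q = embed(query)
        best, score = None, -1.0
        for vec, chunk in zip(self.vectors, self.chunks):
            s = sum(a * b for a, b in zip(q, vec))
            if s > score:
                best, score = chunk, s
        return best if score >= self.threshold else None
```

Because the index holds document vectors rather than past query strings, a hit depends on the query being close to a cached chunk in embedding space, not on exact wording.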
The research team evaluated the system using Qdrant Cloud as a remote vector database across 200 queries and 10 conversation scenarios, achieving an overall cache hit rate of 75% and a retrieval speedup of 316x. The architecture is most effective in topically coherent scenarios, such as feature comparisons, where a 95% hit rate was achieved.