Salesforce AI Unveils VoiceAgentRAG: Boosting Voice RAG Retrieval Speed

In the realm of voice AI, the distinction between a helpful assistant and an awkward interaction is measured in milliseconds. While text-based Retrieval-Augmented Generation (RAG) systems can afford a few seconds of 'thinking' time, voice agents must respond within a 200ms budget to maintain a natural conversational flow. Standard production vector database queries typically add 50-300ms of network latency, effectively consuming the entire budget before a large language model (LLM) even begins generating a response.

The Salesforce AI research team has introduced VoiceAgentRAG, an open-source dual-agent architecture designed to bypass this retrieval bottleneck by decoupling document fetching from response generation. VoiceAgentRAG operates as a memory router that orchestrates two concurrent agents via an asynchronous event bus.

The Fast Talker (Foreground Agent) handles the critical latency path. For every user query, it first checks a local, in-memory Semantic Cache. If the required context is present, the lookup takes approximately 0.35ms. On a cache miss, it falls back to the remote vector database and immediately caches the results for future queries.
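The routing logic described above can be sketched as a small asyncio program. This is a minimal illustration, not the actual VoiceAgentRAG API: the `MemoryRouter` class, its method names, and the dictionary cache are all assumptions, with an `asyncio.Queue` standing in for the event bus and a stub standing in for the remote vector database.

```python
import asyncio

class MemoryRouter:
    """Illustrative sketch of the dual-agent memory router."""

    def __init__(self):
        self.bus = asyncio.Queue()   # asynchronous event bus between agents
        self.cache = {}              # shared local cache (a dict here; FAISS in the real system)

    async def fast_talker(self, query: str) -> str:
        """Foreground agent: the critical latency path."""
        await self.bus.put(("user_turn", query))   # notify the background agent
        if query in self.cache:                    # local hit: ~sub-millisecond
            return self.cache[query]
        context = await self.fetch_remote(query)   # miss: 50-300 ms network fallback
        self.cache[query] = context                # cache for future queries
        return context

    async def slow_thinker(self):
        """Background agent: pre-warm the cache with predicted topics."""
        while True:
            event, query = await self.bus.get()
            if event == "user_turn":
                for topic in self.predict_topics(query):
                    if topic not in self.cache:
                        self.cache[topic] = await self.fetch_remote(topic)

    async def fetch_remote(self, key: str) -> str:
        await asyncio.sleep(0)  # stand-in for a remote vector DB query
        return f"context for {key}"

    def predict_topics(self, query: str) -> list[str]:
        return [f"{query} follow-up"]  # stand-in for the LLM topic predictor
```

On a hit, the foreground agent never touches the network; on a miss, it pays the remote round-trip once and amortizes it across the rest of the conversation.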

The Slow Thinker (Background Agent) runs as a background task, continuously monitoring the conversation stream. It uses a sliding window of the last six conversation turns to predict 3-5 likely follow-up topics. It then pre-fetches relevant document chunks from the remote vector store into the local cache before the user even speaks their next question.
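The sliding-window prediction loop might look like the following sketch. The class name, the `prefetch` callback, and the one-topic stub predictor are illustrative assumptions; only the six-turn window comes from the description above, and in the real system the predictor would be an LLM proposing 3-5 follow-up topics.

```python
from collections import deque

WINDOW_TURNS = 6  # sliding window of the last six conversation turns

class SlowThinker:
    """Illustrative background agent that pre-warms the local cache."""

    def __init__(self, prefetch):
        self.window = deque(maxlen=WINDOW_TURNS)  # old turns fall off automatically
        self.prefetch = prefetch                  # callable that pulls chunks into the cache

    def observe(self, turn: str):
        """Called on every conversation turn."""
        self.window.append(turn)
        for topic in self.predict_topics():
            self.prefetch(topic)  # fetch relevant chunks before the next question

    def predict_topics(self) -> list[str]:
        # Stand-in for an LLM that reads the window and proposes
        # 3-5 likely follow-up topics.
        recent = list(self.window)
        return [f"follow-up to: {recent[-1]}"] if recent else []
```

Because prediction runs off the critical path, a wrong guess costs only a wasted background fetch, while a right guess turns the next retrieval into a local cache hit.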

The system’s efficiency hinges on a specialized semantic cache implemented with an in-memory FAISS IndexFlatIP. Unlike passive caches that index entries by query meaning, VoiceAgentRAG indexes entries by their own document embeddings. This allows the cache to perform a proper semantic search over its contents, ensuring relevance even if the user’s phrasing differs from the system’s predictions.

The research team evaluated the system using Qdrant Cloud as a remote vector database across 200 queries and 10 conversation scenarios, achieving an overall cache hit rate of 75% and a retrieval speedup of 316x. The architecture is most effective in topically coherent scenarios, such as feature comparisons, where a 95% hit rate was achieved.
