Google launches updated Gemini models to enhance voice interactions


Overview

Google has updated its Gemini 2.5 Flash Native Audio model to improve the performance of voice agents. The improvements include more accurate function calling, better instruction following, and smoother dialogue flow. Users can also try live speech translation in a beta of the Google Translate app, available on Android in the US, Mexico, and India.

Key Points

  • Updated Gemini audio models enhance live agent performance and translation features.
  • Gemini 2.5 Flash Native Audio now boasts improved function calling and instruction adherence.
  • The update facilitates smoother conversations by remembering the context of previous interactions.
  • Live speech translation in Google Translate supports over 70 languages, maintaining intonation.
  • Developers can start creating voice agents using Gemini 2.5 Flash Native Audio on the Vertex AI platform.

Detailed Explanation

Google has improved Gemini's ability to understand and take part in spoken dialogue. The model now follows instructions more reliably, holds smoother conversations, and translates speech in real time. These changes help businesses with customer service and make communication easier between people who speak different languages. The live translation feature is available for testing in the Google Translate app.

Live Voice Agents

To support a range of applications, Gemini 2.5 Flash Native Audio has been improved in three main areas:

  • More Accurate Function Calls: The model is more reliable at triggering external functions and at working real-time information into its responses without disrupting the flow of conversation. It scored 71.5% on the ComplexFuncBench Audio benchmark.
  • Robust Instruction Adherence: The model handles complex instructions better, with a 90% adherence rate, so its responses carry out instructions more completely.
  • Smoother Conversations: Multi-turn conversation quality has improved, with better retrieval of context from earlier turns.
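
To make the function-calling point concrete, here is a minimal sketch of the application-side half of the loop: the model asks for a function by name, and the application runs it and returns the result. It is plain Python and does not use any Gemini SDK. The tool `get_order_status`, the `TOOLS` registry, and the call shape (a dict with `name` and `args`) are all hypothetical, chosen only for illustration.

```python
from typing import Any, Callable, Dict


# Hypothetical tool: a real agent would query an order system here.
def get_order_status(order_id: str) -> Dict[str, Any]:
    orders = {"A100": "shipped", "A200": "processing"}
    return {"order_id": order_id, "status": orders.get(order_id, "unknown")}


# Registry mapping tool names the model may request to local callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def dispatch(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run one model-requested function call and wrap the result.

    `call` is assumed to look like {"name": str, "args": dict}.
    """
    name = call.get("name")
    func = TOOLS.get(name)
    if func is None:
        # Report unknown tools instead of raising, so the dialogue can continue.
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        return {"name": name, "response": func(**call.get("args", {}))}
    except TypeError as exc:
        # The model supplied wrong or missing arguments.
        return {"name": name, "error": str(exc)}
```

Returning errors as data rather than raising is one possible design choice: it lets the agent explain the problem or retry instead of dropping the conversation.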

Customer Feedback

Google Cloud customers are using Gemini's native audio capabilities to achieve real business results, from processing mortgage applications to engaging with clients.

  • “Users often forget they are interacting with AI and thank the bot after lengthy interactions. The new Live API AI capabilities through Gemini help our merchants succeed.” – David Wurtz, VP of Product, Shopify
  • “By integrating Gemini 2.5 Flash Native Audio, we have significantly enhanced Mia's capabilities since May 2025, creating over 14,000 loans for partner brokers.” – Jason Bressler, CTO, United Wholesale Mortgage (UWM)
  • “Using Gemini 2.5 Flash Native Audio on Vertex AI, Newo.ai's AI receptionists achieve an exceptional level of conversational intelligence, identifying main interlocutors even in noisy environments, switching languages mid-conversation, and sounding naturally expressive.” – David Yang, Co-founder, Newo.ai

Live Speech Translation

Gemini now supports live speech translation designed for continuous listening and bidirectional conversations. In continuous listening mode, it translates multiple languages into one target language, allowing users to hear translations directly. In bidirectional mode, it translates between two languages in real time, automatically switching output depending on the speaker.

  • Language Coverage: Translates over 70 languages and 2000 language pairs, leveraging Gemini's world knowledge and multilingual capabilities.
  • Style Transfer: Preserves the nuances of human speech, maintaining intonation, pace, and pitch for natural-sounding translations.
  • Multilingual Input: Understands multiple languages simultaneously in one session, aiding in multilingual conversations without the need for setup.
  • Automatic Detection: Recognizes the spoken language and automatically initiates translation.
  • Noise Resilience: Filters background noise, ensuring comfortable conversations even in noisy environments.
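
The difference between the two modes described above comes down to how the output language is chosen once the spoken language has been detected. The sketch below models only that routing decision. Language detection and translation themselves happen inside the model and are not shown. The function names, the use of short language codes, and the choice to return `None` when no translation is needed are assumptions made for illustration.

```python
from typing import Optional, Tuple


def continuous_target(detected: str, target: str) -> Optional[str]:
    """Continuous listening: every detected language maps to one target.

    Returns None when the speech is already in the target language
    (an assumption of this sketch).
    """
    return None if detected == target else target


def bidirectional_target(detected: str, pair: Tuple[str, str]) -> Optional[str]:
    """Bidirectional mode: output the other language of the pair.

    Returns None if the detected language is not part of the pair.
    """
    first, second = pair
    if detected == first:
        return second
    if detected == second:
        return first
    return None
```

For example, with the pair `("en", "es")`, Spanish speech is routed to English output and English speech to Spanish output, which corresponds to the automatic switching by speaker described above.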

The beta version is available today in the Google Translate app for real-time translation through headphones on Android devices in the US, Mexico, and India. Support for iOS and additional regions will be available in the future.

Getting Started

You can start developing voice agents using Gemini 2.5 Flash Native Audio, now available on the Vertex AI platform. A preview version is also available through the Gemini API. Explore Google AI Studio to try it out.
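
As a starting point, a voice agent session typically needs three things: an audio output setting, a system instruction, and a list of tools. The sketch below assembles them into a plain dictionary. The key names (`response_modalities`, `system_instruction`, `tools`, `function_declarations`) are assumptions modeled on the Gemini Live API and should be checked against the current documentation. The helper `build_live_config` and the tool declaration are hypothetical, and no model ID is given because the article does not state one.

```python
from typing import Any, Dict, List


def build_live_config(system_instruction: str,
                      function_declarations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble a session config dict for an audio-out voice agent.

    Key names are assumptions modeled on the Gemini Live API; verify
    them against the current documentation before use.
    """
    config: Dict[str, Any] = {
        "response_modalities": ["AUDIO"],
        "system_instruction": system_instruction,
    }
    if function_declarations:
        # Tools are grouped under a single entry holding all declarations.
        config["tools"] = [{"function_declarations": function_declarations}]
    return config


# Hypothetical tool declaration in JSON-schema style.
ORDER_STATUS_TOOL = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

config = build_live_config(
    "You are a concise support agent. Confirm the order ID before looking it up.",
    [ORDER_STATUS_TOOL],
)
```

Keeping the configuration in a small builder function like this makes it easy to reuse the same instruction and tool set across Vertex AI and Gemini API clients, whichever SDK call ends up opening the session.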

Additionally, Gemini 2.5 Flash and 2.5 Pro models for text-to-speech conversion are available through the Gemini API in Google AI Studio. For more information, check the speech generation documentation, prompt guide, or Gemini API cookbook.
