Google Launches Gemini Embedding 2: First Natively Multimodal Embedding Model Supporting Text, Images, Video, and Audio
Key Takeaways
- Gemini Embedding 2 is Google's first natively multimodal embedding model, unifying text, images, videos, audio, and documents in a single embedding space
- The model supports interleaved multimodal inputs, capturing complex relationships between different media types in a single request
- Available now in public preview through the Gemini API and Vertex AI, with integrations already available in LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search
Summary
Google has announced Gemini Embedding 2, its first natively multimodal embedding model now available in public preview via the Gemini API and Vertex AI. The model maps text, images, videos, audio, and documents into a single unified embedding space, enabling semantic search and retrieval across multiple media types in over 100 languages. This represents a significant expansion from previous text-only embedding models, allowing developers to process interleaved inputs (e.g., image + text in a single request) and capture nuanced relationships between different media types.
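Because every modality lands in the same vector space, cross-modal retrieval reduces to nearest-neighbor search over embeddings. The sketch below illustrates the idea with cosine similarity over toy vectors; the item names, the 3-dimensional vectors, and the `search` helper are illustrative stand-ins, not part of Google's API, and in practice the vectors would come from the embedding model and a vector database would do the ranking.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, corpus):
    # corpus: list of (item_id, embedding) pairs. In a unified embedding
    # space, the items can be text, images, video, or audio alike.
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked]

# Toy 3-dimensional vectors stand in for real model output.
corpus = [
    ("video_clip", [0.9, 0.1, 0.0]),
    ("podcast",    [0.0, 1.0, 0.2]),
    ("photo",      [0.7, 0.2, 0.1]),
]
print(search([1.0, 0.0, 0.0], corpus))  # → ['video_clip', 'photo', 'podcast']
```

The same loop works regardless of which modality produced the query embedding, which is what makes a single shared space attractive compared to stitching together per-modality models.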
Gemini Embedding 2 supports comprehensive input across multiple modalities: text inputs up to 8,192 tokens, up to 6 images per request, videos up to 120 seconds long, native audio processing without transcription, and PDFs up to 6 pages. The model incorporates Matryoshka Representation Learning for flexible output dimensions, allowing developers to scale down from the default 3,072 dimensions to trade some quality for lower latency and storage costs. According to Google, the model establishes state-of-the-art performance on multimodal tasks, outperforming leading competitors on text, image, and video benchmarks while introducing strong speech capabilities.
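Matryoshka Representation Learning trains the model so that a prefix of the full vector is itself a usable embedding. A minimal sketch of the client-side step, assuming the common convention of truncating and then L2-renormalizing (the function name and the toy vector are illustrative, not from Google's SDK):

```python
import math

def truncate_embedding(embedding, dim):
    # Matryoshka-style shortening: keep the first `dim` components of the
    # full vector, then L2-renormalize so cosine comparisons stay meaningful.
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # stand-in for a 3,072-dimensional vector
short = truncate_embedding(full, 2)
# `short` has 2 components and unit L2 norm in the reduced space
```

Storing, say, 768 of the 3,072 dimensions cuts index size roughly fourfold, which is the performance/storage trade-off the announcement refers to.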
- Includes native audio processing without transcription and flexible output dimensions for balancing quality with storage costs
- Demonstrates state-of-the-art performance across text, image, and video tasks with support for over 100 languages
Editorial Opinion
Gemini Embedding 2 represents a meaningful leap forward in multimodal AI capabilities, addressing a real developer need by consolidating diverse data types into a single semantic space. The native support for audio without transcription and true interleaved input processing set it apart from existing solutions. However, the model's real-world impact will depend on pricing, latency, and whether it meaningfully outperforms simpler pipelines that chain specialized models, metrics not fully detailed in this announcement.



