
The era of siloed data search is over; multimodal AI is here. For too long, our ability to extract knowledge from vast digital archives has been hampered by the inherent limitations of single-modality search. Text documents could be indexed and queried, images could be searched by tags or basic OCR, but bridging the gap between these distinct data types was a developer’s nightmare, demanding intricate, custom-built RAG (Retrieval-Augmented Generation) pipelines. This fragmentation led to incomplete answers, missed insights, and a frustratingly manual effort to synthesize information scattered across formats.
Then, on May 5, 2026, Google announced a pivotal shift with the Gemini API’s enhanced File Search capabilities, now embracing full multimodality. This isn’t just an incremental update; it’s a declaration of war on data silos. By seamlessly integrating text and image understanding into a managed search service, Gemini is dramatically simplifying RAG development, particularly for applications that need to reason across both visual and textual information.
The core innovation lies in Gemini’s ability to generate unified embeddings for both text and images. This is powered by the Gemini Embedding 2 model, a significant leap from its predecessor, gemini-embedding-001, which was text-only. Gemini Embedding 2 can process and understand the semantic content of images and correlate it with textual descriptions or embedded text within those images. This means a query like “Show me all project proposals that mention the ‘Blue Heron’ initiative, even if the initiative is only depicted in a flowchart within the document” is now within reach without writing complex custom logic.
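To make the idea of a unified embedding space concrete, here is a toy sketch. The vectors below are hand-written stand-ins, not real model outputs; in practice they would come from a multimodal embedding model such as the Gemini Embedding 2 model described above. The point is only that once text and images share one vector space, a single cosine-similarity ranking covers both:

```python
import math

# Toy stand-ins for unified embeddings. The numbers are invented for
# illustration; a real pipeline would get these vectors from a
# multimodal embedding model.
EMBEDDINGS = {
    "text: Blue Heron initiative proposal": [0.9, 0.1, 0.3],
    "image: flowchart mentioning Blue Heron": [0.8, 0.2, 0.4],
    "image: unrelated vacation photo": [0.1, 0.9, 0.0],
}

def cosine_similarity(a, b):
    """Standard cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_key, k=2):
    """Rank every other item (text or image) against the query vector.

    Because all modalities live in one space, no per-modality logic
    is needed -- the same ranking covers documents and images alike.
    """
    query_vec = EMBEDDINGS[query_key]
    candidates = [key for key in EMBEDDINGS if key != query_key]
    return sorted(
        candidates,
        key=lambda key: cosine_similarity(query_vec, EMBEDDINGS[key]),
        reverse=True,
    )[:k]

print(search("text: Blue Heron initiative proposal"))
```

With these toy vectors, the flowchart image ranks above the unrelated photo because its vector points in nearly the same direction as the text query's vector, which is exactly the property that lets a text query surface an initiative "only depicted in a flowchart."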
Google’s managed service abstracts away much of the heavy lifting traditionally associated with RAG. For developers, this translates to conveniences such as a chunking_config with options like max_tokens_per_chunk, allowing fine-tuning of retrieval granularity.
Let’s look at a simplified Python snippet illustrating the creation of a multimodal store and querying:
from google.generativeai.client import get_default_generative_model_client

# Initialize the client with the multimodal embedding model
client = get_default_generative_model_client(
    client_options={"api_key": "YOUR_API_KEY"}
)

# Create a new search store
store_name = "my-multimodal-knowledge-base"
response = client.create_search_store(
    display_name=store_name,
    embedding_config={"embedding_model": "models/gemini-embedding-2"},  # Crucial for multimodality
)
print(f"Created store: {response.name}")

# Upload files to the store
file_path_text = "path/to/your/document.pdf"
file_path_image = "path/to/your/diagram.png"

with open(file_path_text, "rb") as f:
    response_text = client.upload_search_document(
        search_store_id=response.name,
        content=f.read(),
        mime_type="application/pdf",
        display_name="Project Proposal v3",
    )
print(f"Uploaded text document: {response_text.name}")

with open(file_path_image, "rb") as f:
    response_image = client.upload_search_document(
        search_store_id=response.name,
        content=f.read(),
        mime_type="image/png",
        display_name="System Architecture Diagram",
    )
print(f"Uploaded image: {response_image.name}")

# Query the store with multimodal understanding
query_text = (
    "What are the proposed KPIs for the 'Phoenix Project' based on the latest "
    "proposal, and how are they visually represented in the architecture diagram?"
)
response_query = client.generate_content(
    query_text,
    tool_config={"fileSearch": {"search_store": response.name}},
)
print(response_query.text)
This example showcases how succinctly you can initiate a multimodal RAG process. The tool_config={"fileSearch": {"search_store": response.name}} is where the magic happens, directing the generate_content call to leverage the multimodal search store.
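As for the max_tokens_per_chunk option mentioned earlier: the managed service handles chunking internally, but a rough sketch helps show what the knob controls. The chunker below is a simplified, hypothetical approximation (whitespace splitting stands in for real model tokenization, and production chunkers typically also overlap chunks):

```python
def chunk_text(text, max_tokens_per_chunk=5):
    """Greedy fixed-size chunker: split text into whitespace 'tokens'
    and group them into chunks of at most max_tokens_per_chunk tokens.

    This is only an illustration of what a max_tokens_per_chunk-style
    setting governs; real services use model tokenizers and smarter
    boundary detection (sentences, sections, overlap).
    """
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens_per_chunk])
        for i in range(0, len(tokens), max_tokens_per_chunk)
    ]

doc = "The Phoenix Project proposal defines three KPIs covering latency cost and adoption"
for chunk in chunk_text(doc, max_tokens_per_chunk=4):
    print(chunk)
```

The trade-off this exposes: smaller chunks give finer-grained, more precise retrieval hits, while larger chunks carry more surrounding context into the generation step.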
The sentiment surrounding Gemini’s multimodal file search has been largely positive, with many developers echoing the sentiment that it “kills multimodal RAG” by drastically reducing complexity. The promise of a unified, managed solution for integrating text and image search into LLM applications is incredibly compelling. It democratizes access to sophisticated RAG capabilities, allowing smaller teams and individual developers to build richer, more intelligent applications without the prohibitive overhead of building and maintaining custom infrastructure.
However, the ecosystem is not without its nuances and criticisms. Transparency around API usage costs, particularly for embedding generation and storage, remains a recurring concern. While the managed service simplifies development, it also introduces a layer of abstraction that can obscure the underlying economics. Some critics also feel that Google’s managed service, while powerful, might lag behind the granular control offered by more established, albeit complex, custom RAG pipelines built on platforms like Pinecone or Supabase.
Despite its impressive advancements, it’s crucial to understand the limitations and identify scenarios where Gemini’s multimodal file search might not be the optimal choice.
When to Reconsider:
- You need granular control over chunking, ranking, or index internals beyond what the managed configuration exposes.
- You have enterprise-grade requirements around document management, auditability, or where and how data is stored.
- Your application depends on advanced visual analytics that go deeper than cross-modal retrieval.
- Embedding and storage costs at your scale are hard to predict, and cost transparency is a priority.
Gemini API’s multimodal file search represents a significant and pragmatic leap forward for developers building RAG-powered applications. It dramatically lowers the barrier to entry for creating intelligent search systems that can understand and reason across both text and images. The managed service, coupled with verifiable citations, offers an attractive blend of ease-of-use, cost-effectiveness (compared to building from scratch), and enhanced trustworthiness.
For rapid prototyping, building internal knowledge bases, customer support bots that can understand screenshots, or content analysis applications that benefit from cross-modal understanding, Gemini’s File Search is a compelling and powerful tool. It empowers developers to move beyond the limitations of single-modality search and unlock deeper insights from their data. However, it is not a panacea. For highly specialized, deeply customized, or enterprise-grade RAG implementations with strict requirements for control, specific document management features, or advanced visual analytics, a more bespoke approach may still be necessary. Nevertheless, for a vast majority of use cases, this advancement signals a new, more intelligent era of file search.