Hey there, tech enthusiasts! Welcome to our cozy corner of the digital universe. Today, we’re exploring vector databases—those unsung heroes that quietly power our favorite AI applications. So, grab your virtual coffee, and let’s dive in!
Vector databases form the backbone of machine learning and artificial intelligence applications. Unlike traditional databases that deal with structured data, these databases store and manage large amounts of high-dimensional data in a vector embedding format, enabling efficient storage, retrieval, and processing of complex information.
But what’s a vector, you ask? In the context of a vector database, a vector is an embedding-based representation of an object such as an image, an audio clip, or a piece of text, commonly used in machine learning tasks. These representations are high-dimensional numerical arrays that capture the essential features or characteristics of the objects they describe. Take an image of a cat, for example: its vector representation might capture features such as the shape of the ears, the color of the fur and eyes, the pattern on the coat, and the size of the whiskers. These embeddings are typically generated by deep neural network models, such as convolutional neural networks (CNNs) for images, or word2vec and BERT for text.
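To make this concrete, here's a toy sketch of the idea. In practice the numbers come from a trained model (a CNN, word2vec, BERT, etc.); the tiny hand-made vectors below are only stand-ins to show that each dimension encodes some feature, and that similar objects end up close together:

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" -- a real model would produce
# hundreds or thousands of dimensions, learned rather than hand-picked.
#                        ear shape, fur darkness, eye color, whisker size
cat_photo_1 = np.array([0.91, 0.12, 0.45, 0.80])
cat_photo_2 = np.array([0.88, 0.15, 0.40, 0.77])  # a similar-looking cat
dog_photo   = np.array([0.10, 0.95, 0.60, 0.20])  # a very different animal

# Similar objects sit close together in the embedding space:
print(np.linalg.norm(cat_photo_1 - cat_photo_2))  # small distance
print(np.linalg.norm(cat_photo_1 - dog_photo))    # much larger distance
```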
So far, we've learned that vector databases store high-dimensional embeddings of objects. But how are they useful? Given a query vector—say, the embedding of an image or an audio clip—a vector database can quickly retrieve the stored embeddings most similar to it. Vector DBs typically use approximate k-nearest-neighbor (ANN) algorithms, built on similarity measures such as cosine similarity or Euclidean distance, to find these matches.
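Here is a minimal sketch of the core operation: a brute-force nearest-neighbor search using cosine similarity. Real vector databases replace this linear scan with approximate indexes (such as HNSW graphs) to stay fast at billions of vectors; the toy database and query below are made up for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_search(query, vectors, k=2):
    """Return indices of the k stored vectors most similar to the query.
    A real vector DB uses an approximate index instead of this O(n) scan."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return np.argsort(scores)[::-1][:k]

# A toy "database" of 4-dimensional embeddings.
database = np.array([
    [0.9, 0.1, 0.0, 0.4],
    [0.1, 0.9, 0.8, 0.0],
    [0.8, 0.2, 0.1, 0.5],
    [0.0, 1.0, 0.7, 0.1],
])

query = np.array([0.85, 0.15, 0.05, 0.45])
print(knn_search(query, database, k=2))  # indices of the two closest embeddings
```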
Industrial examples of vector DBs for image search include Amazon's use of the OpenSearch service [1]. Amazon uses the OpenSearch Vector Search Collection as a vector database for image search, letting users query the search engine with rich media such as images. The implementation is similar to semantic search: deep learning models, such as ResNets, convert images into vector embeddings. OpenSearch offers specialized indexes for efficient vector similarity search and a scalable engine that handles up to billions of vectors at low latency.
Another example is Spotify's Voyager, released in December 2023. Voyager is an open-source approximate nearest-neighbor search library that enables similarity search over in-memory collections of vectors, succeeding Annoy as Spotify's recommended nearest-neighbor library for production use. It helps Spotify recommend new songs to users based on their listening preferences and also helps identify and eliminate duplicate tracks.
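Duplicate detection reuses the same similarity machinery: if two track embeddings lie within a small distance of each other, they are probably the same recording. The sketch below is a toy illustration only—it does not use Voyager's actual API, and the threshold value is an arbitrary assumption:

```python
import numpy as np
from itertools import combinations

def find_duplicates(embeddings, threshold=0.1):
    """Flag pairs of embeddings whose Euclidean distance falls below a
    threshold, i.e., tracks that are likely the same recording.
    A toy O(n^2) scan; a library like Voyager would use an ANN index."""
    dupes = []
    for i, j in combinations(range(len(embeddings)), 2):
        if np.linalg.norm(embeddings[i] - embeddings[j]) < threshold:
            dupes.append((i, j))
    return dupes

tracks = np.array([
    [0.50, 0.20, 0.90],  # track 0
    [0.51, 0.19, 0.89],  # track 1: near-identical to track 0 (a duplicate)
    [0.10, 0.80, 0.30],  # track 2: a different song
])
print(find_duplicates(tracks))  # → [(0, 1)]
```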
The growth of chatbots has seen a major boost following the advancement of generative LLMs like OpenAI's GPT-3 and Meta's Llama. Generative AI has enabled chatbots to engage in more natural and contextually relevant conversations, providing a personalized experience for users and, in some cases, resolving problems more efficiently than a human agent.
However, one of the key challenges in building chatbots is ensuring that they provide accurate and relevant responses to user queries. This is where retrieval-augmented generation (RAG) comes into play. RAG is a method used to enhance the reliability of generative AI chatbots: it combines the power of a generative model with an external knowledge base to improve the quality and relevance of the responses the chatbot produces. By grounding answers in that external knowledge base, RAG addresses the problem of "hallucinations" in generative LLMs—cases where the model produces a plausible but incorrect answer. This can occur when, for example, you ask ChatGPT to create an itinerary for your Barcelona trip, and it tells you to visit imaginary museums that do not even exist!
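At its core, a RAG pipeline is "retrieve, then generate." Below is a minimal sketch under stated assumptions: `embed` is a toy stand-in for a real embedding model, the knowledge base is made up, and the final LLM call is stubbed out, since the point here is the retrieval step that grounds the prompt:

```python
import numpy as np

def embed(text, dim=16):
    """Toy stand-in for a real embedding model (e.g., BERT): hashes each
    word into one of `dim` buckets. Real embeddings are learned, dense,
    and far higher-dimensional."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

knowledge_base = [
    "The Picasso Museum in Barcelona holds over 4,000 works.",
    "Vector databases store embeddings for similarity search.",
    "RAG grounds a language model's answers in retrieved documents.",
]
kb_vectors = [embed(doc) for doc in knowledge_base]

def retrieve(query, k=1):
    """Return the k documents whose embeddings are most similar to the query."""
    scores = [np.dot(embed(query), v) for v in kb_vectors]
    return [knowledge_base[i] for i in np.argsort(scores)[::-1][:k]]

def rag_answer(question):
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # A real system would now send `prompt` to an LLM; we just return it.
    return prompt

print(rag_answer("What does RAG do?"))
```

The key design point is that the generator only ever sees facts pulled from the knowledge base, which is what curbs hallucinated answers.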
One really cool application of this is Stack Overflow's intuitive search experience: they modeled Stack Overflow questions and answers as embeddings using a pre-trained BERT model, and used Weaviate, an open-source vector DB, to store and retrieve those embeddings and to compute similarity between user search queries and Stack Overflow results.
Ever wondered how e-commerce websites can recommend products so precisely personalized to your taste that you end up ordering items you didn't even know you wanted? E-commerce websites use product embeddings to personalize product recommendations. These embeddings are created based on the characteristics and relationships of the products and the order history of millions of other users.
In the embedding space, items that are frequently purchased together or share similar features are placed closer to each other, indicating a higher similarity between them. The types of data used for creating these embeddings can include purchase activity or co-rating similarity, where products rated similarly by users are considered alike.
Once the embedding model is trained, it can be used to generate personalized recommendations. When a user interacts with the system, their behavior and preferences are used to generate the user's embedding. Amazon uses the OpenSearch vector DB [3] to store all product embeddings and to find similarities between the user's embedding and those of products in the database. Products whose embeddings are closest to the user's embedding are then recommended.
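A minimal sketch of this idea, under the simplifying assumption that a user's embedding is just the average of the embeddings of products they have bought (real systems learn this mapping from behavior data, and the product vectors below are made up):

```python
import numpy as np

# Toy product embeddings (in practice, learned from co-purchase data).
products = {
    "running shoes": np.array([0.9, 0.1, 0.2]),
    "sports socks":  np.array([0.8, 0.2, 0.1]),
    "novel":         np.array([0.1, 0.9, 0.3]),
    "bookmark":      np.array([0.2, 0.8, 0.2]),
}

def user_embedding(purchased):
    """Simplifying assumption: the user is the mean of what they bought."""
    return np.mean([products[p] for p in purchased], axis=0)

def recommend(purchased, k=1):
    """Recommend the k un-purchased products closest to the user's embedding."""
    u = user_embedding(purchased)
    candidates = [p for p in products if p not in purchased]
    candidates.sort(key=lambda p: np.linalg.norm(products[p] - u))
    return candidates[:k]

print(recommend(["running shoes"]))  # → ['sports socks']
```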
So far, we've learned what vector databases are and how they're used. Now let's explore some examples of vector databases:
Several traditional databases have added support for handling and querying high-dimensional vector data, including:
In this era of AI-enabled applications, vector databases form the backbone of numerous applications that we interact with daily: the music recommendations in our playlists, chatbots acting as AI assistants that can solve problems once handled only by human agents, and much more. The synergy between vector databases and deep learning models sets the stage for a future where AI truly understands the nuances of human language, visual input, and sound. Imagine:
The potential is immense, and the possibilities are truly limitless. As vector databases and deep learning models continue to become more robust and refined, the lines between what machines can understand and how humans interact with information will continue to blur. We're one step closer to a future where AI-powered applications augment our abilities, leading to a better understanding of our world.
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE's position nor that of the Computer Society nor its Leadership.