Primer on Vector Databases and Retrieval-Augmented Generation (RAG) using Langchain, Pinecone & HuggingFace

Jayita Bhattacharyya · Published in GoPenAI · 9 min read · Aug 16, 2023

Vector Databases

Vector databases, also known as vector stores, are specialized databases designed to store and retrieve vector data efficiently. In the context of computer science and artificial intelligence, a vector refers to an array or list of numerical values that represent a point in a multi-dimensional space. Each element in the vector corresponds to a specific dimension of the space.

These databases are beneficial for handling data with a natural vector representation, such as embeddings from deep learning models, numerical features extracted from various sources, or any other data that can be represented as vectors.

Vector databases typically provide efficient storage, indexing, and querying mechanisms optimized for vector data. Traditional relational databases are not well-suited for handling vector data efficiently, as they are designed primarily for tabular data with fixed columns. Vector databases, on the other hand, are designed to support high-dimensional and variable-length vectors, allowing for flexible data storage and retrieval.

Credits: Redis

Some key features of vector databases include:

1. Indexing: Vector databases use specialized indexing techniques, such as k-d trees, ball trees, or locality-sensitive hashing (LSH), to enable fast and efficient search operations over large datasets of vectors.

2. Similarity Search: One of the essential operations in vector databases is similarity search. Given a query vector, the database can efficiently find the closest vectors in the dataset based on a chosen distance metric (e.g., Euclidean distance or cosine similarity); a short example follows this list.

3. High-Dimensional Support: Vector databases are built to handle high-dimensional data effectively, often with hundreds or thousands of dimensions, far beyond what traditional databases index well.

4. Scalability: Many vector databases are designed to scale horizontally, allowing for efficient distributed storage and querying across multiple nodes or clusters.

5. Support for Embeddings: Vector databases are particularly popular in applications involving natural language processing (NLP) and computer vision, where embeddings from deep learning models are commonly used to represent textual or visual information.
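For intuition, similarity between two vectors can be computed directly. Below is a tiny NumPy sketch of cosine similarity, one of the metrics mentioned in point 2; it is purely illustrative and not tied to any particular database.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, 0.0 means they are unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.1, 0.8, 0.3])
b = np.array([0.2, 0.7, 0.4])
print(cosine_similarity(a, b))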

Credits: Milvus

Examples of vector databases include:

  • Annoy: An efficient C++ library for approximate nearest neighbour search.
  • Faiss: A library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors (a minimal usage sketch follows this list).
  • Milvus: An open-source vector database powered by Faiss and designed for scalable vector similarity search.
  • Elasticsearch with vector similarity: Elasticsearch can be extended with plugins to support vector similarity search using specialized indexing techniques.
  • Weaviate: An open-source, cloud-native, real-time vector search engine. Weaviate is built on the concept of a knowledge graph, where entities and their relationships are represented as vectors in a multi-dimensional space.
  • Pinecone: A scalable, fully managed vector database for storing and indexing high-dimensional vectors efficiently. It is optimized for similarity search, letting you find the most similar vectors based on distance metrics like Euclidean distance or cosine similarity.
  • Others include ChromaDB and Qdrant.
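To give a flavour of how such a library is used, here is a minimal Faiss sketch: an exact (brute-force) L2 index over randomly generated 384-dimensional vectors. The dimension and data are assumptions for illustration only.

import faiss
import numpy as np

d = 384                                    # assumed embedding dimension
xb = np.random.rand(10000, d).astype("float32")  # database vectors
index = faiss.IndexFlatL2(d)               # exact L2 (brute-force) index
index.add(xb)
distances, ids = index.search(xb[:1], k=5) # 5 nearest neighbours of the first vector
print(ids)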

Vector databases have found applications in various fields, including recommendation systems, information retrieval, image and video search, anomaly detection, and more. They play a crucial role in accelerating the development of AI-driven applications by efficiently managing and searching high-dimensional vector data.

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is a natural language processing (NLP) approach that combines the benefits of both retrieval-based and generation-based methods for content generation tasks. It aims to improve the quality and controllability of the generation tasks by leveraging a pre-trained language model in conjunction with a retrieval mechanism.

In traditional language generation tasks, like text completion or question answering, generative models like GPT (Generative Pre-trained Transformer) have shown impressive capabilities in generating fluent and contextually relevant text. However, they may sometimes produce incorrect or inconsistent responses, especially when the input context is ambiguous or the data is scarce.

Credits: Google Research Blog

On the other hand, retrieval-based models can effectively retrieve relevant responses from a large database of pre-written responses or documents. They are often used in chatbots or information retrieval systems to provide accurate answers. However, they lack the creativity and flexibility of generative models, as they are constrained to the available responses in their database.

RAG combines the strengths of both approaches by integrating a retrieval mechanism into the generation process. It uses a two-step process (a small illustrative sketch follows the two steps below):

1. Retrieval: The model first retrieves relevant information or context from a database using a retrieval method such as BM25, TF-IDF, or dense vector similarity search. The retrieved information serves as additional context for the generation step.

2. Generation: Once the relevant context is retrieved, it is combined with the original input, and a generative language model (like GPT) is used to produce the final output. The generative model can now use both the original input and the retrieved context to generate coherent and contextually consistent responses.
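Conceptually, the two steps can be wired together as below. This is an illustrative sketch assuming LangChain-style retriever and llm objects, not the exact pipeline built later in this post.

def rag_answer(question, retriever, llm):
    # Step 1: Retrieval - fetch passages relevant to the question
    passages = retriever.get_relevant_documents(question)
    context = "\n".join(p.page_content for p in passages)
    # Step 2: Generation - answer with the retrieved context prepended
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm.predict(prompt)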

Credits: Jerry Liu

Benefits of Retrieval-Augmented Generation:

1. Contextual Consistency: The retrieval step helps ensure that the generated output is contextually relevant and consistent with the provided input and the retrieved context.

2. Improved Accuracy: By retrieving relevant information from a database, RAG can provide more accurate responses, especially in information retrieval or question-answering tasks.

3. Controllability: The retrieval mechanism allows developers to control the scope of the generated responses by choosing the appropriate retrieval database or providing specific queries.

4. Reduced Generation Bias: By using retrieved-context, RAG can help mitigate some of the generation biases commonly observed in pure generative models.

Applications of Retrieval-Augmented Generation:

  • Chatbots: RAG can enhance chatbot responses by retrieving relevant answers from a knowledge base before generating a response.
  • Question Answering: RAG can improve question-answering systems by combining retrieval with a generative model to generate accurate and informative answers.
  • Document Summarization: RAG can be applied to generate more contextually relevant summaries by retrieving relevant sentences from source documents.

RAG is an exciting direction in NLP and has the potential to advance the state-of-the-art in various generation tasks by leveraging the benefits of both retrieval and generation models.

Blending both Superpowers together

Enough of jibber-jabber! Let’s get to some coding:

For demonstration purposes, I'll be using Pinecone's vector database, which offers one free index per account. Additionally, we'll leverage LangChain to build a document Q&A chatbot in Python.

Let’s see how we can set up a Pinecone Index:

Upon signing up, you land on the Pinecone console homepage.

From there we can create an index, selecting the options needed for our application.

Next, we copy the API key and environment so we can call the Pinecone APIs from our codebase.

To connect to Pinecone from Python, we need the pinecone-client package.
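It can be installed with pip (pip install pinecone-client); for the rest of this walkthrough we also need the langchain and sentence-transformers packages.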

Let us now go ahead and import the necessary libraries:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import HuggingFaceEmbeddings

# Initializing Pinecone Vector DB (API key and environment come from the Pinecone console)
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV,
)

# Pinecone Vector DB index name (the index created earlier in the console)
index_name = 'langchain-demo'
index = pinecone.Index(index_name)

# Load the source document
loader = TextLoader("PLACE FILE PATH HERE")
docs = loader.load()

This loads the document into LangChain's Document format; the TextLoader module from LangChain's document loaders handles plain-text data sources.

text_splitter = CharacterTextSplitter(
    chunk_size=1000,    # Specify chunk size
    chunk_overlap=200,  # Specify chunk overlap to prevent loss of information
)
docs_split = text_splitter.split_documents(docs)

In the snippet above, chunk_size is the number of characters taken from the document in each chunk before it is converted to an embedding. chunk_overlap keeps a portion of the previous chunk at the start of the next one, so that sentence context is preserved across chunk boundaries.

embeddings = HuggingFaceEmbeddings()

# Create embeddings and upsert them into the vector store
doc_db = Pinecone.from_documents(
    docs_split,
    embeddings,
    index_name=index_name,
)

Here I've used the Hugging Face sentence-transformers model sentence-transformers/all-MiniLM-L12-v2 to convert text to embeddings. Embeddings are numerical representations of text as vectors in a high-dimensional space, and distances between these vectors can be used to measure similarity between different pieces of text.

The above snippet embeds the split documents and uploads (upserts) the resulting vectors to the Pinecone cloud index.
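Note that HuggingFaceEmbeddings() with no arguments falls back to LangChain's default sentence-transformers model, so to be certain the model mentioned above is used, the model name can be passed explicitly (a small variant of the snippet above):

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2"
)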

query = "PLACE USER QUERY HERE"
# search for matched entities and return score
search_docs = doc_db.similarity_search_with_score(query)

After the data is uploaded, run a similarity search on the user query. This returns the matched documents along with their metadata (including the source) and a similarity score for each match.
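similarity_search_with_score returns a list of (document, score) pairs, so the matches can be inspected like this (a minimal sketch):

for doc, score in search_docs:
    # score is the similarity score reported by Pinecone; metadata holds the source file
    print(round(score, 3), doc.metadata.get("source"), doc.page_content[:80])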

Similarity search, or semantic search, relies on word meaning rather than just keyword matching. With the advent of advanced NLP techniques, specifically transformers and attention mechanisms, we are better able to understand how words relate to one another within a sentence. This is based on the self-attention mechanism, which works on query, key and value vectors (a small illustrative sketch follows the list below).

  • Query (Q): Represents the current element's information, used to score how strongly it should attend to other elements.
  • Key (K): Represents the information in other elements against which the query is scored.
  • Value (V): Holds the actual content of each element, which is aggregated according to the attention weights.
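As a rough illustration of how Q, K and V interact, here is a minimal NumPy sketch of scaled dot-product attention; it is for intuition only and is not part of the chatbot code.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Score each query against every key, scaled by the key dimension
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the values
    return weights @ V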

Up to this point, we have implemented the retrieval-augmentation part. Now we will look at the generation part using an LLM.

from langchain import HuggingFaceHub

# Requires the HUGGINGFACEHUB_API_TOKEN environment variable to be set
repo_id = "tiiuae/falcon-40b"
llm = HuggingFaceHub(
    repo_id=repo_id,
    model_kwargs={"temperature": 0, "max_length": 64},
)

I've used the Falcon-40B model from the Hugging Face Hub, which is an open-source LLM. We could use other models as well, such as those from OpenAI.

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=doc_db.as_retriever(),
)

query = "USER QUERY"
result = qa.run(query)
result
# Sample output:
# 'The total revenues for the full year 2022 were $282,836 million, with operating income and operating margin information not provided in the given context.'

Using LangChain's retriever and chain modules, we can prompt the LLM with the retrieved context so that it returns structured, human-readable information about our document, far more readable than the raw retrieval output above. The chain_type='stuff' option simply "stuffs" all retrieved chunks into the LLM prompt as context.
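If you also want to see which chunks the answer was based on, RetrievalQA can return the source documents as well; a small variant of the chain above (assuming the same llm and doc_db objects):

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=doc_db.as_retriever(),
    return_source_documents=True,
)
output = qa({"query": query})
print(output["result"])
print(output["source_documents"])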

Conclusion

The era of generative AI has opened up many new capabilities for existing systems, and here we looked at one of them: vector databases and retrieval-augmented generation. This has been a basic overview of one such approach; there is much more that can be done around it, such as building AI agents to process any kind of data, be it text, images, video or audio. RAG and vector databases help work around the limited context windows of LLMs and ground their reasoning in a historical knowledge base.
