Improved RAG with Llama3 and Ollama

Published in

GoPenAI

5 min readApr 19, 2024

In this article we will see on how to implement an advanced RAG with fully local infrastructure leveraging the most advanced openly available Large Language Model Llama-3 from meta, which was released yesterday. This article serves a firsthand cookbook for Day-1 implementation of advanced RAG using Llama-3.

Introduction:

In this article we will create an advanced RAG that will answer user queries based on a research paper that is given as input to the pipeline. The technology stack that is used in constructing this pipe is as below.

Ollama Embedding modelmxbai-embed-large
Ollama quantized Llama-3 8b model
locally hosted Qdrant vector database.

With this setup its clearly evident 2 things on the cost incurred is absolute 0 and the information is highly secure and private.

What id HyDE ?

HyDE, or Hypothetical Document Embeddings, emerges from the innovative work laid out in the 2022 paper by Gao et al. titled “Precise Zero-Shot Dense Retrieval without Relevance Labels.” The primary goal of this research endeavor was to enhance zero-shot dense retrieval, which relies on semantic embedding similarities. The solution presented, HyDE, operates through a two-step methodology.

In the initial step, known as Step 1, a language model — specifically exemplified by GPT-3 — is directed through instruction prompting to generate a hypothetical document based on the original query. This process is carefully tailored to a question posed within the paper, ensuring relevance despite the hypothetical nature of the document.

Moving to Step 2, the generated hypothetical document undergoes transformation into an embedding vector through the utilization of a Contriever, characterized as an “unsupervised contrastive encoder.” This encoder facilitates the conversion of the hypothetical document into a vector representation, which is then utilized for subsequent similarity search and retrieval tasks.

HyDE fundamentally functions by transmuting documents into vector embeddings via two pivotal components. The initial facet involves a generative task employing a language model, aimed at capturing relevance even within hypothetical documents, acknowledging the potential for factual inaccuracies. Subsequently, a document-document similarity task managed by a contrastive encoder refines the embedding process, filtering out extraneous details and enhancing efficiency.

Notably, HyDE surpasses the performance of existing unsupervised dense retrievers such as Contriever. Furthermore, it exhibits comparable performance to fine-tuned retrievers across diverse tasks and languages. This methodological approach condenses dense retrieval into two coherent tasks, signifying a notable advancement in semantic embedding-based retrieval methodologies.

Implementation:

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    Settings,
    get_response_synthesizer)
from llama_index.core.query_engine import RetrieverQueryEngine, TransformQueryEngine
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode, MetadataMode
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
import qdrant_client
import logging

Initializations:

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# load the local data directory and chunk the data for further processing
docs = SimpleDirectoryReader(input_dir="data", required_exts=[".pdf"]).load_data(show_progress=True)
text_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)

text_chunks = []
doc_ids = []
nodes = []

creating the vector store to push the embeddings.

# Create a local Qdrant vector store
logger.info("initializing the vector store related objects")
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="research_papers")

local embedding and LLM models

# local vector embeddings model
logger.info("initializing the OllamaEmbedding")
embed_model = OllamaEmbedding(model_name='mxbai-embed-large', base_url='http://localhost:11434')
logger.info("initializing the global settings")
Settings.embed_model = embed_model
Settings.llm = Ollama(model="llama3", base_url='http://localhost:11434')
Settings.transformations = [text_parser]

Creating the nodes, vector store, HyDE transformer and finally querying

logger.info("enumerating docs")
for doc_idx, doc in enumerate(docs):
    curr_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(curr_text_chunks)
    doc_ids.extend([doc_idx] * len(curr_text_chunks))

logger.info("enumerating text_chunks")
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc = docs[doc_ids[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)

logger.info("enumerating nodes")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.ALL)
    )
    node.embedding = node_embedding

logger.info("initializing the storage context")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
logger.info("indexing the nodes in VectorStoreIndex")
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    transformations=Settings.transformations,
)

logger.info("initializing the VectorIndexRetriever with top_k as 5")
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
response_synthesizer = get_response_synthesizer()
logger.info("creating the RetrieverQueryEngine instance")
vector_query_engine = RetrieverQueryEngine(
    retriever=vector_retriever,
    response_synthesizer=response_synthesizer,
)
logger.info("creating the HyDEQueryTransform instance")
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(vector_query_engine, hyde)

logger.info("retrieving the response to the query")
response = hyde_query_engine.query(
    str_or_query_bundle="what are all the data sets used in the experiment and told in the paper")
print(response)

client.close()

The above code begins by configuring logging for INFO level messages so that every log can be seen in the output, and then proceeds to load PDF data from a local directory, splitting it into text chunks. It sets up a Qdrant vector store for storing research paper embeddings and initializes an Ollama text embedding model for generating embeddings from the text. Global settings are configured, and text chunks are processed and associated with document IDs. Text nodes are created from the chunks, preserving metadata, and embeddings are generated for these nodes using the Ollama model. The script then sets up a storage context for indexing the text embeddings in the Qdrant vector store and proceeds to index them. A vector retriever is configured for retrieving similar embeddings, and a query engine is initialized for handling queries. A HyDE query transformation is set up for enhanced query processing. Finally, a query is executed to retrieve information about data sets mentioned in the paper’s experiment, and the response is printed.

Output:

Conclusion:

In conclusion, by harnessing the power of cutting-edge technologies such as the Llama-3 Large Language Model from Meta, alongside sophisticated methodologies like HyDE, and leveraging the capabilities of Ollama, we are poised to construct unparalleled RAG pipelines. Through meticulous fine-tuning of crucial hyperparameters such as top_k, chunk_size, and chunk_overlap, we can ascend to new heights of accuracy and efficacy. This fusion of advanced tools and meticulous optimization promises to unlock the full potential of our systems, paving the way for groundbreaking advancements and ensuring our solutions stand at the forefront of innovation and excellence with utmost privacy and security.

Improved RAG with Llama3 and Ollama

Introduction:

What id HyDE ?

Implementation:

Conclusion:

Written by M K Pavan Kumar