RAG for Everyone: A Beginner’s Guide to Embedding & Similarity Search

Sourav Bhattacharjee
Published in GoPenAI · 10 min read · Aug 29, 2023


In my last article I asked a question at the end. Did you find the answer? Don't worry if you could not. In this article I will explain why the model failed and walk through, in detail, how to solve such issues.

The Why

Wasn't it amusing that while most of us know the answer to the question, our LLM could not answer it? Although it seems like an LLM knows everything, it can really only answer from the data it was trained on. LLMs are trained on huge datasets drawn from the open internet, and our model's training data ends in 2021. So it is no surprise that it has no idea what happened in 2022, and that is why it failed.

The How

Now you might be wondering: this looks scary if we want to run an LLM on our own data, which may be changing every day. How can the LLM even know about it? The good news is that there are ways to make an LLM generate answers based on your own data. One such technique is RAG: Retrieval Augmented Generation. We will talk in detail about how to implement RAG and try to answer our failed question.

RAG — In-Context injection

We have already learnt from few-shot examples how to make the model answer a question using some examples and context. So what if we give the model context from our own data and ask it to answer based on that context only? It will surely be able to answer. You could, in theory, send all the data you have to the LLM to set the context; the LLM would then have the right material to search through before giving its answer.

But here is the catch. Every model has a limit on the number of tokens it can process. For example, the token limit for gpt-35-turbo is 4,096 tokens, and this limit covers the tokens in both the prompt and the completion. So you obviously cannot send all your data as part of the prompt at once. You have to send only the most relevant data (out of all the data you have) to the model so that it can learn from the context and answer accordingly. Now the biggest question is how to find that relevant information.
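If you want to see how close a prompt is to that limit, you can count its tokens yourself. Here is a minimal sketch using the tiktoken package (not used elsewhere in this post, so treat it as an optional check):

import tiktoken

# load the tokenizer used by gpt-3.5-turbo
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = "Who won the FIFA World Cup in 2022?"
num_tokens = len(encoding.encode(prompt))
print(num_tokens)  # number of tokens this prompt will consume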

Enter Embeddings

Embedding is the technique we can use to mitigate this problem. But what is embedding?

Basically, these large language models are complex neural networks, and as you may know, neural networks only know how to work with numbers. The inputs and outputs of a neural network are numbers. But here we have plain English text. So how does this text enter the neural network? One of the techniques is embedding.

Embedding is a way of representing data, almost any kind of data (text, images, videos, users, music, and so on), as points in space where the locations of those points are semantically meaningful.

In other words, embeddings are a way of taking complex data and representing it in a simpler way that is easier for machines to understand. This is important for Gen AI because it allows machines to learn from and understand the world around them in a more meaningful way.

Embeddings can help machines to learn the relationships between different pieces of data. For example, an embedding for the word “dog” might be close to the embeddings for the words “cat”, “pet”, and “animal”. This shows that the machine understands that these words are related.

Now, in more technical terms, embedding is about representing a text as a vector in n-dimensional space.

Some Examples

Let's try a few examples. We will use an OpenAI embedding model to embed a few words and see what a word's vector representation looks like. These examples are the baseline, and we will use them throughout this post.

Let’s install the required packages and import them

pip install openai langchain numpy pandas

from langchain import PromptTemplate
from langchain.llms import OpenAI
import os
import numpy as np
from numpy.linalg import norm
import pandas as pd
import openai

Next, we are going to initialise the program with the OpenAI API key

openai.api_key = os.getenv("OPENAI_API_KEY")

Next, we will create a function which returns the embedding of a given text using the OpenAI embedding model

def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
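Note that this uses the older openai Python SDK (pre-1.0), where openai.Embedding.create is available. If you are on openai >= 1.0, a rough equivalent looks like the sketch below (the function name get_embedding_v1 is just for illustration):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding_v1(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    # the response object exposes the vector at .data[0].embedding
    return client.embeddings.create(input=[text], model=model).data[0].embedding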

Now, let's use the get_embedding function to get the embeddings

word = "cat"
word_embedding = get_embedding(text=word, model="text-embedding-ada-002")
df_word = pd.DataFrame({'embed': word_embedding})
df_word.shape

If you run the program you will see the following output

(1536, 1)

That means it created a vector representation of the word "cat" in 1536-dimensional space. Now let's examine the data by printing the pandas DataFrame

        embed
0 -0.007023
1 -0.017333
2 -0.009632
3 -0.030720
4 -0.012500
... ...
1531 0.010498
1532 0.020045
1533 -0.014310
1534 -0.023423
1535 -0.014203

[1536 rows x 1 columns]

As you can see, this has transformed the word "cat" into a vector representation. Similarly, if we run it for another animal, say "dog", we will again get a vector representation. Let's examine it

         embed
0 -0.003393
1 -0.017721
2 -0.015950
3 -0.017477
4 -0.018059
... ...
1531 0.027602
1532 0.031251
1533 -0.006177
1534 0.004562
1535 -0.019099

[1536 rows x 1 columns]

Now, since the words are represented as vectors of numbers, we can do mathematical operations on them. And that is exactly what a neural network (deep learning model) wants.
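For example, with the "cat" embedding from above and a fresh "dog" embedding, we can already compute a dot product between the two words. This is just a small illustration of the idea; the variable name dog_embedding below is introduced only for this sketch:

dog_embedding = get_embedding(text="dog", model="text-embedding-ada-002")

# both words are now 1536-dimensional vectors, so ordinary vector
# arithmetic applies; the dot product is one way to compare them
print(np.dot(word_embedding, dog_embedding))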

Back to the initial problem

Now that we know a little bit about how to represent text, we will try to work out how to find the most relevant information in a large volume of data. For simplicity, we will start with a very simple example and use the same functions as before.

We will start by creating some made-up sentences and then try to find the relevant sentence when we ask a question.

contexts = ["I have a dog. My dog's name is Jimmy",
            "I have a cat. My cat's name is biscuit",
            "My dog who likes to listen to music",
            "My cat likes to play with cricket balls "]

df_context = pd.DataFrame({'context': contexts})
df_context['embedding'] = df_context.context.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
print(df_context)

Again, like before, it will print the DataFrame. But this time it contains both the sentence and its vector representation within a single row

                                    context  \
0 I have a dog. My dog's name is Jimmy
1 I have a cat. My cat's name is biscuit
2 My dog who likes to listen to music
3 My cat likes to play with cricket balls

embedding
0 [-0.02697647735476494, -0.012099206447601318, ...
1 [-0.035566531121730804, -0.00652551231905818, ...
2 [-0.017062265425920486, -0.007684292271733284,...
3 [-0.008196039125323296, 0.006231278646737337, ...

Now some question time. We will ask the question "What is my cat's name?" and represent this question, again, as a vector using the same embedding technique

question = "What is my cat's name?"
question_embedding = get_embedding(text=question, model="text-embedding-ada-002")
df_embedding = pd.DataFrame({'embed': question_embedding})
df_embedding

        embed
0 -0.007904
1 0.002608
2 -0.007347
3 -0.012890
4 -0.020071
... ...
1531 0.000499
1532 0.011055
1533 -0.015004
1534 -0.012322
1535 -0.044690

Now we have both our data and the question in embedded vector format. We will do a similarity check between these vectors using some simple mathematics (think of the matrix dot product from Engineering Math :) )

Find Similarities

One of the most widely used measures of vector similarity is cosine similarity. We will use it here. The formula for cosine similarity is as follows:

cos_sim(x, y) = (x . y) / (||x|| * ||y||)

where,

x . y = dot product of the vectors 'x' and 'y'
||x|| and ||y|| = length (magnitude) of the vectors 'x' and 'y'
||x|| * ||y|| = product of the two magnitudes
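Before applying it to the embeddings, here is a tiny worked example on made-up 3-dimensional vectors, just to see the formula in action (the vectors x, y, z below are invented for illustration):

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same direction as x
z = np.array([-1.0, 0.0, 1.0])  # a different direction

# vectors pointing in the same direction have cosine similarity 1.0
print(np.dot(x, y) / (norm(x) * norm(y)))  # 1.0

# less aligned vectors score lower
print(np.dot(x, z) / (norm(x) * norm(z)))  # roughly 0.38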

Now let's apply it to our context embeddings in Python.

cos_sim = []
for index, row in df_context.iterrows():
    x = row.embedding
    y = question_embedding
    # calculate the cosine similarity
    cosine = np.dot(x, y) / (norm(x) * norm(y))
    cos_sim.append(cosine)

df_context["cos_sim"] = cos_sim
print(df_context)
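As a side note, the same calculation can be written without the explicit loop, using a pandas apply; this is just an equivalent formulation of the loop above, not a required change:

# equivalent: compute the cosine similarity for every row in one go
df_context["cos_sim"] = df_context.embedding.apply(
    lambda e: np.dot(e, question_embedding) / (norm(e) * norm(question_embedding))
)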

This will calculate the cosine similarity between the question and the data we had. Let’s examine the output

                                    context  \
0 I have a dog. My dog's name is Jimmy
1 I have a cat. My cat's name is biscuit
2 My dog who likes to listen to music
3 My cat likes to play with cricket balls

embedding cos_sim
0 [-0.02697647735476494, -0.012099206447601318, ... 0.828285
1 [-0.035566531121730804, -0.00652551231905818, ... 0.870383
2 [-0.017062265425920486, -0.007684292271733284,... 0.793400
3 [-0.008196039125323296, 0.006231278646737337, ... 0.830946

Check the cos_sim values in the output. As you can see, the sentence "I have a cat. My cat's name is biscuit" got the highest cos_sim value. So vector similarity has determined that this is the most relevant piece of information, out of all the data present, for the question "What is my cat's name?". Once we have found this information, we can easily inject this relevant context into our prompt, and the LLM should be able to answer questions from it.
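Picking the winning sentence programmatically is then a one-liner on top of what we already computed (shown here only as a small convenience):

# grab the row with the highest cosine similarity
best_match = df_context.sort_values("cos_sim", ascending=False).iloc[0]
print(best_match.context)  # "I have a cat. My cat's name is biscuit"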

Isn't this beautiful?

Back to the 2022 problem

Now let us try to solve the original problem we had: "Who won the FIFA World Cup 2022?"

We will do some web scraping here. The website will work as our source of data. From that data we will create small chunks of text, each within a small size limit, so that the prompt does not exceed the token limit. We will then use the technique above to find the most relevant chunks for the question and send them to the LLM to generate the answer. We will also use LangChain for prompting and for text chunking. OK, let's start.

We will start by importing a few more modules

import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter

BeautifulSoup will be used as the web scraping framework. If it is not installed, please install it using pip (the package name is beautifulsoup4)

We will use the following url as our source data: https://olympics.com/en/news/fifa-world-cup-winners-list-champions-record

headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
url = 'https://olympics.com/en/news/fifa-world-cup-winners-list-champions-record'

We will read the data from the website and parse the text

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the text on the page
text = soup.get_text()

Now it's time to split the text into small chunks using LangChain

text_splitter = RecursiveCharacterTextSplitter(
    # Set a fairly small chunk size, with some overlap between chunks.
    chunk_size = 500,
    chunk_overlap = 100,
    length_function = len,
)

texts = text_splitter.create_documents([text])
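Before embedding anything, it can be helpful to quickly inspect the result of the split; a small, purely optional check like this shows how many chunks we got and what one of them looks like:

# how many chunks did the splitter produce?
print(len(texts))

# peek at the first chunk (a LangChain Document; the text lives in .page_content)
print(texts[0].page_content[:200])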

Our data is prepared, and now it's time to play with the embedding magic

text_chunks = []
for text in texts:
    text_chunks.append(text.page_content)

df = pd.DataFrame({'text_chunks': text_chunks})
print(df.head(5))

df['ada_embedding'] = df.text_chunks.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

Next, calculate the embedding of the user's question

users_question = "Who was the winner in Fifa 2022 world cup?"
question_embedding = get_embedding(text=users_question, model="text-embedding-ada-002")
df_question = pd.DataFrame({'embed': question_embedding})

Like before, we will calculate the cosine similarity

cos_sim = []
for index, row in df.iterrows():
    A = row.ada_embedding
    B = question_embedding
    # calculate the cosine similarity
    cosine = np.dot(A, B) / (norm(A) * norm(B))
    cos_sim.append(cosine)

df["cos_sim"] = cos_sim

Now we will build the prompt using the relevant context

llm = OpenAI(temperature=0)

# define the context for the prompt by joining the most relevant text chunks
# (sort by cos_sim so we pick the top matches, not just the first five rows)
context = ""

for index, row in df.sort_values("cos_sim", ascending=False)[0:5].iterrows():
    context = context + " " + row.text_chunks

# define the prompt template
template = """
You are a sports journalist! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the context, say "Sorry, I am a bad journalist!"

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context = context, users_question = users_question)

We are all set to check the response

res = llm(prompt_text)
print(res)

And the output is

Argentina was the winner in Fifa 2022 world cup.

Of course, it is Argentina. Now the model has the context and can find the answer from it.

So now we know how to use the RAG technique to inject context relevant to our own data and use the power of an LLM to generate output grounded in that data. This is fascinating. But are we done? Not yet.

The problem

The problem with the above approach is that we have to repeat all the steps every time we want to interact with the LLM. Web scraping, embedding, and cosine similarity are costly (think time-consuming) operations for a large dataset. So how do we overcome that?

The Solution

The solution is to save the vector representations in a database and then do the similarity check against that database. There are a number of vector stores available for this purpose. Some of the popular ones are:

Chroma
Pinecone
Weaviate
Milvus
Redis
pgvector

In our next article, we will see how we can use a vector store to store the vectors and get results quickly. We will use Pinecone in that example to store the resulting vectors and then perform a similarity search to find the relevant piece of information. We will use the same example as above.
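Just to make the idea concrete before then, here is a minimal sketch of the same flow with Chroma, one of the stores listed above. It is an illustrative example only (collection name and variable reuse from the code above are assumptions), not the Pinecone setup we will build next time:

import chromadb

# an in-memory Chroma client; a real setup would use a persistent store
client = chromadb.Client()
collection = client.create_collection("world_cup")

# store each chunk together with the embedding we already computed
collection.add(
    ids=[str(i) for i in df.index],
    documents=df.text_chunks.tolist(),
    embeddings=df.ada_embedding.tolist(),
)

# retrieve the most similar chunks for the question in one call
results = collection.query(query_embeddings=[question_embedding], n_results=5)
print(results["documents"][0])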

Conclusion

As you saw in the example above, we have successfully used a technique called RAG to inject context into the prompt so that the LLM can act as a completion tool over our own data. We will simplify this further with the help of a vector store in our next example. But I would like to share a word of caution on RAG. In many places it is written that Vector Similarity Search + LLM = RAG, but this is not correct. RAG = Information Retrieval + LLM; vector similarity search is just one way of doing the retrieval. Here is a nice article you can read

Bonus Item

I have prepared a Colab notebook which has the full code described in this blog. You can check it at the link below

https://colab.research.google.com/drive/1tgpmnvs7yCBHtllrfYHtoSgcQfGZhYMI?usp=sharing
