For additional ideas on how to improve the performance of your RAG pipeline to make it production-ready, continue reading here:
This section discusses the required packages and API keys to follow along in this article.
Required Packages
This article will guide you through implementing a naive and an advanced RAG pipeline using LlamaIndex in Python.
pip install llama-index
In this article, we will be using LlamaIndex v0.10
. If you are upgrading from an older LlamaIndex version, you need to run the following commands to install and run LlamaIndex properly:
pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall
LlamaIndex offers an option to store vector embeddings locally in JSON files for persistent storage, which is great for quickly prototyping an idea. However, we will use a vector database for persistent storage since advanced RAG techniques aim for production-ready applications.
Since we will need metadata storage and hybrid search capabilities in addition to storing the vector embeddings, we will use the open source vector database Weaviate (v3.26.2
), which supports these features.
pip install weaviate-client llama-index-vector-stores-weaviate
API Keys
We will be using Weaviate embedded, which you can use for free without registering for an API key. However, this tutorial uses an embedding model and LLM from OpenAI, for which you will need an OpenAI API key. To obtain one, you need an OpenAI account and then “Create new secret key” under API keys.
Next, create a local .env
file in your root directory and define your API keys in it:
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Afterwards, you can load your API keys with the following code:
# !pip install python-dotenv
import os
from dotenv import load_dotenv,find_dotenvload_dotenv(find_dotenv())
This section discusses how to implement a naive RAG pipeline using LlamaIndex. You can find the entire naive RAG pipeline in this Jupyter Notebook. For the implementation using LangChain, you can continue in this article (naive RAG pipeline using LangChain).
Step 1: Define the embedding model and LLM
First, you can define an embedding model and LLM in a global settings object. Doing this means you don’t have to specify the models explicitly in the code again.
- Embedding model: used to generate vector embeddings for the document chunks and the query.
- LLM: used to generate an answer based on the user query and the relevant context.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import SettingsSettings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()
Step 2: Load data
Next, you will create a local directory named data
in your root directory and download some example data from the LlamaIndex GitHub repository (MIT license).
!mkdir -p 'data'
!wget '<https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt>' -O 'data/paul_graham_essay.txt'
Afterward, you can load the data for further processing:
from llama_index.core import SimpleDirectoryReader# Load data
documents = SimpleDirectoryReader(
input_files=["./data/paul_graham_essay.txt"]
).load_data()
Step 3: Chunk documents into nodes
As the entire document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes
in LlamaIndex. You can parse the loaded documents into nodes using the SimpleNodeParser
with a defined chunk size of 1024.
from llama_index.core.node_parser import SimpleNodeParsernode_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
Step 4: Build index
Next, you will build the index that stores all the external knowledge in Weaviate, an open source vector database.
First, you will need to connect to a Weaviate instance. In this case, we’re using Weaviate Embedded, which allows you to experiment in Notebooks for free without an API key. For a production-ready solution, deploying Weaviate yourself, e.g., via Docker or utilizing a managed service, is recommended.
import weaviate# Connect to your Weaviate instance
client = weaviate.Client(
embedded_options=weaviate.embedded.EmbeddedOptions(),
)
Next, you will build a VectorStoreIndex
from the Weaviate client to store your data in and interact with.
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStoreindex_name = "MyExternalContext"
# Construct vector store
vector_store = WeaviateVectorStore(
weaviate_client = client,
index_name = index_name
)
# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Setup the index
# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex(
nodes,
storage_context = storage_context,
)
Step 5: Setup query engine
Lastly, you will set up the index as the query engine.
# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()
Step 6: Run a naive RAG query on your data
Now, you can run a naive RAG query on your data, as shown below:
# Run your naive RAG query
response = query_engine.query(
"What happened at Interleaf?"
)
In this section, we will cover some simple adjustments you can make to turn the above naive RAG pipeline into an advanced one. This walkthrough will cover the following selection of advanced RAG techniques:
As we will only cover the modifications here, you can find the full end-to-end advanced RAG pipeline in this Jupyter Notebook.
For the sentence window retrieval technique, you need to make two adjustments: First, you must adjust how you store and post-process your data. Instead of the SimpleNodeParser
, we will use the SentenceWindowNodeParser
.
from llama_index.core.node_parser import SentenceWindowNodeParser# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3,
window_metadata_key="window",
original_text_metadata_key="original_text",
)
The SentenceWindowNodeParser
does two things:
- It separates the document into single sentences, which will be embedded.
- For each sentence, it creates a context window. If you specify a
window_size = 3
, the resulting window will be three sentences long, starting at the previous sentence of the embedded sentence and spanning the sentence after. The window will be stored as metadata.
During retrieval, the sentence that most closely matches the query is returned. After retrieval, you need to replace the sentence with the entire window from the metadata by defining a MetadataReplacementPostProcessor
and using it in the list of node_postprocessors
.
from llama_index.core.postprocessor import MetadataReplacementPostProcessor# The target key defaults to `window` to match the node_parser's default
postproc = MetadataReplacementPostProcessor(
target_metadata_key="window"
)
...
query_engine = index.as_query_engine(
node_postprocessors = [postproc],
)
Implementing a hybrid search in LlamaIndex is as easy as two parameter changes to the query_engine
if the underlying vector database supports hybrid search queries. The alpha
parameter specifies the weighting between vector search and keyword-based search, where alpha=0
means keyword-based search and alpha=1
means pure vector search.
query_engine = index.as_query_engine(
...,
vector_store_query_mode="hybrid",
alpha=0.5,
...
)
Adding a reranker to your advanced RAG pipeline only takes three simple steps:
- First, define a reranker model. Here, we are using the
BAAI/bge-reranker-base
from Hugging Face. - In the query engine, add the reranker model to the list of
node_postprocessors
. - Increase the
similarity_top_k
in the query engine to retrieve more context passages, which can be reduced totop_n
after reranking.
# !pip install torch sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank# Define reranker model
rerank = SentenceTransformerRerank(
top_n = 2,
model = "BAAI/bge-reranker-base"
)
...
# Add reranker to query engine
query_engine = index.as_query_engine(
similarity_top_k = 6,
...,
node_postprocessors = [rerank],
...,
)