Imagine you're building a customer support AI that must answer questions about your product. Sometimes it needs to pull information from your documentation, while other times it needs to search the web for the latest updates. Agentic RAG systems come in handy in these kinds of complex AI applications. Think of them as smart research assistants who not only know your internal documentation but also decide when to go search the web. In this guide, we will walk through the process of building an agentic QA RAG system using the Haystack framework.
Learning Objectives
- Know what an agentic LLM is and understand how it differs from a RAG system.
- Get familiar with the Haystack framework for agentic LLM applications.
- Understand the process of building a prompt from a template and learn how to join different prompts together.
- Learn how to create embeddings using ChromaDB in Haystack.
- Learn how to set up a complete local development system, from embedding to generation.
This article was published as a part of the Data Science Blogathon.
What is an Agentic LLM?
An agentic LLM is an AI system that can autonomously make decisions and take actions based on its understanding of the task. Unlike traditional LLMs that primarily generate text responses, an agentic LLM can do much more. It can think, plan, and act with minimal human input. It assesses its knowledge, recognizing when it needs more information or external tools. Agentic LLMs don't rely on static data or indexed knowledge; instead, they decide which sources to trust and how to gather the best insights.
This kind of system can also pick the right tools for the job. It can decide when it needs to retrieve documents, run calculations, or automate tasks. What sets it apart is its ability to break complex problems down into steps and execute them independently, which makes it valuable for research, analysis, and workflow automation.
RAG vs Agentic RAG
Traditional RAG systems follow a linear process. When a query is received, the system first identifies the key elements of the request. It then searches the knowledge base, scanning for relevant information that can help craft an accurate response. Once the relevant information is retrieved, the system processes it to generate a meaningful and contextually relevant response.
You can follow the process easily in the diagram below.

Now, an agentic RAG system enhances this process by:
- Evaluating query requirements
- Deciding between multiple knowledge sources
- Potentially combining information from different sources
- Making autonomous decisions about response strategy
- Providing source-attributed responses
The key difference lies in the system's ability to make intelligent decisions about how to handle queries, rather than following a fixed retrieval-generation pattern.
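To make that contrast concrete, here is a minimal, self-contained sketch of the decision loop in plain Python. The retrieve, generate, and web_search functions below are hypothetical stubs standing in for real retrieval, LLM, and search calls; they are not actual APIs.
# Hypothetical stubs standing in for real retrieval, generation, and search
def retrieve(query):
    return []  # stub: pretend the local knowledge base has nothing relevant

def web_search(query):
    return ["web snippet"]  # stub: pretend the web search returned one document

def generate(query, docs):
    return "no_answer" if not docs else f"answer built from {len(docs)} document(s)"

def answer(query: str) -> str:
    docs = retrieve(query)            # try the local knowledge base first
    draft = generate(query, docs)     # let the LLM answer from that context
    if "no_answer" in draft:          # the model judged the context insufficient
        return generate(query, web_search(query))  # fall back to web search
    return draft

print(answer("What is photosynthesis?"))  # falls back to the web branch here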
Understanding Haystack Framework Components
Haystack is an open-source framework for building production-ready AI and LLM applications, RAG pipelines, and search systems. It offers a powerful and flexible foundation for building LLM applications. It lets you integrate models from various platforms such as Hugging Face, OpenAI, Cohere, Mistral, and local Ollama. You can also deploy models on cloud services like AWS SageMaker, Bedrock, Azure, and GCP.
Haystack provides robust document stores for efficient data management. It also comes with a comprehensive set of tools for evaluation, monitoring, and data integration, which ensure smooth performance across all layers of your application. It also has a strong community that regularly contributes new integrations from various service providers.
What Can You Build Using Haystack?
- Simple to advanced RAG on your own data, using powerful retrieval and generation techniques.
- Chatbots and agents using up-to-date GenAI models like GPT-4, Llama3.2, and Deepseek-R1.
- Generative multimodal question-answering systems over mixed-type (image, text, audio, and table) knowledge bases.
- Information extraction from documents, or knowledge graph construction.
Haystack Building Blocks
Haystack has two primary concepts for building fully functional GenAI LLM systems: components and pipelines. Let's understand them with a simple example of RAG on Japanese anime characters.
Components
Components are the core building blocks of Haystack. They can perform tasks such as document storage, document retrieval, text generation, and embedding. Haystack has many components you can use directly after installation, and it also provides an API for making your own components by writing a Python class (see the sketch below).
There is a collection of integrations from partner companies and the community.
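For instance, writing your own component roughly amounts to decorating a Python class with @component and declaring its outputs; the toy component below is a minimal sketch added for illustration, not taken from the original article.
from haystack import component

@component
class UpperCaser:
    """A toy custom component that upper-cases incoming text."""

    @component.output_types(text=str)
    def run(self, text: str):
        # The returned dict keys become the component's named outputs
        return {"text": text.upper()}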
Install Libraries and Set Up Ollama
$ pip install haystack-ai ollama-haystack
# On your system, download Ollama and pull the LLM models
ollama pull llama3.2:3b
ollama pull nomic-embed-text
# And then start the Ollama server
ollama serve
Import some components
from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.generators.ollama import OllamaGenerator
Create a document and document store
document_store = InMemoryDocumentStore()
documents = [
    Document(
        content="Naruto Uzumaki is a ninja from the Hidden Leaf Village and aspires to become Hokage."
    ),
    Document(
        content="Luffy is the captain of the Straw Hat Pirates and dreams of finding the One Piece."
    ),
    Document(
        content="Goku, a Saiyan warrior, has defended Earth from numerous powerful enemies like Frieza and Cell."
    ),
    Document(
        content="Light Yagami finds a mysterious Death Note, which allows him to eliminate people by writing their names."
    ),
    Document(
        content="Levi Ackerman is humanity's strongest soldier, fighting against the Titans to protect mankind."
    ),
]

# Write the documents into the store so the BM25 retriever can search them
document_store.write_documents(documents)
Pipeline
Pipelines are the backbone of Haystack's framework. They define the flow of data between different components. A pipeline is essentially a Directed Acyclic Graph (DAG): a single component with multiple outputs can connect to another single component with multiple inputs.
You can define a pipeline like this:
pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component(
    "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434")
)
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
You can visualize the pipeline:
image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)
The pipeline provides:
- Modular workflow management
- Flexible component arrangement
- Easy debugging and monitoring
- Scalable processing architecture
Nodes
Nodes are the basic processing units that can be connected in a pipeline; these nodes are the components that perform specific tasks.
Examples of nodes from the above pipeline:
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component(
    "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434")
)
Connection Graph
The connection graph defines how components interact.
From the above pipeline, you can visualize the connection graph:
image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)
The connection graph of the anime pipeline:

This graph structure:
- Defines data flow between components
- Manages input/output relationships
- Enables parallel processing where possible
- Creates flexible processing pathways
Now we can query our anime knowledge base using a prompt.
Create a prompt template
template = """
Given only the following information, answer the question.
Ignore your own knowledge.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ query }}?
"""
This prompt will produce an answer using information from the document store.
Query using the prompt and retriever
query = "How does Goku eliminate people?"
response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}})
print(response["llm"]["replies"])
Response:

This RAG example is simple yet conceptually valuable for newcomers. Now that we have understood most of the concepts of the Haystack framework, we can dive into our main project. If anything new comes up, I will explain it along the way.
Question-Answer RAG Project for Higher Secondary Physics
We will build an NCERT Physics book-based question-answering RAG system for higher secondary students. It will answer queries using information from the NCERT books, and if the information is not there, it will search the web to find it.
For this, I will use:
- Local Llama3.2:3b or Llama3.2:1b
- ChromaDB for embedding storage
- The Nomic Embed Text model for local embeddings
- DuckDuckGo search for web search, or Tavily Search (optional)
I use a free, fully local system.
Setting Up the Developer Environment
We will set up a conda environment with Python 3.12:
$ conda create --name agenticlm python=3.12
$ conda activate agenticlm
Install Necessary Packages
$ pip install haystack-ai ollama-haystack pypdf
$ pip install chroma-haystack duckduckgo-api-haystack
Now create a project directory named qagent.
$ md qagent # create the directory
$ cd qagent # change into the directory
$ code .   # open the folder in VS Code
You can use plain Python files or a Jupyter Notebook for the project; it doesn't matter. I will use a plain Python file.
Create a main.py file at the project root.
Importing Necessary Libraries
- System packages
- Core Haystack components
- ChromaDB for the embedding store
- Ollama components for local inference
- DuckDuckGo for web search
# System packages
import os
from pathlib import Path

# Core haystack components
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.joiners import BranchJoiner
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.converters import PyPDFToDocument
from haystack.components.routers import ConditionalRouter
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# ChromaDB integration
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import (
    ChromaEmbeddingRetriever,
)

# Ollama integration
from haystack_integrations.components.embedders.ollama.document_embedder import (
    OllamaDocumentEmbedder,
)
from haystack_integrations.components.embedders.ollama.text_embedder import (
    OllamaTextEmbedder,
)
from haystack_integrations.components.generators.ollama import OllamaGenerator

# Duckduckgo search integration
from duckduckgo_api_haystack import DuckduckgoApiWebSearch
Creating a Document Store
The document store is the most important piece here; it is where we store our embeddings for retrieval. We use ChromaDB as the embedding store. As you saw in the earlier example, we used InMemoryDocumentStore for fast retrieval because our data was tiny, but for a robust retrieval system we don't rely on an in-memory store: it would hog memory and we would have to create embeddings every time we start the system.
The solution is a vector database such as Pinecone, Weaviate, Postgres Vector DB, or ChromaDB. I use ChromaDB because it is free, open-source, easy to use, and robust.
# ChromaDB integration component for the document (embedding) store
document_store = ChromaDocumentStore(persist_path="qagent/embeddings")
persist_path is where you want to store your embeddings.
PDF file paths
HERE = Path(__file__).resolve().parent
file_path = [HERE / "data" / Path(name) for name in os.listdir(HERE / "data")]
This creates a list of files from the data folder, which contains our PDF files.
Document Preprocessing Components
We will use Haystack's built-in document preprocessors, such as the cleaner, splitter, and file converter, and then use a writer to write the data into the store.
Cleaner: It cleans extra spaces, repeated lines, empty lines, and so on from the documents.
cleaner = DocumentCleaner()
Splitter: It splits the document in various ways, such as by words, sentences, paragraphs, or pages.
splitter = DocumentSplitter()
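If the defaults don't fit your documents, the splitter also accepts explicit chunking parameters; the values below are only illustrative, not recommendations from the article.
# Illustrative settings: chunk by words, ~200 words per chunk, 20-word overlap
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=20)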
File Converter: It uses pypdf to convert the PDFs to documents.
file_converter = PyPDFToDocument()
Writer: It stores the documents in the document store; for duplicate documents, it overwrites the previous one.
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)
Now set the embedder for document indexing.
Embedder: Nomic Embed Text
We will use the nomic-embed-text embedder, which is very effective and free on Hugging Face and Ollama.
Before you run your indexing pipeline, open your terminal and type the commands below to pull the nomic-embed-text and llama3.2:3b models from the Ollama model store:
$ ollama pull nomic-embed-text
$ ollama pull llama3.2:3b
Then start Ollama by typing the command ollama serve in your terminal.
Now the embedder component:
embedder = OllamaDocumentEmbedder(
    model="nomic-embed-text", url="http://localhost:11434"
)
We use the OllamaDocumentEmbedder component for embedding documents, but if you want to embed a text string, you have to use OllamaTextEmbedder.
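As a rough sketch of the text variant (model name and URL assumed to match the setup above), embedding a single query string would look something like this:
text_embedder = OllamaTextEmbedder(model="nomic-embed-text", url="http://localhost:11434")
result = text_embedder.run(text="Define resistivity in one sentence.")
# result["embedding"] holds the query vector the retriever compares against stored documents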
Creating the Indexing Pipeline
Like our earlier toy RAG example, we will start by initializing the Pipeline class.
indexing_pipeline = Pipeline()
Now we will add the components to our pipeline one by one:
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("converter", file_converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)
The order in which you add components doesn't matter, so you can add them in any order. Connecting them is what matters.
Connecting Components in the Pipeline Graph
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
Here, order matters, because how you connect the components tells the pipeline how data will flow through it. It's like plumbing: it doesn't matter in which order or from where you buy your fittings, but how you put them together decides whether you get water or not.
The converter converts the PDFs and sends them to the cleaner for cleaning. The cleaner then sends the cleaned documents to the splitter for chunking. Those chunks pass to the embedder for vectorization, and finally the embedder hands the embeddings over to the writer for storage.
Understood? Okay, let me give you a visual graph of the indexing pipeline so you can inspect the data flow.
Draw the Indexing Pipeline
image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}

indexing_pipeline.draw("indexing_pipeline.png", params=image_param)  # type: ignore
Yes, you can easily create a nice Mermaid graph from a Haystack pipeline.
Graph of the Indexing Pipeline

I assume you have now fully grasped the idea behind Haystack pipelines. Give a thank-you to your plumber.
Implement a Router
Now we need to create a router to route data down different paths. In this case, we will use a conditional router, which handles the routing based on certain conditions. The conditional router evaluates conditions based on component output. It directs data flow through different pipeline branches, which allows dynamic decision-making and gives us a robust fallback strategy.
# Conditions for routing
routes = [
    {
        "condition": "{{'no_answer' in replies[0]}}",
        "output": "{{query}}",
        "output_name": "go_to_websearch",
        "output_type": str,
    },
    {
        "condition": "{{'no_answer' not in replies[0]}}",
        "output": "{{replies[0]}}",
        "output_name": "answer",
        "output_type": str,
    },
]

# Router component
router = ConditionalRouter(routes=routes)
When the system gets a no_answer reply from the embedding-store context, it goes to the web search tool to collect relevant data from the internet.
For web search, we will use the DuckDuckGo API or Tavily; here I have used DuckDuckGo.
websearch = DuckduckgoApiWebSearch(top_k=5)
Okay, most of the heavy lifting is done. Now it's time for prompt engineering.
Create Prompt Templates
We will use the Haystack PromptBuilder component to build prompts from templates.
First, we will create a prompt for QA.
template_qa = """
Given ONLY the following information, answer the question.
If the answer is not contained within the documents, reply with "no_answer".
If the answer is contained within the documents, start the answer with "FROM THE KNOWLEDGE BASE: ".

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ query }}?
"""
It takes the context from the documents and tries to answer the question. If it doesn't find relevant context in the documents, it replies with no_answer.
Now, for the second prompt: after getting no_answer from the LLM, the system uses the web search tool to gather context from the internet.
DuckDuckGo prompt template
template_websearch = """
Answer the following query given the documents retrieved from the web.
Start the answer with "FROM THE WEB: ".

Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Query: {{query}}
"""
This lets the system go to web search and try to answer the query.
Creating prompts using PromptBuilder from Haystack
prompt_qa = PromptBuilder(template=template_qa)
prompt_builder_websearch = PromptBuilder(template=template_websearch)
We will use Haystack's branch joiner to join the two prompt branches together.
prompt_joiner = BranchJoiner(str)
Implement the Query Pipeline
The query pipeline will embed the query, gather contextual sources from the embeddings, and answer our query using the LLM or the web search tool.
It is similar to the indexing pipeline.
Initiating the Pipeline
query_pipeline = Pipeline()
Adding components to the query pipeline:
query_pipeline.add_component("text_embedder", OllamaTextEmbedder())
query_pipeline.add_component(
    "retriever", ChromaEmbeddingRetriever(document_store=document_store)
)
query_pipeline.add_component("prompt_builder", prompt_qa)
query_pipeline.add_component("prompt_joiner", prompt_joiner)

query_pipeline.add_component(
    "llm",
    OllamaGenerator(model="llama3.2:3b", timeout=500, url="http://localhost:11434"),
)
query_pipeline.add_component("router", router)
query_pipeline.add_component("websearch", websearch)
query_pipeline.add_component("prompt_builder_websearch", prompt_builder_websearch)
For LLM generation we use the OllamaGenerator component to generate answers with Llama3.2:3b or 1b, or whatever tool-calling LLM you like.
Connecting all the components together for query flow and answer generation:
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "prompt_joiner")

query_pipeline.connect("prompt_joiner", "llm")
query_pipeline.connect("llm.replies", "router.replies")
query_pipeline.connect("router.go_to_websearch", "websearch.query")
query_pipeline.connect("router.go_to_websearch", "prompt_builder_websearch.query")
query_pipeline.connect("websearch.documents", "prompt_builder_websearch.documents")
query_pipeline.connect("prompt_builder_websearch", "prompt_joiner")
To summarize the connections above:
- The embedding from the text_embedder is sent to the retriever's query embedding.
- The retriever sends documents to the prompt_builder's documents input.
- The prompt builder's output goes to the prompt joiner, where it is joined with the other prompt branch.
- The prompt joiner passes the prompt to the LLM for generation.
- The LLM's replies go to the router, which checks whether the reply contains no_answer. If it does, the query goes to the web search module.
- The router sends the query to web search as the search query, and also to the web search prompt.
- The web search documents are sent to the web search prompt's documents input.
- The web search prompt sends its output to the prompt joiner.
- And the prompt joiner sends the prompt to the LLM for answer generation.
Why not see for yourself?
Draw the Query Pipeline Graph
query_pipeline.draw("agentic_qa_pipeline.png", params=image_param)  # type: ignore
Query Graph

I know it's a huge graph, but it shows you exactly what's going on under the belly of the beast.
Now it's time to enjoy the fruit of our hard work.
Create a function for easy querying.
def get_answer(query: str):
    response = query_pipeline.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
            "router": {"query": query},
        }
    )
    return response["router"]["answer"]
It's a simple helper function for answer generation.
Now run your main script to index the NCERT physics book:
indexing_pipeline.run({"converter": {"sources": file_path}})
This is a one-time job; after indexing, you should comment out this line, otherwise it will re-index the books every time you run the script (see the guard sketch below).
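Instead of commenting the line out by hand, one option is a small guard that skips indexing when the store already holds documents; this is a sketch added here, not part of the original script, and it assumes ChromaDocumentStore's count_documents() reflects the persisted data.
# Sketch: only run indexing if the Chroma store is still empty
if document_store.count_documents() == 0:
    indexing_pipeline.run({"converter": {"sources": file_path}})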
At the bottom of the file, we write our driver code for the query:
if __name__ == "__main__":
    query = "Give me 5 MCQs on resistivity?"
    print(get_answer(query))
MCQs on resistivity from the book's knowledge

Another question that isn't in the book:
if __name__ == "__main__":
    query = "What is Photosynthesis?"
    print(get_answer(query))
Output

Let's try another question.
if __name__ == "__main__":
    query = (
        "Tell me what is DRIFT OF ELECTRONS AND THE ORIGIN OF RESISTIVITY from the book"
    )
    print(get_answer(query))

So, it's working! We can use more data, books, or PDFs for embedding, which will produce more context-aware answers. Also, LLMs such as GPT-4o, Anthropic's Claude, or other cloud LLMs will do the job even better.
Conclusion
Our agentic RAG system demonstrates the flexibility and robustness of the Haystack framework and its power of combining components and pipelines. This RAG can be made production-ready by deploying it to a web service platform and by using a better paid LLM such as those from OpenAI or Anthropic. You can build a UI using Streamlit or a React-based web SPA for a better user experience.
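For example, swapping the local generator for a hosted one is typically a small change; the sketch below assumes Haystack's OpenAI generator and an OPENAI_API_KEY environment variable, and is not part of the original code.
from haystack.components.generators import OpenAIGenerator

# Sketch: a hosted generator that could stand in for OllamaGenerator in the query pipeline
# (it reads the OPENAI_API_KEY environment variable by default)
llm = OpenAIGenerator(model="gpt-4o-mini")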
You can find all the code used in this article here.
Key Takeaways
- Agentic RAG systems provide more intelligent and flexible responses than traditional RAG.
- Haystack's pipeline architecture enables complex, modular workflows.
- Routers enable dynamic decision-making during response generation.
- Connection graphs provide flexible and maintainable component interactions.
- Integrating multiple knowledge sources enhances response quality.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
Frequently Asked Questions
A. The system uses its router component to automatically fall back to web search when local knowledge is insufficient, ensuring comprehensive coverage.
A. The pipeline architecture allows modular development, easy testing, and flexible component arrangement, making the system maintainable and extensible.
A. The connection graph enables complex data flows and parallel processing, improving system efficiency and flexibility in handling different types of queries.
A. Yes, it is very easy: just install the necessary integration package for the respective LLM API, such as Gemini, Anthropic, or Groq, and use it with your API keys.