Wednesday, January 22, 2025

AI Agent Using Multimodal Approach


In this blog, I want to present a small agent built with `LangGraph` and Google Gemini for research purposes. The objective is to demonstrate a research agent (Paper-to-Voice Assistant) that plans how to summarize a research paper. This tool uses a vision model to infer the information. The method only identifies the steps and their sub-steps and tries to get answers for those action items. Finally, all the answers are converted into a conversation in which two people discuss the paper. You can consider this a mini version of Google’s NotebookLM.

To elaborate further, I’m using a single unidirectional graph where communication between steps happens from top to bottom. I’ve also used conditional node connections to process repeated jobs.

  • A process to build simple agents with the help of LangGraph
  • Multimodal conversation with the Google Gemini LLM

Paper-to-Voice Assistant: Map-reduce in Agentic AI

Imagine a problem so large that a single person would take several days or months to solve it. Now, picture a team of skilled people, each given a specific section of the task to work on. They might start by sorting the plans by objective or complexity, then gradually piece together smaller sections. Once each solver has completed their section, they combine their solutions into the final one.

This is essentially how map-reduce works in Agentic AI. The main “problem” gets divided into sub-problem statements. The “solvers” are individual LLMs, and each sub-plan is mapped to a different solver. Each solver works on its assigned sub-problem, performing calculations or inferring information as needed. Finally, the results from all “solvers” are combined (reduced) to produce the final output.
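
The map and reduce phases described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `solve` is a hypothetical stand-in for an LLM call, and `map_reduce` is a name introduced here, not part of LangGraph.

```python
from concurrent.futures import ThreadPoolExecutor


def solve(sub_problem: str) -> str:
    """Hypothetical stand-in for one LLM "solver"; a real agent would call a model here."""
    return f"answer to: {sub_problem}"


def map_reduce(problem: str, sub_problems: list[str]) -> str:
    # Map: hand every sub-problem to its own solver, in parallel.
    with ThreadPoolExecutor() as pool:
        partial_answers = list(pool.map(solve, sub_problems))
    # Reduce: merge the partial answers into a single final output.
    body = "\n".join(f"- {answer}" for answer in partial_answers)
    return f"{problem}\n{body}"


report = map_reduce(
    "Summarize the paper",
    ["What problem is addressed?", "What method is proposed?"],
)
print(report)
```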

From Automation to Assistance: The Evolving Role of AI Agents

Map-Reduce in Agentic conversation

Following recent advances in generative AI, LLM agents have become quite popular, and people are taking advantage of their capabilities. Some argue that agents can automate a process end to end. However, I view them as productivity enablers. They can assist in problem-solving, designing workflows, and enabling humans to focus on the critical parts. For example, agents can serve as automated provers exploring the space of mathematical proofs. They can offer new perspectives and ways of thinking beyond “human-generated” proofs.

Another recent example is the AI-enabled Cursor Studio. Cursor is a managed environment similar to VS Code that supports programming assistance.

Agents are also becoming more capable of planning and taking action, resembling the way humans reason, and most importantly, they can adjust their strategies. They are quickly improving in their ability to analyze tasks, develop plans to complete them, and refine their approach through repeated self-critique. Some methods involve keeping humans in the loop, where agents seek guidance from humans at intervals and then proceed based on those instructions.

Agents in the field

What Is Not Included?

  • I have not included any tools such as search or custom functions to make the agent more advanced.
  • No routing approach or reverse connection is developed.
  • No branching techniques are used for parallel processing or conditional jobs.
  • Similar things can be done by loading the PDF and parsing its images and graphs.
  • Only 3 images are used in a single prompt.

Python Libraries Used

  • langchain-google-genai: To connect LangChain with Google generative AI models
  • python-dotenv: To load secret keys or any environment variables
  • langgraph: To construct the agents
  • pypdfium2 & pillow: To convert PDF pages into images
  • pydub: To segment the audio
  • gradio_client: To call the HF 🤗 model

Paper-to-Voice Assistant: Practical Implementation

Here’s the implementation:

Load the Supporting Libraries

from dotenv import dotenv_values
from langchain_core.messages import HumanMessage
import os
from langchain_google_genai import ChatGoogleGenerativeAI

from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.graph.message import add_messages
from langgraph.constants import Send
import pypdfium2 as pdfium
import json
from PIL import Image

import operator
from typing import Annotated, TypedDict  # , Optional, List
from langchain_core.pydantic_v1 import BaseModel  # , Field

Load Environment Variables

config = dotenv_values("../.env")
os.environ["GOOGLE_API_KEY"] = config['GEMINI_API']

Currently, I’m using a multimodal approach in this project. To achieve this, I load a PDF file and convert each page into an image. These images are then fed into the Gemini vision model for conversational purposes.

The following code demonstrates how to load a PDF file, convert each page into an image, and save these images to a directory.

pdf = pdfium.PdfDocument("./pdf_image/vision.pdf")

for i in range(len(pdf)):
    page = pdf[i]
    # Render at 4x scale so small text stays legible for the vision model
    image = page.render(scale=4).to_pil()
    image.save(f"./pdf_image/vision_P{i:03d}.jpg")

Let’s display one page for reference.

image_path = "./pdf_image/vision_P000.jpg"
img = Image.open(image_path)
img

Output:

research paper

Google Vision Model

Connect to the Gemini model through an API key. The following are the different variants available. One important thing to note is that we should select a model that supports the data type we are working with. Since I’m working with images in the conversation, I needed to opt for either the Gemini 1.5 Pro or Flash models, as these variants support image data.

Google LLM stack
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash-001", # "gemini-1.5-pro",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

Let’s move forward by building the schema to manage the output and transfer information between nodes. The add operator is used to merge all the steps and sub-steps that the agent creates.

class State(TypedDict):
    image_path: list  # Paths of all the page images
    steps: Annotated[list, operator.add]  # All the steps generated by the agent

    substeps: Annotated[list, operator.add]  # All the sub-steps generated by the agent for every step
    solutions: Annotated[list, operator.add]  # All the solutions generated by the agent for each step
    content: str  # The content of the paper
    plan: list  # The processed plan (a list of steps with their sub-steps)
    Dialog: Annotated[list, operator.add]  # All the generated dialogs
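
To make the role of `Annotated[list, operator.add]` concrete, here is a simplified sketch of how a reducer attached to a key could merge partial updates coming from several nodes. It is an illustration only and does not reproduce LangGraph internals; `DemoState` and `apply_update` are hypothetical names introduced for this example.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    plan: str                                 # plain key: the last write wins
    solutions: Annotated[list, operator.add]  # reducer key: updates are merged


def apply_update(state: dict, update: dict) -> dict:
    """Merge one node's partial update into the state (simplified illustration)."""
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated types expose their extra arguments through __metadata__.
        reducer = getattr(hints[key], "__metadata__", (None,))[0]
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"plan": "", "solutions": []}
state = apply_update(state, {"solutions": ["answer for step 1"]})
state = apply_update(state, {"solutions": ["answer for step 2"], "plan": "parsed"})
print(state)
```

With a reducer in place, nodes that run in parallel can each return a one-element list and the results accumulate instead of overwriting one another.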

The schema to control the output of the first step.

class Task(BaseModel):
    task: str

This schema stores the output of each sub-step based on the tasks identified in the previous step.

class SubStep(BaseModel):
    substep: str

Schema to control the output during conditional looping.

class StepState(TypedDict):
    step: str
    image_path: str
    solutions: str
    Dialog:  str

Step 1: Generate Tasks

In our first step, we pass the images to the LLM and instruct it to identify all the plans it intends to execute to understand the research paper fully. I provide multiple pages at once, asking the model to generate a combined plan based on all the images.

def generate_steps(state: State):
    prompt = """
    Imagine you are a research scientist in artificial intelligence who is an expert in understanding research papers.
    You will be given a research paper and you need to identify all the steps a researcher needs to perform.
    Identify each step and its substeps.
    """
    # One text part followed by one image part per page
    message = HumanMessage(content=[{'type':'text','text':prompt},
                                    *[{"type":'image_url','image_url':img} for img in state['image_path']]
                                    ]

                                    )
    response = llm.invoke([message])
    # Return the plan as plain text so the next node can embed it in its prompt
    return {"content": response.content, "image_path": state['image_path']}

Step 2: Plan Parsing

In this step, we take the plan identified in the first step and instruct the model to convert it into a structured format. I have defined the schema and included it in the prompt. It is important to note that external parsing tools can be used to transform the plan into a proper data structure. To improve robustness, you could use “Tools” to parse the data or create a more rigorous schema as well.

def markdown_to_json(state: State):
    prompt ="""
    You are given markdown content and you need to parse this data into JSON format. Follow the key and value
    pairs correctly for each bullet point.
    Follow the following schema strictly.

    schema:
    [
    {
      "step": "description of step 1 ",
      "substeps": [
        {
          "key": "title of sub step 1 of step 1",
          "value": "description of sub step 1 of step 1"
        },
        {
          "key": "title of sub step 2 of step 1",
          "value": "description of sub step 2 of step 1"
        }]},
        {
      "step": "description of step 2",
      "substeps": [
        {
          "key": "title of sub step 1 of step 2",
          "value": "description of sub step 1 of step 2"
        },
        {
          "key": "title of sub step 2 of step 2",
          "value": "description of sub step 2 of step 2"
        }]}]

    Content:
    %s
    """% state['content']
    str_response = llm.invoke([prompt])
    return({'content':str_response.content,"image_path":state['image_path']})

Step 3: Text to JSON

In the third step, we take the plan produced in the second step and convert it into JSON. Please note that this method may not always work if the LLM violates the schema structure.

def parse_json(state: State):
    # Drop the leading "```json" (7 characters) and the trailing "```" (3 characters)
    str_response_json = json.loads(state['content'][7:-3].strip())
    output = []
    for step in str_response_json:
        substeps = []
        for item in step['substeps']:
            for k, v in item.items():
                # Keep only the description of each sub-step
                if k == 'value':
                    substeps.append(v)
        output.append({'step':step['step'],'substeps':substeps})
    return({"plan":output})
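
The fixed slice in `parse_json` assumes the reply always starts with a JSON code fence and ends with a closing fence. A slightly more tolerant parser could look like the sketch below; `extract_json` is a hypothetical helper, not part of the original project, and it still fails (with a clear error) when the model returns malformed JSON.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json(text: str):
    """Parse JSON from a model reply, with or without a Markdown code fence."""
    fenced = FENCED_BLOCK.search(text)
    # Prefer the body of the first fenced block; otherwise try the whole reply.
    candidate = fenced.group(1) if fenced else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError as err:
        raise ValueError(f"Reply is not valid JSON: {err}") from err


reply = FENCE + 'json\n[{"step": "Read the abstract", "substeps": []}]\n' + FENCE
print(extract_json(reply))
```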

Step 4: Solution for Each Step

This step finds the solution for every step. It takes one plan and combines all of its sub-plans into question-and-answer pairs in a single prompt. The LLM then identifies the solution for every sub-step.

This step also combines the multiple outputs into one with the help of the Annotated type and the “add” operator. For now, this step works with fair quality. However, it can be improved by using branching and translating every sub-step into a proper reasoning prompt.

Essentially, every sub-step should be translated into a chain of thought so that the LLM can prepare a solution. One can use ReAct as well.

def sovle_substeps(state: StepState):
    print(state)
    inp = state['step']
    print('solving sub steps')
    # One "Question: ... Answer:" slot per sub-step for the model to fill in
    qanda = " ".join([f'\n Question: {substep} \n Answer:' for substep in inp['substeps']])
    prompt = f""" You will be given an instruction to analyze research papers. You need to understand the
    instruction and solve all the questions mentioned in the list.
    Keep the pair of Question and its answer in your response. Your response should be next to the keyword "Answer"

    Instruction:
    {inp['step']}
    Questions:
    {qanda}
    """
    message = HumanMessage(content=[{'type':'text','text':prompt},
                                    *[{"type":'image_url','image_url':img} for img in state['image_path']]
                                    ]
                                    )
    response = llm.invoke([message])
    return {"steps":[inp['step']], 'solutions':[response.content]}

Step 5: Conditional Loop

This step is essential for managing the flow of the conversation. It involves an iterative process that maps out the generated plans (steps) and continuously passes information from one node to another in a loop. The loop terminates once all the plans have been executed. Currently, this step handles one-way communication between nodes, but if there is a need for bi-directional communication, we would need to consider implementing different branching strategies.

def continue_to_substeps(state: State):
    steps = state['plan']
    return [Send("sovle_substeps", {"step": s,'image_path':state['image_path']}) for s in steps]

Step 6: Voice

After all the answers are generated, the following code turns them into a dialogue and then combines everything into a podcast featuring two people discussing the paper. The following prompt is taken from here.

SYSTEM_PROMPT = """
You are a world-class podcast producer tasked with transforming the provided input text into an engaging and informative podcast script. The input may be unstructured or messy, sourced from PDFs or web pages. Your goal is to extract the most interesting and insightful content for a compelling podcast discussion.

# Steps to Follow:

1. **Analyze the Input:**
   Carefully examine the text, identifying key topics, points, and interesting facts or anecdotes that
   could drive an engaging podcast conversation. Disregard irrelevant information or formatting issues.

2. **Brainstorm Ideas:**
   In the `<scratchpad>`, creatively brainstorm ways to present the key points engagingly. Consider:
   - Analogies, storytelling techniques, or hypothetical scenarios to make content relatable
   - Ways to make complex topics accessible to a general audience
   - Thought-provoking questions to explore during the podcast
   - Creative approaches to fill any gaps in the information

3. **Craft the Dialogue:**
   Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic). Incorporate:
   - The best ideas from your brainstorming session
   - Clear explanations of complex topics
   - An engaging and lively tone to captivate listeners
   - A balance of information and entertainment

   Rules for the dialogue:
   - The host (Jane) always initiates the conversation and interviews the guest
   - Include thoughtful questions from the host to guide the discussion
   - Incorporate natural speech patterns, including occasional verbal fillers (e.g., "um," "well," "you know")
   - Allow for natural interruptions and back-and-forth between host and guest
   - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims
   - Maintain a PG-rated conversation appropriate for all audiences
   - Avoid any marketing or self-promotional content from the guest
   - The host concludes the conversation

4. **Summarize Key Insights:**
   Naturally weave a summary of key points into the closing part of the dialogue. This should feel like a casual conversation rather than a formal recap, reinforcing the main takeaways before signing off.

5. **Maintain Authenticity:**
   Throughout the script, strive for authenticity in the conversation. Include:
   - Moments of genuine curiosity or surprise from the host
   - Instances where the guest might briefly struggle to articulate a complex idea
   - Light-hearted moments or humor when appropriate
   - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)

6. **Consider Pacing and Structure:**
   Ensure the dialogue has a natural ebb and flow:
   - Start with a strong hook to grab the listener's attention
   - Gradually build complexity as the conversation progresses
   - Include brief "breather" moments for listeners to absorb complex information
   - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners
"""
def generate_dialog(state):
    text = state['text']
    tone = state['tone']
    length = state['length']
    language = state['language']

    modified_system_prompt = SYSTEM_PROMPT
    modified_system_prompt += "\nPLEASE paraphrase the following TEXT in dialog format."

    if tone:
        modified_system_prompt += f"\n\nTONE: The tone of the podcast should be {tone}."
    if length:
        length_instructions = {
            "Short (1-2 min)": "Keep the podcast brief, around 1-2 minutes long.",
            "Medium (3-5 min)": "Aim for a moderate length, about 3-5 minutes.",
        }
        modified_system_prompt += f"\n\nLENGTH: {length_instructions[length]}"
    if language:
        modified_system_prompt += (
            f"\n\nOUTPUT LANGUAGE <IMPORTANT>: The podcast should be {language}."
        )

    messages = modified_system_prompt + '\nTEXT: ' + text

    response = llm.invoke([messages])
    return {"Step":[state['step']],"Finding":[state['text']], 'Dialog':[response.content]}

def continue_to_substeps_voice(state: State):
    print('voice substeps')

    solutions = state['solutions']
    steps = state['steps']

    tone = "Formal"  # ["Fun", "Formal"]
    # Fan out: one generate_dialog call per (step, solution) pair
    return [Send("generate_dialog", {"step":st,"text": s,'tone':tone,'length':"Short (1-2 min)",'language':"EN"}) for st,s in zip(steps,solutions)]

Step 7: Graph Construction

Now it’s time to assemble all the steps we’ve outlined:

  • Initialize the Graph: Start by initializing the graph and defining all the necessary nodes.
  • Define and Connect Nodes: Connect the nodes using edges, ensuring the flow of information from one node to another.
  • Introduce the Loop: Implement the loop as described in step 5, allowing for iterative processing of the plans.
  • Terminate the Process: Finally, use the END node of LangGraph to close and terminate the process properly.

The following code builds and compiles the graph.

graph = StateGraph(State)
graph.add_node("generate_steps", generate_steps)
graph.add_node("markdown_to_json", markdown_to_json)
graph.add_node("parse_json", parse_json)
graph.add_node("sovle_substeps", sovle_substeps)
graph.add_node("generate_dialog", generate_dialog)

graph.add_edge(START, "generate_steps")
graph.add_edge("generate_steps", "markdown_to_json")
graph.add_edge("markdown_to_json", "parse_json")
graph.add_conditional_edges("parse_json", continue_to_substeps, ["sovle_substeps"])
graph.add_conditional_edges("sovle_substeps", continue_to_substeps_voice, ["generate_dialog"])
graph.add_edge("generate_dialog", END)
app = graph.compile()

It’s time to display the network for better understanding.

from IPython.display import Image, display

try:
    display(Image(app.get_graph().draw_mermaid_png()))
except Exception:
    print('There seems to be an error')
Workflow for Agent

Let’s begin the process and evaluate how the current framework operates. For now, I’m testing with only three images, though this can be extended to more images. The code takes images as input and passes this data to the graph’s entry point. Subsequently, each step takes the input from the previous step and passes its output to the next step until the process is completed.

output = []
for s in app.stream({"image_path": [f"./pdf_image/vision_P{i:03d}.jpg" for i in range(6)]}):
    output.append(s)

with open('./data/output.json','w') as f:
    json.dump(output,f)

Output:

Output
with open('./data/output.json','r') as f:
    output1 = json.load(f)
print(output1[10]['generate_dialog']['Dialog'][0])

Output:

Output
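
The lookup above relies on a fixed position (`output1[10]`), which shifts whenever the number of planned steps changes. A small sketch of an alternative is to group the saved chunks by node name. It assumes, as the saved output suggests, that every streamed chunk is a dict keyed by the name of the node that produced it; `collect_by_node` and the sample chunks are hypothetical.

```python
from collections import defaultdict


def collect_by_node(chunks: list[dict]) -> dict[str, list]:
    """Group streamed updates by the name of the node that produced them."""
    grouped = defaultdict(list)
    for chunk in chunks:
        for node_name, update in chunk.items():
            grouped[node_name].append(update)
    return dict(grouped)


# Hypothetical chunks shaped like the saved output: one {node: update} dict each.
sample = [
    {"generate_steps": {"content": "..."}},
    {"sovle_substeps": {"steps": ["s1"], "solutions": ["a1"]}},
    {"sovle_substeps": {"steps": ["s2"], "solutions": ["a2"]}},
    {"generate_dialog": {"Dialog": ["**Jane:** Welcome!"]}},
]
by_node = collect_by_node(sample)
print(by_node["generate_dialog"][0]["Dialog"][0])
```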

I’m using the following code to stream the results sequentially, where each plan and sub-plan is mapped to a corresponding response.

import sys
import time
def printify(tx,stream=False):
    if stream:
        for char in tx:
            sys.stdout.write(char)
            sys.stdout.flush()
            time.sleep(0.05)
    else:
        print(tx)
topics = []
substeps = []
for idx, plan in enumerate(output1[2]['parse_json']['plan']):
    topics.append(plan['step'])
    substeps.append(plan['substeps'])

Let’s separate the answers of every sub-step.

text_planner = {}
stream = False

for topic, substep, responses in zip(topics, substeps, output1[3:10]):
    response = responses['sovle_substeps']
    response_topic = response['steps']
    if topic in response_topic:
        # Split the model response on the "Answer:" keyword and drop empty pieces
        answer = response['solutions'][0].strip().split('Answer:')
        answer = [ans.strip() for ans in answer if len(ans.strip()) > 0]
        for q, a in zip(substep, answer):
            printify(f'Sub-Step : {q}\n', stream=stream)
            printify(f'Action : {a}\n', stream=stream)
        text_planner[topic] = {'answer': list(zip(substep, answer))}
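
Splitting on the `Answer:` keyword depends on the exact layout of the model response. As a hedged alternative, the question and answer pairs can be pulled out together with a regular expression. The sketch assumes the response keeps the plain `Question:` and `Answer:` labels requested in the prompt; `split_question_answers` is a hypothetical helper and would need adjusting if the model decorates the labels (for example with bold markers).

```python
import re

QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.*?)\s*Answer:\s*(?P<answer>.*?)(?=\s*Question:|\Z)",
    re.DOTALL,
)


def split_question_answers(response: str) -> list[tuple[str, str]]:
    """Return the (question, answer) pairs found in a model response."""
    return [
        (match["question"], match["answer"].strip())
        for match in QA_PATTERN.finditer(response)
    ]


sample = "Question: What is X?\nAnswer: X is a thing.\n\nQuestion: Why?\nAnswer: Because.\n"
for question, answer in split_question_answers(sample):
    print(f"Sub-Step : {question}")
    print(f"Action : {answer}")
```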

The output of sub-steps and their actions

Output
Step to Sub-step

The last node in the graph converts all the answers into dialog. Let’s store them in separate variables so that we can convert them into voice.

stream = False

dialog_planner = {}
for topic, responses in zip(topics, output1[10:17]):
    dialog = responses['generate_dialog']['Dialog'][0]
    # Keep only the script body and give the guest a fixed name
    dialog = dialog.strip().split('## Podcast Script')[-1].strip()
    dialog = dialog.replace('[Guest Name]', 'Robin').replace('**Guest:**', '**Robin:**')
    printify(f'Dialog: : {dialog}\n', stream=stream)
    dialog_planner[topic] = dialog
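
Before converting the script to audio, it can help to split it into explicit speaker turns. The sketch below assumes each spoken line starts with a bold speaker label such as `**Jane:**`, which matches the markers used in this post; `split_turns` is a hypothetical helper.

```python
import re

TURN_PATTERN = re.compile(r"^\*\*(?P<speaker>[^*:]+):\*\*\s*(?P<line>.+)$")


def split_turns(script: str) -> list[tuple[str, str]]:
    """Split a podcast script into (speaker, spoken line) tuples."""
    turns = []
    for raw in script.splitlines():
        match = TURN_PATTERN.match(raw.strip())
        if match:  # skip blank lines, headings, and stage directions
            turns.append((match["speaker"].strip(), match["line"].strip()))
    return turns


script = "**Jane:** Welcome to the show.\n\n**Robin:** Thanks, Jane.\n(music fades)"
print(split_turns(script))
```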

Dialog Output

Output
Substep – Dialog

Dialog to Voice

from pydantic import BaseModel, Field
from typing import List, Literal, Tuple, Optional
import glob
import os
import time
from pathlib import Path
from tempfile import NamedTemporaryFile

from scipy.io.wavfile import write as write_wav
import requests
from pydub import AudioSegment
from gradio_client import Client
import json

from tqdm.notebook import tqdm
import sys
from time import sleep
with open('./data/dialog_planner.json','r') as f:
    dialog_planner1 = json.load(f)

Now, let’s convert these text dialogs into voice.

1. Text to Speech

For now, I’m accessing the TTS model from the HF endpoint, and for that, I need to set the URL and API key.

HF_API_URL = config['HF_API_URL']
HF_API_KEY = config['HF_API_KEY']
headers = {"Authorization": HF_API_KEY}

2. Gradio Client

I’m using the Gradio client to call the model and set the following path so the Gradio client can save the audio data in the specified directory. If no path is defined, the client stores the audio in a temporary directory.

os.environ['GRADIO_TEMP_DIR'] = "path_to_data_dir"
hf_client = Client("mrfakename/MeloTTS")
hf_client.output_dir = "path_to_data_dir"

def get_text_to_voice(text, speed, accent, language):

    # Returns the path of the audio file generated by the TTS model
    file_path = hf_client.predict(
                    text=text,
                    language=language,
                    speaker=accent,
                    speed=speed,
                    api_name="/synthesize",
                )
    return(file_path)

3. Voice Accent

To generate the dialog for the podcast, I’m assigning two different accents: one for the host and another for the guest.

def generate_podcast_audio(text: str, language: str) -> str:

    if "**Jane:**" in text:  # host
        text = text.replace("**Jane:**",'').strip()
        accent = "EN-US"
        speed = 0.9
    elif "**Robin:**" in text:  # guest
        text = text.replace("**Robin:**",'').strip()
        accent = "EN_INDIA"
        speed = 1
    else:
        return 'Empty Text'
    for attempt in range(3):

        try:
            file_path = get_text_to_voice(text, speed, accent, language)
            return file_path
        except Exception as e:
            if attempt == 2:  # Last attempt
                raise  # Re-raise the last exception if all attempts fail
            time.sleep(1)  # Wait for 1 second before retrying
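
The retry loop above waits a fixed second between attempts. A reusable variant with exponentially growing waits is sketched below; this is a swapped-in technique rather than the code used in this project, and `call_with_retry` and `flaky` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call func, retrying on any exception with exponentially growing waits."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky() -> str:
    """Fails twice, then succeeds; stands in for an unreliable remote call."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "audio.wav"


print(call_with_retry(flaky, attempts=3, base_delay=0.01))
```

In the pipeline, the remote call could be wrapped as `call_with_retry(lambda: get_text_to_voice(text, speed, accent, language))`.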

4. Store Voice in an MP3 File

Each audio clip will be under 2 minutes in length. This section generates audio for each short dialog, and the files are saved in the specified directory.

def store_voice(topic_dialog):
    audio_path = []
    item = 0
    for topic, dialog in tqdm(topic_dialog.items()):
        dialog_speaker = dialog.split("\n")
        for speaker in tqdm(dialog_speaker):
            one_dialog = speaker.strip()

            language_for_tts = "EN"

            if len(one_dialog) > 0:
                audio_file_path = generate_podcast_audio(
                one_dialog, language_for_tts
                )
                audio_path.append(audio_file_path)
                # continue
            sleep(5)  # pause between requests to the hosted model
        break  # only the first topic is converted to audio here
    return(audio_path)
audio_paths = store_voice(topic_dialog=dialog_planner1)

5. Combined Audio

Finally, let’s combine all the short audio clips to create a longer conversation.

def consolidate_voice(audio_paths,voice_dir):
    audio_segments =[]
    voice_path = [paths for paths in audio_paths if paths!='Empty Text']

    audio_segment = AudioSegment.from_file(voice_dir+"/light-guitar.wav")
    audio_segments.append(audio_segment)

    for audio_file_path in tqdm(voice_path):
        audio_segment = AudioSegment.from_file(audio_file_path)
        audio_segments.append(audio_segment)

    audio_segment = AudioSegment.from_file(voice_dir+"/ambient-guitar.wav")
    audio_segments.append(audio_segment)

    combined_audio = sum(audio_segments)
    temporary_directory = voice_dir+"/tmp/"
    os.makedirs(temporary_directory, exist_ok=True)

    temporary_file = NamedTemporaryFile(
        dir=temporary_directory,
        delete=False,
        suffix=".mp3",
    )
    combined_audio.export(temporary_file.name, format="mp3")
consolidate_voice(audio_paths=audio_paths,voice_dir="./data")
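
If the clips are plain WAV files and MP3 export is not required, the same concatenation idea can be sketched with only the standard library. This is a different technique from the pydub approach above and assumes every clip shares the same channel count, sample width, and frame rate; `concat_wav` is a hypothetical helper.

```python
import wave


def concat_wav(paths: list[str], out_path: str) -> None:
    """Concatenate WAV clips that share channels, sample width, and frame rate."""
    if not paths:
        raise ValueError("no input clips")
    with wave.open(paths[0], "rb") as first:
        fmt = first.getparams()[:3]  # (nchannels, sampwidth, framerate)
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        for path in paths:
            with wave.open(path, "rb") as clip:
                if clip.getparams()[:3] != fmt:
                    raise ValueError(f"{path} has a different audio format")
                # Append the raw frames of this clip to the output file.
                out.writeframes(clip.readframes(clip.getnframes()))
```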


Conclusion

Overall, this project is intended purely for demonstration purposes and would require significant effort and process changes to create a production-ready agent. It can serve as a proof of concept (POC) with minimal effort. In this demonstration, I have not accounted for factors like time complexity, cost, and accuracy, which are critical considerations in a production environment. With that said, I’ll conclude here. Thanks for reading. For more technical details, please refer to GitHub.

Frequently Asked Questions

Q1. What is the main purpose of the Paper-to-Voice Assistant?

Ans. The Paper-to-Voice Assistant is designed to simplify the process of summarizing research papers by converting the extracted information into a conversational format, enabling easy understanding and accessibility.

Q2. How does the Paper-to-Voice Assistant work?

Ans. The assistant uses a map-reduce approach, where it breaks down the research paper into steps and sub-steps, processes the information using LangGraph and Google Gemini LLMs, and then combines the results into a coherent dialogue.

Q3. What technologies are used in this project?

Ans. The project uses LangGraph, Google Gemini generative AI models, multimodal processing (vision and text), and text-to-speech conversion with the help of Python libraries like pypdfium2, pillow, pydub, and gradio_client.

Q4. Can the agent process complex research papers with images and diagrams?

Ans. Yes, the agent can analyze PDFs containing images by converting each page into an image and feeding the images into the vision model, allowing it to extract visual and textual information.

Q5. Is the current implementation production-ready?

Ans. No, this project is a proof of concept (POC) that demonstrates the workflow. Further optimization and handling of time complexity, cost, and accuracy are needed to make it production-ready.
