
Andrew Ng’s VisionAgent: Streamlining Vision AI Solutions


Model                             Recall    Precision    F1 Score
Landing AI                        77.0%     82.6%        79.7% (highest)
Microsoft Florence-2              43.4%     36.6%        39.7%
Google OWLv2                      81.0%     29.5%        43.2%
Alibaba Qwen2.5-VL-7B-Instruct    26.0%     54.0%        35.1%

4. Key Takeaways

  • Landing AI’s Agentic Object Detection achieved the highest F1 Score (79.7%), meaning it balances precision and recall better than the others.
  • Google OWLv2 had the highest recall (81.0%), meaning it detects the most objects but has lower precision.
  • Alibaba’s Qwen2.5-VL-7B-Instruct had relatively low recall but moderate precision.
  • Microsoft Florence-2 showed lower performance across all metrics compared to the other models.

Landing AI’s approach appears to offer the most balanced and effective performance in object detection tasks, as reflected in its highest F1 score. The other models have trade-offs, with some excelling in recall but lacking in precision, or vice versa.
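As a quick sanity check, the F1 values in the table follow directly from each model’s precision and recall, since F1 is the harmonic mean of the two. The short Python snippet below reproduces two of the table entries from their precision/recall figures:

def f1_score(precision: float, recall: float) -> float:
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(82.6, 77.0), 1))  # 79.7 -> Landing AI
print(round(f1_score(29.5, 81.0), 1))  # 43.2 -> Google OWLv2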

Practical Working of VisionAgent

VisionAgent follows a structured process to produce outputs effectively:

  • Upload an image you want to analyze
  • Write a prompt (e.g., “detect people with glasses”)
  • Our AI agent analyzes the image thoroughly
  • Receive detection results for your image

1. Prompt: “Detect vegetables in and around the basket”

Step 1: Conversation

A user initiates a request, describing the vision task in natural language. VisionAgent interprets this prompt and confirms the task.

Input Image

Conversation

“I’ll help you write code to detect vegetables in and around the basket in the image. I’ll use object detection to identify various vegetables in the scene.”

Step 2: Planner

I need to first understand what’s in the image and get suggestions on how to detect vegetables in and around the basket. I’ll start by loading the image and using the claude35_vqa and suggestion functions.

Planning Steps

  • Understand the image contents using VQA (Visual Question Answering).
  • Generate relevant suggestions for the best detection approach.
  • Select tools that suit the task, such as object detection and color-based classification.

Example Plan

  • Identify all objects in the image.
  • Determine which objects are vegetables.
  • Differentiate vegetables inside and outside the basket.
  • Use a nested detection structure to analyze the spatial relationships (a minimal sketch of this idea follows below).
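To make the plan concrete, here is a minimal, self-contained sketch of the inside/outside step. The bounding boxes and labels are invented for illustration, and the containment test (checking whether a vegetable’s box center falls inside the basket’s box) is just one simple way to encode the spatial relationship; VisionAgent may plan it differently:

# Illustrative sketch only: split vegetables into "in the basket" vs. "around the basket"
# using normalized [x1, y1, x2, y2] boxes. All values below are made up.
basket_bbox = [0.30, 0.25, 0.75, 0.80]
vegetables = [
    {"label": "tomato", "bbox": [0.40, 0.30, 0.50, 0.45]},
    {"label": "carrot", "bbox": [0.05, 0.60, 0.20, 0.90]},
]

def center(bbox):
    return ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)

def inside(bbox, container):
    cx, cy = center(bbox)
    return container[0] <= cx <= container[2] and container[1] <= cy <= container[3]

in_basket = [v for v in vegetables if inside(v["bbox"], basket_bbox)]
around_basket = [v for v in vegetables if not inside(v["bbox"], basket_bbox)]
print([v["label"] for v in in_basket], [v["label"] for v in around_basket])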

Execution:

image = load_image('vegetablebasket.jpg')
claude35_vqa('Can you describe this image? What vegetables do you see in and around the basket?', [image])
suggestion('How can I detect vegetables in and around the basket in this image?', [image])

Observation

[claude35_vqa output]
This image shows a beautiful array of fresh vegetables arranged around and in a wicker basket. The vegetables include:

1. Fresh carrots with green tops
2. Cauliflower
3. Onions
4. Potatoes
5. Broccoli
6. Eggplant
7. Red tomatoes
8. Yellow squash/zucchini
9. Purple cabbage
10. Green bell pepper
11. Red bell pepper
12. Cucumbers
13. Fresh garlic
14. Fresh herbs (appears to be parsley)
15. Green banana peppers or chili peppers

The vegetables appear to be very fresh and vibrant in color, arranged on what looks like a wooden surface or cutting board. The composition shows a nice variety of colors, from deep purples to bright reds, oranges, and greens, making it visually appealing and suggesting a healthy, nutritious selection of produce.
[end of claude35_vqa output]

[suggestions]

1. The vegetables are both in and around a basket, so relative_position would help in determining their locations with respect to the basket as an anchor point.

2. nested_structure applies because some vegetables are nested/contained within the basket while others are arranged outside it.

3. color would be useful since the vegetables are distinctly colored (red tomatoes, orange carrots, green broccoli, purple cabbage, etc.), making color a key identifier.

4. The problem doesn't involve small text, grid patterns, tracking in video, comparing sizes specifically, or dealing with extremely small objects in a large image. The depth position is visible but not critical to the detection task.


[suggestion 0]
You may need to find the color of objects in the image; you can use the following code to help with this task:

import numpy as np
import cv2

color_ranges = {
    "red_lower": ((0, 100, 100), (int(179 * 20 / 360), 255, 255)),
    "orange": ((int(179 * 21 / 360), 100, 100), (int(179 * 50 / 360), 255, 255)),
    "yellow": ((int(179 * 51 / 360), 100, 100), (int(179 * 70 / 360), 255, 255)),
    "inexperienced": ((int(179 * 71 / 360), 100, 100), (int(179 * 150 / 360), 255, 255)),
    "cyan": ((int(179 * 151 / 360), 100, 100), (int(179 * 180 / 360), 255, 255)),
    "blue": ((int(179 * 181 / 360), 100, 100), (int(179 * 265 / 360), 255, 255)),
    "purple": ((int(179 * 266 / 360), 100, 100), (int(179 * 290 / 360), 255, 255)),
    "pink": ((int(179 * 291 / 360), 100, 100), (int(179 * 330 / 360), 255, 255)),
    "red_upper": ((int(179 * 331 / 360), 100, 100), (179, 255, 255)),
    "white": ((0, 0, 200), (179, 25, 255)),
    "grey": ((0, 0, 50), (179, 50, 200)),
    "black": ((0, 0, 0), (179, 255, 30)),
}

def get_color(image, color_ranges):
    # Count how many pixels fall inside each HSV color range and return the dominant color.
    hsv_image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    detected_colors = {}
    for color, (lower, upper) in color_ranges.items():
        upper_range = np.array(upper, dtype=np.uint8)
        lower_range = np.array(lower, dtype=np.uint8)
        mask = cv2.inRange(hsv_image, lower_range, upper_range)
        detected_pixels = cv2.countNonZero(mask)
        detected_colors[color] = detected_pixels

    # Hue wraps around in HSV, so merge the two red ranges into a single "red" count.
    if "red_lower" in detected_colors and "red_upper" in detected_colors:
        detected_colors["red"] = detected_colors["red_lower"] + detected_colors["red_upper"]
        del detected_colors["red_lower"]
        del detected_colors["red_upper"]
    return sorted(detected_colors, key=detected_colors.get, reverse=True)[0]

[end of suggestion 0]
[suggestion 1]
You are trying to count objects within objects, or a nested structure. You can solve this by first detecting the outer objects, then cropping the image to the bounding box of each outer object and detecting the inner objects. You can use the following code to help with this task:

all_dets = <an object detection tool>("object", image)

height, width = image.shape[:2]

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

# only check inner detections on the top 25 largest outer detections
largest_dets = sorted(all_dets, key=lambda x: area(x["bbox"]), reverse=True)[:25]
for det in largest_dets:
    x1 = int(det["bbox"][0] * width)
    y1 = int(det["bbox"][1] * height)
    x2 = int(det["bbox"][2] * width)
    y2 = int(det["bbox"][3] * height)

    crop = image[y1:y2, x1:x2]
    crop_height, crop_width = crop.shape[:2]

    inner_dets = <an object detection tool>("object", crop)
    for inner_det in inner_dets:
        x1_inner = int(inner_det["bbox"][0] * crop_width)
        y1_inner = int(inner_det["bbox"][1] * crop_height)
        x2_inner = int(inner_det["bbox"][2] * crop_width)
        y2_inner = int(inner_det["bbox"][3] * crop_height)

        # shift the inner crop coordinates back into the full-image coordinate frame
        bbox = [
            x1 + x1_inner,
            y1 + y1_inner,
            x1 + x2_inner,
            y1 + y2_inner,
        ]
        # normalize so inner and outer detections share the same [0, 1] bbox format
        norm_bbox = [
            bbox[0] / width,
            bbox[1] / height,
            bbox[2] / width,
            bbox[3] / height,
        ]
        all_dets.append(
            {
                "label": inner_det["label"],
                "score": inner_det["score"],
                "bbox": norm_bbox,
            }
        )

[end of suggestion 1]
[suggestion 2]
You are trying to find objects relative to 'anchor' objects. The 'anchor' objects can be detected fine, but there are many of the other objects and you only want to return the ones that are positioned relative to the 'anchor' objects as specified by the user. You can use the following code to help with this task:

# First find a model that can detect the location of the anchor objects
anchor_dets = <a model that can detect the location of the anchor objects>("anchor object", image)
# Then find a model that can detect the location of the relative objects
relative_dets = <a model that can detect the location of the relative objects>("relative object", image)

# This will give you relative objects 'above' the anchor objects since it is the
# distance between the lower left corner of the relative object and the upper left
# corner of the anchor object. The remaining functions can be used to get the other
# relative positions.
def above_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][3] - box2["bbox"][1]
    ) ** 2

def below_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][3]
    ) ** 2

def right_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

def left_distance(box1, box2):
    return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

closest_boxes = []
for anchor_det in anchor_dets:
    # You can use any of the above functions to get the relative position
    distances = [
        (relative_det, above_distance(relative_det, anchor_det))
        for relative_det in relative_dets
    ]
    # You may want to grab the closest object for each of the anchors. This line will
    # give you the box directly above the anchor box (or below, left, right depending
    # on the function used)
    closest_box = min(distances, key=lambda x: x[1])[0]
    closest_boxes.append(closest_box)

[end of suggestion 2]
[end of suggestions]

<count>9</count>
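To make the color suggestion above concrete, here is a small, hypothetical usage example: it loads the example image, takes an invented normalized bounding box for one vegetable, and classifies the crop’s dominant color with the get_color helper and color_ranges dictionary from suggestion 0. The crop coordinates are illustrative, not output from a real detector:

# Illustrative only: classify the dominant color of one hypothetical crop using
# the get_color helper and color_ranges defined in suggestion 0 above.
import cv2

image = cv2.cvtColor(cv2.imread('vegetablebasket.jpg'), cv2.COLOR_BGR2RGB)
h, w = image.shape[:2]
x1, y1, x2, y2 = 0.40, 0.30, 0.50, 0.45  # made-up normalized bbox for one vegetable
crop = image[int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)]
print(get_color(crop, color_ranges))     # e.g. "red" for a tomato crop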

Step 3: Execution

The planned steps are executed using the VisionAgent Library and Tools Library. This involves loading the image, running models, and generating detection outputs.

Planner

Based on Claude’s description and the suggestions, I need to detect vegetables both in and around the basket. This is a nested structure problem where I first need to detect the basket and then detect vegetables in relation to it. Let me first get a tool that can detect both the basket and vegetables.

Example Execution Code

image = load_image('vegetablebasket.jpg')
claude35_vqa('Can you describe this image? What vegetables do you see in and around the basket?', [image])
suggestion('How can I detect vegetables in and around the basket in this image?', [image])
get_tool_for_task('Detect basket and vegetables in the image', [image])

Observation and Final Output

After execution, VisionAgent produces structured results:

  • Detected vegetables categorized inside and outside the basket.
  • Bounding box coordinates for each detected vegetable.
  • A deployable AI model that can be used for real-world applications.
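As an illustration only (this is not VisionAgent’s exact output schema), the structured results can be thought of as a list of labeled detections, each carrying a confidence score, a normalized bounding box, and an in/around-basket tag:

# Hypothetical output format for illustration; field names and values are assumed.
results = [
    {"label": "tomato", "score": 0.91, "bbox": [0.42, 0.31, 0.50, 0.44], "location": "in basket"},
    {"label": "carrot", "score": 0.88, "bbox": [0.06, 0.62, 0.19, 0.88], "location": "around basket"},
]
for det in results:
    print(f'{det["label"]}: {det["location"]} (score {det["score"]:.2f})')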

Output

2. Prompt: “Identify the red car in the video”

Input Video: Video Link

Step 1: Conversation

I’ll help you write code to identify and track red cars in the video. I’ll use object detection and color-based filtering to accomplish this task.

Step 2: Planner

I need to first understand what’s in the video and get suggestions on how to identify a red car. I’ll extract some frames and use claude35_vqa to describe the video content, then get suggestions.

Step 3: Execution

frames = extract_frames_and_timestamps('videoplayback(4).mp4', 1)
frames = [f["frame"] for f in frames]
claude35_vqa('Can you describe this video? Is there a red car in it?', frames[:5])
suggestion('How can I identify the red car in the video?', frames[:5])

Observation

[claude35_vqa output]
Yes, there is a red car (#21) in this video sequence, which appears to be a NASCAR race clip showing an incident between two cars - a red #21 car and a black/gold #3 car with Winchester branding. The video shows them making contact around turn 4 of the track, with the cars getting close to each other and possibly trading paint. The sequence is labeled as "Josh Berry vs Austin Dillon" and shows what appears to be a tense racing moment between the two drivers. The text overlays indicate "WOAH TURN 4", "CONTACT", "BERRY", "A LOT", and "THERE" as the sequence progresses.
[end of claude35_vqa output]

[suggestions]

Based on the user's request to identify the red car in the video and the provided video frames, this is primarily a case of:

1. finding_features_with_video_tracking - The sequence shows a NASCAR race interaction between two cars (Josh Berry and Austin Dillon) where tracking the red #21 car through the turn and contact is key
2. color - The specific request mentions identifying the "red car", making color identification relevant
3. relative_position - The interaction between the two cars requires understanding their positions relative to each other through the turn and contact

The other categories don't apply because:
- This isn't about small text or large images
- There are no grid patterns or missing components
- This isn't about comparing sizes or nested structures
- While there may be some depth perception in racing, the request isn't specifically about determining depth position


[suggestion 0]
You may need to find the color of objects in the image; you can use the following code to help with this task:

import numpy as np
import cv2

color_ranges = {
    "red_lower": ((0, 100, 100), (int(179 * 20 / 360), 255, 255)),
    "orange": ((int(179 * 21 / 360), 100, 100), (int(179 * 50 / 360), 255, 255)),
    "yellow": ((int(179 * 51 / 360), 100, 100), (int(179 * 70 / 360), 255, 255)),
    "inexperienced": ((int(179 * 71 / 360), 100, 100), (int(179 * 150 / 360), 255, 255)),
    "cyan": ((int(179 * 151 / 360), 100, 100), (int(179 * 180 / 360), 255, 255)),
    "blue": ((int(179 * 181 / 360), 100, 100), (int(179 * 265 / 360), 255, 255)),
    "purple": ((int(179 * 266 / 360), 100, 100), (int(179 * 290 / 360), 255, 255)),
    "pink": ((int(179 * 291 / 360), 100, 100), (int(179 * 330 / 360), 255, 255)),
    "red_upper": ((int(179 * 331 / 360), 100, 100), (179, 255, 255)),
    "white": ((0, 0, 200), (179, 25, 255)),
    "grey": ((0, 0, 50), (179, 50, 200)),
    "black": ((0, 0, 0), (179, 255, 30)),
}

def get_color(image, color_ranges):
    # Count how many pixels fall inside each HSV color range and return the dominant color.
    hsv_image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    detected_colors = {}
    for color, (lower, upper) in color_ranges.items():
        upper_range = np.array(upper, dtype=np.uint8)
        lower_range = np.array(lower, dtype=np.uint8)
        mask = cv2.inRange(hsv_image, lower_range, upper_range)
        detected_pixels = cv2.countNonZero(mask)
        detected_colors[color] = detected_pixels

    # Hue wraps around in HSV, so merge the two red ranges into a single "red" count.
    if "red_lower" in detected_colors and "red_upper" in detected_colors:
        detected_colors["red"] = detected_colors["red_lower"] + detected_colors["red_upper"]
        del detected_colors["red_lower"]
        del detected_colors["red_upper"]
    return sorted(detected_colors, key=detected_colors.get, reverse=True)[0]

[end of suggestion 0]
[suggestion 1]
First try to solve the problem using a VQA tool before using the tracking approach for a faster and easier solution:

answer = <VQA tool to answer your question>("<your prompt here>", image)

If that doesn't work, you can track the objects in the video and then identify features on those objects. You must first get a tool that can track objects in a video, and then for each object find another tool to identify the features on the object. You can use the following code to help with this task:

track_predictions = <object tracking tool>("object", video_frames)


# Step 1: go through each frame and each prediction and extract the predicted bounding boxes as crops
obj_to_info = {}
for frame, frame_predictions in zip(video_frames, track_predictions):
    for obj in frame_predictions:
        if obj["label"] not in obj_to_info:
            obj_to_info[obj["label"]] = []
        height, width = frame.shape[:2]
        crop = frame[
            int(obj["bbox"][1] * height) : int(obj["bbox"][3] * height),
            int(obj["bbox"][0] * width) : int(obj["bbox"][2] * width),
            :,
        ]
        # For each crop, use an object detection tool, VQA tool, or classification tool to determine whether the object contains the features you want
        output = <tool to identify your feature or multiple features>("<your feature(s) here>", crop)
        obj_to_info[obj["label"]].extend(output)

print(f"{len(obj_to_info)} objects tracked")

objects_with_info = set()
for obj_label, infos in obj_to_info.items():
    for info in infos:
        if info["label"] == "<your feature here>":
            objects_with_info.add(obj_label)
            break

print(f"{len(objects_with_info)} objects with features found")

[end of suggestion 1]
[suggestion 2]
You are trying to find objects relative to 'anchor' objects. The 'anchor' objects can be detected fine, but there are many of the other objects and you only want to return the ones that are positioned relative to the 'anchor' objects as specified by the user. You can use the following code to help with this task:

# First find a model that can detect the location of the anchor objects
anchor_dets = <a model that can detect the location of the anchor objects>("anchor object", image)
# Then find a model that can detect the location of the relative objects
relative_dets = <a model that can detect the location of the relative objects>("relative object", image)

# This will give you relative objects 'above' the anchor objects since it is the
# distance between the lower left corner of the relative object and the upper left
# corner of the anchor object. The remaining functions can be used to get the other
# relative positions.
def above_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][3] - box2["bbox"][1]
    ) ** 2

def below_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][3]
    ) ** 2

def right_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

def left_distance(box1, box2):
    return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

closest_boxes = []
for anchor_det in anchor_dets:
    # You can use any of the above functions to get the relative position
    distances = [
        (relative_det, above_distance(relative_det, anchor_det))
        for relative_det in relative_dets
    ]
    # You may want to grab the closest object for each of the anchors. This line will
    # give you the box directly above the anchor box (or below, left, right depending
    # on the function used)
    closest_box = min(distances, key=lambda x: x[1])[0]
    closest_boxes.append(closest_box)

[end of suggestion 2]
[end of suggestions]

<count>9</count>

Planner

Based on Claude’s description and the suggestions, I need to track cars in the video and identify the red one. Let me first get a tool for tracking cars in the video.

Execution

get_tool_for_task('Track cars in a NASCAR race video', frames[:5])

Output
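A minimal sketch of how the remaining steps could be stitched together once a tracking tool is selected: crop each tracked car out of its frame and keep the ones whose dominant color is red, reusing the get_color helper and color_ranges from the suggestions above. The frame and track data below are placeholders, not real tool output:

# Illustrative only: filter tracked cars by dominant color using get_color/color_ranges
# from the suggestions above. The frame and track data here are stand-ins.
import numpy as np

frames = [np.zeros((720, 1280, 3), dtype=np.uint8)]  # stand-in for real video frames
track_predictions = [[{"label": "car_1", "bbox": [0.10, 0.20, 0.30, 0.50]}]]  # assumed format

red_cars = set()
for frame, preds in zip(frames, track_predictions):
    h, w = frame.shape[:2]
    for det in preds:
        x1, y1, x2, y2 = det["bbox"]
        crop = frame[int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)]
        if crop.size and get_color(crop, color_ranges) == "red":
            red_cars.add(det["label"])

print(red_cars)  # track IDs whose dominant color is red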

Conclusion

VisionAgent is transforming the way developers build, test, and deploy AI-driven vision applications. By automating tedious processes and providing ready-to-use tools, it significantly reduces development time while ensuring high-quality results. Whether you are an AI researcher, a developer, or a business looking to implement computer vision solutions, VisionAgent provides a fast, flexible, and scalable way to achieve your goals.

With ongoing advancements in AI, VisionAgent is expected to evolve further, incorporating even more powerful models and expanding its ecosystem to support a wider range of applications. Now is the perfect time to explore how VisionAgent can enhance your AI-driven vision projects.

Hi, I’m Pankaj Singh Negi – Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
