When someone asks you a question that you do not know the answer to, how do you typically respond? Most people’s first instinct is to tell them to “Google it”. This, of course, means to grab a digital device, launch a web browser, type a search query into Google, then scan the results for an answer. But this is 2024! Technology has advanced tremendously since “Google” first became a verb a couple of decades ago. Furthermore, a text-based query is not always the best way to seek out an answer, especially if you want more information about a nearby physical object that is not easy to describe.
A team at the MIT Media Lab has hacked together a solution that they believe could make it easier to get answers to your burning questions. They have developed a wrist-mounted prototype called WatchThis that uses computer vision and large language models in a unique way to gather more information about one’s surroundings. With WatchThis, you simply point and search.
The hardware components (📷: Cathy Fang)
The device consists of a Seeed Studio XIAO ESP32S3 Sense development board, which is powered by a dual-core ESP32-S3 microcontroller and supports both Wi-Fi and Bluetooth wireless communication. This is paired with an OV2640 camera module and a Seeed Studio Round Display for XIAO with a 1.28-inch touchscreen. A LiPo battery powers the system, and it is attached to the wrist via a strap and a 3D-printed enclosure.
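For anyone curious about replicating this kind of hardware, the sketch below shows how the OV2640 on a XIAO ESP32S3 Sense is typically brought up with the standard esp32-camera driver in the Arduino environment. This is an illustration under common published pin assignments for this board, not the WatchThis team's actual firmware; verify the pinout against Seeed's documentation before use.

```cpp
// Minimal camera bring-up for the Seeed Studio XIAO ESP32S3 Sense
// (Arduino core). Pin numbers follow Seeed's published camera pinout
// for this board; confirm against your board files before flashing.
#include "esp_camera.h"

bool initCamera() {
  camera_config_t config = {};
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer   = LEDC_TIMER_0;
  config.pin_pwdn  = -1;   // power-down line not wired on this board
  config.pin_reset = -1;   // reset line not wired on this board
  config.pin_xclk  = 10;
  config.pin_sccb_sda = 40;
  config.pin_sccb_scl = 39;
  config.pin_d7 = 48; config.pin_d6 = 11; config.pin_d5 = 12; config.pin_d4 = 14;
  config.pin_d3 = 16; config.pin_d2 = 18; config.pin_d1 = 17; config.pin_d0 = 15;
  config.pin_vsync = 38;
  config.pin_href  = 47;
  config.pin_pclk  = 13;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_JPEG;     // JPEG keeps frames small for upload
  config.frame_size   = FRAMESIZE_VGA;      // 640x480 is plenty for an LLM query
  config.jpeg_quality = 12;
  config.fb_count     = 2;
  config.fb_location  = CAMERA_FB_IN_PSRAM; // the Sense variant carries 8MB PSRAM

  return esp_camera_init(&config) == ESP_OK;
}

void setup() {
  Serial.begin(115200);
  Serial.println(initCamera() ? "camera ready" : "camera init failed");
}

void loop() {}
```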
To use WatchThis, the display flips up to face the user. The camera is attached to the rear side of the display such that it can capture a video stream of whatever the wearer points it at, and that video is shown on the display. Next, the user points their finger at an object of interest, then taps on the screen with the other hand. This causes the device to capture an image of the scene.
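That tap-to-capture step might look something like the continuation sketch below. Here `screenTapped()` is a hypothetical stand-in for the Round Display's touch-event check (Seeed ships a library for the display's touch controller), and `handleCapture()` stands in for whatever forwards the frame onward; the frame-buffer calls are the real esp32-camera API.

```cpp
// Continuation of the initialization sketch above: the tap-to-capture step.
// screenTapped() and handleCapture() are assumed placeholders, not part of
// any published WatchThis code.
#include "esp_camera.h"

bool screenTapped();                                // assumed: true once per tap
void handleCapture(const uint8_t *jpg, size_t len); // assumed: forwards the image

void loop() {
  if (!screenTapped()) return;            // wait for a tap on the touchscreen

  camera_fb_t *fb = esp_camera_fb_get();  // grab the most recent JPEG frame
  if (fb) {
    handleCapture(fb->buf, fb->len);      // hand off image for the model query
    esp_camera_fb_return(fb);             // release the frame buffer
  }
}
```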
WatchThis in action (📷: Cathy Fang)
A companion smartphone app is used to type a question. That question, along with the captured image, is sent to OpenAI’s GPT-4o model via the official API. This model can analyze both images and text and reason about them to answer questions. When the answer from the model is returned, it is displayed on the screen, overlaid on the captured image, for a few seconds before the device returns to its normal operating mode. Typical response times are in the neighborhood of three seconds, making WatchThis reasonably snappy.
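The write-up does not say whether the API request originates on the watch itself or is relayed through the phone app, but the request shape is OpenAI's documented multimodal chat-completions format: a text part plus a base64 data-URI image part. Below is a hedged Arduino C++ sketch of such a call made directly from the ESP32-S3; the API key is a placeholder, `setInsecure()` and the lack of JSON escaping are shortcuts for illustration only, and none of this is the team's actual implementation.

```cpp
// Sketch: sending the captured JPEG plus a typed question to GPT-4o via
// OpenAI's chat-completions endpoint. Assumes Wi-Fi is already connected.
#include <WiFiClientSecure.h>
#include <HTTPClient.h>
#include "base64.h"  // base64::encode() ships with the ESP32 Arduino core

#define OPENAI_KEY "sk-..."  // placeholder; supply your own API key

String askGpt4o(const uint8_t *jpg, size_t len, const String &question) {
  // Base64-encode the JPEG as a data URI. For VGA-sized frames this String
  // runs to tens of kilobytes, so PSRAM-backed heap is effectively required.
  String image = "data:image/jpeg;base64," + base64::encode(jpg, len);

  // Build the JSON body in OpenAI's documented multimodal message format.
  // Note: no JSON escaping here, so the question must avoid quotes/backslashes.
  String body =
    "{\"model\":\"gpt-4o\",\"messages\":[{\"role\":\"user\",\"content\":["
    "{\"type\":\"text\",\"text\":\"" + question + "\"},"
    "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + image + "\"}}"
    "]}]}";

  WiFiClientSecure client;
  client.setInsecure();  // demo only; pin the API's root certificate in real use

  HTTPClient http;
  http.begin(client, "https://api.openai.com/v1/chat/completions");
  http.addHeader("Content-Type", "application/json");
  http.addHeader("Authorization", "Bearer " OPENAI_KEY);
  int status = http.POST(body);
  String reply = (status == 200) ? http.getString() : String("HTTP error");
  http.end();
  return reply;  // raw JSON; the answer text sits at choices[0].message.content
}
```

A round trip like this, dominated by the upload and the model's inference time, is consistent with the roughly three-second responses the team reports.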
The developers chose a smartphone app for entering questions in the name of accuracy, but having to pull out another device to type a question is a bit clunky. One question immediately raised by this arrangement is why the entire system does not simply run on the smartphone. It already has an integrated camera and certainly has the ability to make an API call, after all. A voice recognition algorithm, while it might not be as accurate, could make WatchThis far more natural and efficient to use. Perhaps after some improvements like this, we will tell people to “WatchThis it” in the future. Hey, “Google it” did not always roll so easily off the tongue either, you know!