Jean-Louis Quéguiner, Founder & CEO of Gladia – Interview Series

December 31, 2024


Jean-Louis Quéguiner is the Founder and CEO of Gladia. He previously served as Group Vice President of Data, AI, and Quantum Computing at OVHcloud, one of Europe’s leading cloud providers. He holds a Master’s Degree in Symbolic AI from the University of Québec in Canada and Arts et Métiers ParisTech in Paris. Over the course of his career, he has held significant positions across various industries, including financial data analytics, machine learning applications for real-time digital advertising, and the development of speech AI APIs.

Gladia provides advanced audio transcription and real-time AI solutions for seamless integration into products across industries, languages, and technology stacks. By optimizing state-of-the-art ASR and generative AI models, it ensures accurate, lag-free speech and language processing. Gladia’s platform also enables real-time extraction of insights and metadata from calls and meetings, supporting key enterprise use cases such as sales assistance and automated customer support.

What inspired you to tackle the challenges in speech-to-text (STT) technology, and what gaps did you see in the market?

When I founded Gladia, the initial goal was broad: an AI company that would make complex technology accessible. But as we delved deeper, it became clear that voice technology was the most broken and yet most critical area to focus on.

Voice is central to our daily lives, and most of our communication happens through speech. Yet the tools available for developers to work with voice data were inadequate in terms of speed, accuracy, and price, especially across languages.

I wanted to fix that, to unpack the complexity of voice technology and repackage it into something simple, efficient, powerful and accessible. Developers shouldn’t have to worry about the intricacies of AI models or the nuances of context length in speech recognition. My goal was to create an enterprise-grade speech-to-text API that worked seamlessly, regardless of the underlying model or technology: a true plug-and-play solution.

What are some of the unique challenges you encountered while building a transcription solution for enterprise use?

When it comes to speech recognition, speed and accuracy, the two key performance indicators in this field, are inversely proportional by design. This means that improving one will compromise the other, at least to some extent. The cost factor, to a large extent, results from the provider’s choice between speed and quality.

When building Gladia, our goal was to find the right balance between these two factors, all while ensuring the technology remains accessible to startups and SMEs. In the process we also realized that foundational ASR models like OpenAI’s Whisper, which we worked with extensively, are biased, skewing heavily towards English due to their training data, which leaves a lot of languages under-represented.

So, in addition to solving the speed-accuracy tradeoff, it was important to us, as a European, multilingual team, to optimize and fine-tune our core models to build a truly global API that helps businesses operate across languages.

How does Gladia differentiate itself in the crowded AI transcription market? What makes your Whisper-Zero ASR unique?

Our new real-time engine (Gladia Real Time) achieves an industry-leading 300 ms latency. In addition to that, it’s able to extract insights from a call or meeting with the so-called “audio intelligence” add-ons or features, like named entity recognition (NER) or sentiment analysis.

To our knowledge, very few competitors are able to provide both transcription and insights at such low latency (less than 1s end-to-end), and do all of that accurately in languages other than English. Our language support extends to over 100 languages today.

We also put a special emphasis on making the product truly stack agnostic. Our API is compatible with all existing tech stacks and telephony protocols, including SIP, VoIP, FreeSWITCH and Asterisk. Telephony protocols are especially complex to integrate with, so we believe this product aspect can bring tremendous value to the market.

Hallucinations in AI models are a significant concern, especially in real-time transcription. Can you explain what hallucinations are in the context of STT and how Gladia addresses this problem?

Hallucination usually occurs when the model lacks knowledge or doesn’t have enough context on the topic. Although models can produce outputs tailored to a request, they can only reference information that existed at the time of their training, and that may not be up-to-date. The model will create coherent responses by filling in gaps with information that sounds plausible but is incorrect.

While hallucinations became known in the context of LLMs first, they also occur with speech recognition models like Whisper ASR, a leading model in the field developed by OpenAI. Whisper’s hallucinations are similar to those of LLMs due to a similar architecture, so it’s a problem that concerns generative models, which are able to predict the words that follow based on the overall context. In a way, they ‘invent’ the output. This approach can be contrasted with more traditional, acoustic-based ASR architectures that match the input sound to output in a more mechanical way.

As a result, you may find words in a transcript that weren’t actually said, which is clearly problematic, especially in fields like medicine, where a mistake of this kind can have grave consequences.

There are several methods to manage and detect hallucinations. One common approach is to use a retrieval-augmented generation (RAG) system, which combines the model’s generative capabilities with a retrieval mechanism to cross-check facts. Another method involves employing a “chain of thought” approach, where the model is guided through a series of predefined steps or checkpoints to ensure that it stays on a logical path.

Another strategy for detecting hallucinations involves using systems that assess the truthfulness of the model’s output during training. There are benchmarks specifically designed to evaluate hallucinations, which involve comparing different candidate responses generated by the model and determining which one is most accurate.

We at Gladia have experimented with a combination of techniques when building Whisper-Zero, our proprietary ASR that removes virtually all hallucinations. It has shown excellent results in asynchronous transcription, and we’re currently optimizing it for real-time to achieve the same 99.9% information fidelity.

STT technology must handle a wide range of complexities like accents, noise, and multi-language conversations. How does Gladia approach these challenges to ensure high accuracy?

Language detection in ASR is an extremely complex task. Each speaker has a unique vocal signature, which we call features. By analyzing the vocal spectrum, machine learning algorithms can perform classifications, using the Mel Frequency Cepstral Coefficients (MFCC) to extract the main frequency characteristics.

MFCC is a method inspired by human auditory perception. It’s part of the “psychoacoustic” field, focusing on how we perceive sound. It emphasizes lower frequencies and uses techniques like normalized Fourier decomposition to convert audio into a frequency spectrum.
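To make the emphasis on lower frequencies concrete, here is a minimal sketch of the mel scale that MFCC pipelines are built on. It uses the widely cited 2595 · log10(1 + f/700) conversion; the function names and the 0–8000 Hz range are illustrative assumptions, not taken from Gladia's implementation.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in hertz to mels (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz: float, high_hz: float, n_bands: int) -> list:
    """Return n_bands + 2 filter edges (in Hz), evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_bands + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(n_bands + 2)]


# Hypothetical setup: 10 bands between 0 Hz and 8000 Hz.
edges = mel_band_edges(0.0, 8000.0, 10)
```

Because the edges are evenly spaced in mels rather than hertz, the bands are narrow at the low end and wide at the high end, which is how the representation devotes more resolution to lower frequencies.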

However, this approach has a limitation: it’s based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).

This is where Gladia’s innovative solution comes in. We’ve developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.

Our system doesn’t just listen to how you speak, but also understands what you’re saying. This dual approach allows for efficient code-switching and keeps strong accents from being misrepresented or misunderstood.
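The idea of combining an acoustic signal with a content signal can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the stop-word lists, the score format, the `text_weight` parameter and the function names are hypothetical, and this is not a description of how Gladia's system works internally.

```python
# Tiny stop-word lists stand in for a real content model (assumption).
STOPWORDS = {
    "en": {"the", "and", "is", "you", "to", "of"},
    "fr": {"le", "la", "et", "est", "vous", "de"},
}


def text_scores(words):
    """Score each language by the share of stop-word hits in the words."""
    counts = {
        lang: sum(w.lower() in vocab for w in words)
        for lang, vocab in STOPWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No evidence from content: fall back to a uniform score.
        return {lang: 1.0 / len(STOPWORDS) for lang in STOPWORDS}
    return {lang: count / total for lang, count in counts.items()}


def detect_language(acoustic_scores, words, text_weight=0.6):
    """Blend acoustic and content scores and return the best language."""
    content = text_scores(words)
    combined = {
        lang: (1.0 - text_weight) * acoustic_scores.get(lang, 0.0)
        + text_weight * content[lang]
        for lang in STOPWORDS
    }
    return max(combined, key=combined.get)


# Hypothetical case: accented English that "sounds" French acoustically.
acoustic = {"en": 0.4, "fr": 0.6}
words = "the meeting is moved to monday".split()
hybrid_guess = detect_language(acoustic, words)
acoustic_only_guess = detect_language(acoustic, words, text_weight=0.0)
```

In this made-up case the acoustic scores alone favor French, while the blended score picks English because the words themselves are English. That is the failure mode with strong accents described above.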

Code-switching, which is among our key differentiators, is a particularly important feature in handling multilingual conversations. Speakers may switch between languages mid-conversation (or even mid-sentence), and the ability of the model to transcribe accurately on the fly despite the switch is critical.

Gladia API is unique in its ability to handle code-switching across this many language pairs with a high level of accuracy, and it performs well even in noisy environments, which are known to reduce the quality of transcription.

Real-time transcription requires ultra-low latency. How does your API achieve less than 300 milliseconds latency while maintaining accuracy?

Keeping latency under 300 milliseconds while maintaining high accuracy requires a multifaceted approach that blends hardware expertise, algorithm optimization, and architectural design.

Real-time AI isn’t like traditional computing: it’s tightly linked to the power and efficiency of GPGPUs. I’ve been working in this space for nearly a decade, leading the AI division at OVHcloud (the biggest cloud provider in the EU), and learned firsthand that it’s always about finding the right balance: how much hardware power you need, how much it costs, and how you tailor the algorithms to work seamlessly with that hardware.

Performance in real-time AI comes from effectively aligning our algorithms with the capabilities of the hardware, ensuring every operation maximizes throughput while minimizing delays.

But it’s not just the AI and hardware. The system’s architecture plays a big role too, especially the network, which can really impact latency. Our CTO, who has deep expertise in low-latency network design from his time at Sigfox (an IoT pioneer), has optimized our network setup to shave off valuable milliseconds.

So, it’s really a mix of all these factors (smart hardware choices, optimized algorithms, and network design) that lets us consistently achieve sub-300ms latency without compromising on accuracy.

Gladia goes beyond transcription with features like speaker diarization, sentiment analysis, and time-stamped transcripts. What are some innovative applications you’ve seen your clients develop using these tools?

ASR unlocks a wide range of applications for platforms across verticals, and it’s been amazing to see how many truly pioneering companies have emerged in the last two years, leveraging LLMs and our API to build cutting-edge, competitive products. Here are some examples:

Smart note-taking: Many customers are building tools for professionals who need to quickly capture and organize information from work meetings, student lectures, or medical consultations. With speaker diarization, our API can identify who said what, making it easy to follow conversations and assign action items. Combined with time-stamped transcripts, users can jump straight to specific moments in a recording, saving time and ensuring nothing gets lost in translation.
Sales enablement: In the sales world, understanding customer sentiment is everything. Teams are using our sentiment analysis feature to gain real-time insights into how prospects respond during calls or demos. Plus, time-stamped transcripts help teams revisit key parts of a conversation to refine their pitch or address client concerns more effectively. For this use case in particular, NER is also key to identifying names, company details, and other information that can be extracted from sales calls to feed the CRM automatically.
Call center assistance: Companies in the contact center space are using our API to provide live assistance to agents, as well as flagging customer sentiment during calls. Speaker diarization ensures that things being said are assigned to the right person, while time-stamped transcripts enable supervisors to review critical moments or compliance issues quickly. This not only improves the customer experience, with better on-call resolution rate and quality monitoring, but also boosts agent productivity and satisfaction.
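The use cases above all rely on the same two building blocks: utterances labeled by speaker, and start/end times for each utterance. Here is a minimal sketch of how a client application might work with such data. The `Utterance` fields, helper names and sample call are hypothetical and do not reflect Gladia's actual response schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # diarization label (hypothetical field name)
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str


def group_by_speaker(utterances):
    """Collect what each speaker said, preserving the original order."""
    grouped = defaultdict(list)
    for utterance in utterances:
        grouped[utterance.speaker].append(utterance.text)
    return dict(grouped)


def utterance_at(utterances, seconds):
    """Return the utterance being spoken at a given time, or None."""
    for utterance in utterances:
        if utterance.start <= seconds < utterance.end:
            return utterance
    return None


# Made-up sample data for illustration only.
transcript = [
    Utterance("agent", 0.0, 3.2, "Thanks for calling, how can I help?"),
    Utterance("customer", 3.4, 7.9, "My invoice shows the wrong amount."),
    Utterance("agent", 8.1, 11.0, "I will correct it right away."),
]
```

With this shape, `group_by_speaker(transcript)` gives a per-speaker view for note-taking or review, and `utterance_at(transcript, 5.0)` lets a user or supervisor jump to whatever was being said five seconds into the recording.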

Can you discuss the role of custom vocabularies and entity recognition in improving transcription reliability for enterprise users?

Many industries rely on specialized terminology, brand names, and unique language nuances. Custom vocabulary integration allows the STT solution to adapt to these specific needs, which is crucial for capturing contextual nuances and delivering output that accurately reflects your business needs. For instance, it allows you to create a list of domain-specific words, such as brand names, in a specific language.

Why it’s useful: Adapting the transcription to the specific vertical allows you to minimize errors in transcripts, achieving a better user experience. This feature is especially critical in fields like medicine or finance.
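One way to picture the effect of a custom vocabulary is as a correction step that snaps near-miss words to known domain terms. The sketch below does this as post-processing with Python's standard `difflib`. That is an assumption chosen for simplicity: production systems may instead bias the recognition step itself, and the function name, cutoff and sample sentence are hypothetical.

```python
import difflib


def apply_custom_vocabulary(text, vocabulary, cutoff=0.8):
    """Replace words that closely resemble a vocabulary term with that term."""
    # Map lowercase forms back to the preferred spelling of each term.
    lookup = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in text.split():
        match = difflib.get_close_matches(
            word.lower(), list(lookup), n=1, cutoff=cutoff
        )
        corrected.append(lookup[match[0]] if match else word)
    return " ".join(corrected)


# Made-up transcript with a misspelled brand name.
fixed = apply_custom_vocabulary(
    "please open a ticket with gladya today", ["Gladia"]
)
```

This toy version splits on whitespace only, so punctuation attached to a word would lower its similarity score. A cutoff that is too low would also start rewriting ordinary words, which is why the threshold is kept fairly strict here.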

Named entity recognition (NER) extracts and identifies key information from unstructured audio data, such as names of people, organizations, locations, and more. A common challenge with unstructured data is that this critical information isn’t readily accessible: it’s buried within the transcript.

To solve this, Gladia developed a structured Key Data Extraction (KDE) approach. By leveraging the generative capabilities of its Whisper-based architecture, which are similar to those of LLMs, Gladia’s KDE captures context to identify and extract relevant information directly.

This process can be further enhanced with features like custom vocabulary and NER, allowing businesses to populate CRMs with key data quickly and efficiently.

In your opinion, how is real-time transcription transforming industries such as customer support, sales, and content creation?

Real-time transcription is reshaping these industries in profound ways, driving incredible productivity gains, coupled with tangible business benefits.

First, real-time transcription is a game-changer for support teams. Real-time assistance is key to improving the resolution rate thanks to faster responses, smarter agents, and better outcomes (in terms of NSF, handle times, and so forth). As ASR systems get better and better at handling non-English languages and performing real-time translation, contact centers can achieve a truly global CX at lower margins.

In sales, speed and spot-on insights are everything. Similarly to what happens with call agents, real-time transcription is what equips sales teams with the right insights at the right time, enabling them to focus on what matters the most in closing deals.

For creators, real-time transcription is perhaps less relevant today, but still full of potential, especially when it comes to live captioning and translation during media events. Most of our current media customers still prefer asynchronous transcription, as speed is less critical there, while accuracy is key for applications like time-stamped video editing and subtitle generation.

Real-time AI transcription seems to be a growing trend. Where do you see this technology heading in the next 5-10 years?

I feel like this phenomenon, which we now call real-time AI, is going to be everywhere. Essentially, what we really refer to here is the seamless ability of machines to interact with people, the way we humans already interact with one another.

And if you look at any Hollywood movie (like Her) set in the future, you’ll never see anyone there interacting with intelligent systems via a keyboard. For me, that serves as the ultimate proof that in the collective imagination of humanity, voice will always be the primary way we interact with the world around us.

Voice, as the main vector to aggregate and share human knowledge, has been part of human culture and history for much longer than writing. Then, writing took over because it enabled us to preserve our knowledge more effectively than relying on the community elders to be the guardians of our stories and wisdom.

GenAI systems, capable of understanding speech, generating responses, and storing our interactions, brought something completely new to the space. It’s the best of both worlds and the best of humanity really. It gives us this unique power and energy of voice communication with the benefit of memory, which previously only written media could secure for us. This is why I believe it’s going to be everywhere: it’s our ultimate collective dream.

Thank you for the great interview. Readers who wish to learn more should visit Gladia.
