Conversational AI, and specifically voicebots, is a transformational technology for many industries. In this blog, we would like to share a high-level process flow of how a Conversational AI voicebot works, the types of technologies used to deliver a near-human experience, and tips for starting your development path.

At a high level, the process works like this: a speaker talks into a phone, intercom, or app, and the audio is captured by a system, which sends it to a speech-to-text platform to turn the audio into text. The speech-to-text platform may also send other metadata to help the Natural Language Processing/Natural Language Understanding (NLP/NLU) solution determine the intent of the conversation. The NLP/NLU solution takes the text and tries to find the intent of the conversation: are they ordering food, asking for billing information, making a payment, interested in new products, or looking for technical support? After the NLP/NLU determines the intent, it is fed into the business logic to find the correct response for that intent. The response engine could create responses such as "Your balance is $1000," "Your next bill is $56.79," or "Your order is one Big Mac, one large fries, and a Diet Cherry Coke." This text response is then fed into a text-to-speech solution to create a vocal response to the speaker. Responding naturally requires this pipeline to have very low latency, typically less than 1 second, and shorter latency is a large factor in the success of a solution.
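To make the flow concrete, the pipeline above can be sketched as a chain of stages. Every function here is a hypothetical placeholder standing in for a real STT, NLU, business-logic, or TTS service, not any vendor's actual API:

```python
# A minimal sketch of the voicebot pipeline: audio in, audio out.
# All functions are illustrative stand-ins for real services.

def speech_to_text(audio_bytes: bytes) -> str:
    # In production this would stream audio to an STT provider.
    return "what is my account balance"

def detect_intent(text: str) -> str:
    # In production this would be an NLU/NLP model.
    return "account_balance" if "balance" in text else "unknown"

def business_logic(intent: str) -> str:
    responses = {
        "account_balance": "Your balance is $1000.",
        "unknown": "Sorry, I didn't catch that.",
    }
    return responses[intent]

def text_to_speech(text: str) -> bytes:
    # In production this would call a TTS provider and return audio.
    return text.encode("utf-8")

def handle_turn(audio_bytes: bytes) -> bytes:
    # End-to-end turn. Keeping this whole chain under roughly one
    # second of latency is the key design constraint.
    return text_to_speech(business_logic(detect_intent(speech_to_text(audio_bytes))))
```

Each stage maps to one of the sections below, and in a best-of-breed architecture each could be a different vendor behind the same interface.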
Audio Capture

Audio capture is normally done with existing infrastructure: UCaaS, CCaaS, on-device applications, smart speakers, PBX, or VoIP. Depending on the speech-to-text (STT) solution, the captured audio may need to be converted into a different format or encoding for real-time streaming to the STT.
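Streaming STT APIs typically expect raw PCM audio delivered in small chunks. As a sketch, assuming 16 kHz, 16-bit mono PCM (check your vendor's documentation for the formats it actually accepts), a simple chunker might look like:

```python
# Split raw PCM audio into fixed-duration chunks for real-time streaming.
# Assumes 16 kHz, 16-bit (2-byte) mono PCM; adjust for your STT vendor.

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2

def chunk_pcm(pcm: bytes, chunk_ms: int = 100):
    """Yield successive chunk_ms-millisecond slices of PCM audio."""
    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]

# One second of silence -> ten 100 ms chunks of 3200 bytes each.
chunks = list(chunk_pcm(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE))
```

Small chunks like these keep the STT fed continuously, which matters for hitting the sub-second latency target.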
Speech-to-Text

Speech-to-Text transcribes the speech in the captured audio into text for the NLU/NLP to parse and use. The more accurate the STT, the better the results from the NLU/NLP. In addition, some STT systems also provide diarization, audio sentiment, speaker ID, speaker isolation, noise reduction, and metadata on pitch, pace, tone, and utterances. This is normally a separate best-of-breed vendor. STT providers include Deepgram, Google STT, Amazon Transcribe, Nuance, and IBM Watson.
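STT providers usually return JSON containing transcripts, confidences, and word-level metadata. The exact schema varies by vendor, so the shape below is purely illustrative, but the consuming code tends to look similar everywhere:

```python
# Pull the best transcript and its confidence out of an STT response.
# This response shape is a made-up example, not any vendor's real schema.

sample_response = {
    "results": [
        {"transcript": "i would like to know my account balance please",
         "confidence": 0.97,
         "words": [{"word": "balance", "start": 2.1, "end": 2.5}]},
        {"transcript": "i would like to know my count balance please",
         "confidence": 0.41},
    ]
}

def best_transcript(response: dict):
    """Return the (transcript, confidence) pair with the highest confidence."""
    top = max(response["results"], key=lambda r: r["confidence"])
    return top["transcript"], top["confidence"]

text, conf = best_transcript(sample_response)
```

Word-level timings and confidences like these are the metadata that downstream NLU/NLP can exploit.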
Natural Language Processing and Understanding

NLU/NLP is the main processing step of the voicebot, turning words, sentences, sentiment, and audio metadata into intent. What does the speaker want to do or convey? It matches the words to an intent so the business logic and response can be determined. More advanced systems are contextually aware: they know the user's context and preferences, and they add behavior prediction, other user data, and the conversational history to provide a more accurate response. This can be part of a complete voicebot solution or a separate best-of-breed vendor such as OneReach.ai, Rasa, or Cognigy.ai.
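Production NLU models are far more sophisticated, but the core idea of mapping text to intent can be shown with simple keyword matching. The intents and keywords below are invented for illustration:

```python
# Naive keyword-based intent detection, standing in for a real NLU model.

INTENT_KEYWORDS = {
    "account_balance": {"balance", "owe"},
    "make_payment": {"pay", "payment"},
    "tech_support": {"broken", "help", "support"},
}

def detect_intent(text: str) -> str:
    """Return the first intent whose keywords overlap the utterance."""
    words = set(text.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "fallback"
```

A real NLU model replaces the keyword sets with learned representations, but the contract is the same: text (plus context) in, intent out.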
Validation and Business Logic

After the intent is presented to this step, decisions are made to determine the response. Is the intent to query for an answer, such as an account balance, how to get a new PIN, ordering a new product, creating an account, or getting technical support? This can be decision-tree logic or an AI that directs the intent to the right area. This process is normally bundled with the NLU/NLP step.
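The routing described here can be as simple as a dispatch table from intent to handler. The handlers below are hypothetical placeholders for real backend calls:

```python
# Decision-tree-style routing: each intent maps to a handler function
# that produces a response. All handlers here are illustrative.

def get_balance(user_id: str) -> str:
    # Would query a backend billing system in production.
    return "Your balance is $1000."

def reset_pin(user_id: str) -> str:
    return "I've sent a PIN reset link to your phone."

HANDLERS = {
    "account_balance": get_balance,
    "reset_pin": reset_pin,
}

def route(intent: str, user_id: str) -> str:
    handler = HANDLERS.get(intent)
    if handler is None:
        # Unrecognized intent: fall back to a human.
        return "Let me transfer you to an agent."
    return handler(user_id)
```

The fallback branch matters: a good voicebot should hand off gracefully when the intent is outside its decision tree.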
Response

The response is the text output from the Validation and Business Logic step or the result of a query against a knowledge database or backend system. These can be pre-set, scripted responses or AI-generated text responses. More advanced systems can add small talk, compound responses, and summarize the discussion. This process is also normally bundled with the NLU/NLP step, with connectors to knowledge bases, backend systems, or other AI systems.
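Pre-set responses are often templates with slots filled from a backend query. A sketch, using a plain dict as a stand-in for the backend:

```python
# Fill scripted response templates with values fetched from a backend.
# The "backend" here is a plain dict used only for illustration.

BACKEND = {"u1": {"balance": "$1000", "next_bill": "$56.79"}}

TEMPLATES = {
    "account_balance": "Your balance is {balance}.",
    "next_bill": "Your next bill is {next_bill}.",
}

def render_response(intent: str, user_id: str) -> str:
    """Look up the user's record and fill the scripted template."""
    record = BACKEND.get(user_id, {})
    template = TEMPLATES.get(intent, "Sorry, I can't help with that yet.")
    return template.format(**record)
```

AI-generated responses replace the template lookup with a generative model, but the connector pattern to knowledge bases and backends stays the same.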
Text-to-Speech

This last step takes the text response and presents it to the speaker in audio form. In the design of this step, you would choose the voice, personality, language, accent, and dialect to use. Advanced systems can personalize the voice to what the user likes, add empathetic voice emotions ("I hear you are frustrated"), add intermediate dialogue for long processing ("Hold on while I look that up"), and add confirmation cues ("aha," "hm," "huh"). This step is normally a separate best-of-breed vendor.

Overall Conversational AI voicebot providers include Agara.ai, Elerian.ai, and Uniphore. These voicebots are focused on short conversations, not one-word responses from the customer. For example, they handle conversations like "I would like to know my account balance, please," whereas IVRs only accept one word, such as "balance." The voicebot can parse out the keywords for the business logic to provide the right response.
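Voice, language, and speaking style in the TTS step are commonly configured through SSML (Speech Synthesis Markup Language), a W3C standard supported by many TTS vendors. The fragment below is illustrative only; the voice name and prosody values are invented, and vendor support for specific elements varies:

```xml
<!-- Illustrative SSML: voice name and prosody values are examples only -->
<speak>
  <voice name="en-US-ExampleVoice">
    <prosody rate="95%">
      Mhm, I hear you are frustrated.
      <break time="300ms"/>
      Hold on while I look that up.
    </prosody>
  </voice>
</speak>
```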
Tips for Choosing Your Development Path

Metrics for Success
How are you going to measure success for your Conversational AI voicebot? Do you have hard metrics that directly relate to a great customer experience? One metric to consider is the overall latency of the solution: from the time the user stops talking, how long does it take your voicebot to respond? Can your voicebot respond after a short pause, as in real conversations? For your STT, is a general Word Error Rate (WER) acceptable, or do you need a more specific WER: the word error rate on particular terms, keywords, alphanumeric sequences, or jargon essential to your use case? Some metrics you might consider are listed in this technical whitepaper.
Tip - Map the performance metrics to business-case success or a better customer experience before you start comparing solutions. Find the metrics that matter and eliminate the metrics that don't. In STT, for example, is a general WER acceptable? Remember, even a low WER doesn't distinguish between important and unimportant words, but a specific WER focused on your product names might be a better measure of performance.
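WER is defined as the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. A minimal implementation using word-level edit distance:

```python
# Word Error Rate: word-level Levenshtein distance divided by the
# number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# "balance" -> "valence" is one substitution in a five-word reference,
# so the WER is 1/5 = 0.2.
error = wer("what is my account balance", "what is my account valence")
```

A keyword-specific WER would apply the same computation restricted (or weighted) to the terms that matter for your business, such as product names.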
Build or buy - This is always one of the first questions to answer before selecting a new technology implementation. Variables in this decision include initial costs, implementation costs, maintenance costs, resources, timeline, the internal experience and skills needed, learning curve, innovation control, etc. Voicebots are still in the early stages of technology evolution, so there is a lot of change. Kevin Fredrick, Managing Partner of OneReach.ai, expressed this best when he said, "Building a Conversational AI voicebot is like planning to summit a mountain. Those who are looking for an 'easy button' get frustrated and quit. The ones who think it will be too hard, don't ever start. It is the ones who know the challenge is worth it and have the right partners and use the right tools who make the summit."
Tip - Unless you have a team very experienced in voicebots and NLP/NLU, buying might be your best bet at this time. However, be flexible in your implementation as technology is changing rapidly and you don't want to be locked into one platform or vendor.
Communication Platform Add-On, All-in-One, or Best-of-Breed - As we are in the early stages of this technology, there is a lot of choice among communication platforms that offer an add-on voicebot, voicebot-specific companies, and best-of-breed solutions for STT, NLP/NLU, and TTS. Which path you take again depends on your experience. Can your team manage the implementation and maintenance of the various vendors that may make up your voicebot, or do you want one vendor to handle it all? Is customization of the voicebot, control of innovation, speed of implementation, or ease of vendor management important? With the overall trend toward specialization, best-of-breed solutions are generally better than an all-in-one solution, and an all-in-one is typically better than an add-on solution.
Tip - This decision hinges on time, experience, and resources, but also on control of your innovation and roadmap. Best-of-breed solutions are great if you want to control your innovation and may want to swap out individual component providers as new technologies emerge. All-in-one is the simpler choice for most companies, but you are stuck with that solution's innovation timeline. Look very closely at add-on solutions, as they may be behind in technology and innovation.
General or Tailored - Each technology provider may focus on certain industries, regions, markets, or use cases, or offer a general technology that you tailor to your specific use case. You need to determine how the technology fits your use case, language, culture, and compliance needs. Some voicebots are general, and you customize the responses and logic; others are specialized, for example for banking, and include customer identification by voice. Do you need a voicebot that can speak multiple languages or understand different accents or dialects?
Tip - Study what is available in the current market and what the analysts are saying. A good start is the Gartner paper "Architecture of Conversational AI Platform." List the end-customer priorities for your voicebot and the features you need in priority order, then narrow down the choices. We have found that a more tailored approach works best and provides a better customer experience than a more general voicebot.
We hope this gave you the insight you need to take the next step in starting or advancing your conversational AI experience. We want to emphasize that this technology is changing rapidly: a one-stop-shop solution may sound great but may end up locking you into technology that quickly becomes dated. So make sure your implementations are flexible enough to let you pivot to the best voicebot experience possible, now and in the future.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.