Glossary
Useful terms and definitions for Vapi & voice AI applications.
A
At-cost
“At-cost” is often used when discussing pricing. It means “without profit to the seller”. Vapi charges at-cost for requests made to STT, LLM, & TTS providers.
B
Backchanneling
A backchannel occurs when a listener provides verbal or non-verbal feedback to a speaker during a conversation.
Examples of backchanneling in English include such expressions as “yeah”, “OK”, “uh-huh”, “hmm”, “right”, and “I see”.
This feedback is often not semantically significant to the conversation, but rather serves to signify the listener’s attention, understanding, sympathy, or agreement.
E
Endpointing
See speech endpointing.
I
Inbound Call
This is a call received by an assistant from another phone number (with the assistant being the “person” answering). The call comes “in”-ward to a number (from an external caller), hence the term “inbound call”.
Inference
You may often hear the term “run inference” when referring to running a large language model against an input prompt to get text output back.
The process of running a prompt against an LLM for output is called “inference”.
L
Large Language Model
Large Language Models (or “LLMs”, for short) are machine learning models trained on large amounts of text, & later used to generate text in a probabilistic manner, “token-by-token”.
For further reading see large language model wiki.
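To make “probabilistic, token-by-token” generation concrete, here is a toy sketch. The probability table `TOY_MODEL` and the `generate` function are hypothetical and hand-written for illustration; a real LLM computes these probabilities with a neural network over a very large vocabulary.

```python
import random
from typing import Dict, List

# Hypothetical toy "model": for each previous token, the probability of each
# possible next token. A real LLM learns these probabilities from training data.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<start>": {"hello": 0.6, "hi": 0.4},
    "hello": {"there": 0.7, "<end>": 0.3},
    "hi": {"there": 0.5, "<end>": 0.5},
    "there": {"<end>": 1.0},
}


def generate(rng: random.Random, max_tokens: int = 10) -> List[str]:
    """Sample one token at a time until "<end>" is drawn or max_tokens is hit."""
    tokens: List[str] = []
    prev = "<start>"
    for _ in range(max_tokens):
        dist = TOY_MODEL[prev]
        # Draw the next token at random, weighted by its probability.
        nxt = rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(random.Random())))
```

Because each token is sampled, running the sketch several times can produce different outputs for the same starting point.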
LLM
See Large Language Model.
O
Outbound Call
This is a call made by an assistant to another target phone number (w/ the assistant being the “person” dialing). The call goes “out”-ward to another number — hence the term “outbound call”.
S
SDK
Stands for “Software Development Kit” — these are pre-packaged libraries & platform-specific building tools that a software publisher creates to expedite & increase the ease of integration for developers.
Speech Endpointing
Speech endpointing is the process of detecting the start and end of (a line of) speech in an audio signal. This is an important function in conversation turn detection.
A starting heuristic for the end of a user’s speech is the detection of silence. If someone does not speak for a certain number of milliseconds, the utterance can be considered complete.
A more robust & ideal approach is to actually understand what the user is saying (as well as the current conversation’s state & the speech turn’s intent) to determine if the user is just pausing for effect, or actually finished speaking.
Vapi uses a combination of silence detection and machine learning models to properly endpoint conversation speech (to prevent improper interruption & encourage proper backchanneling).
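The silence heuristic described above can be sketched as follows. This is a minimal illustration only, not Vapi’s implementation: the function name, the per-frame energy input, the 20 ms frame size, the energy threshold, and the 500 ms silence cutoff are all assumptions chosen for the example.

```python
from typing import List, Optional


def find_end_of_utterance(
    frame_energies: List[float],
    frame_ms: int = 20,                # assumed duration of one audio frame
    silence_threshold: float = 0.01,   # assumed energy below which a frame is "silent"
    min_silence_ms: int = 500,         # assumed silence needed to end the utterance
) -> Optional[int]:
    """Return the index of the frame where the utterance is considered
    complete, or None if no endpoint is found in the given frames."""
    # Number of consecutive silent frames required (rounded up).
    frames_needed = -(-min_silence_ms // frame_ms)
    speech_started = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: mark the start and reset the silence counter.
            speech_started = True
            silent_run = 0
        elif speech_started:
            # Silent frame after speech has begun: extend the silent run.
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


if __name__ == "__main__":
    # 100 ms of leading silence, 200 ms of speech, 600 ms of trailing silence.
    energies = [0.0] * 5 + [0.5] * 10 + [0.0] * 30
    print(find_end_of_utterance(energies))
```

Leading silence is ignored until speech has started, and a pause shorter than the cutoff does not end the utterance. A heuristic like this cannot tell a mid-sentence pause from a finished turn.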
Additional reading on speech endpointing can be found here & on Deepgram’s docs.
STT
An abbreviation used for “Speech-to-text”. The process of converting physical sound waves into raw transcript text (a process called “transcription”).
T
TTS
An abbreviation used for “Text-to-speech”. The process of converting raw text into playable audio data.
V
Voice-to-Voice
“Voice-to-voice” is a term that often comes up when discussing voice AI system latency: the time from a user finishing their speech (however that endpoint is computed) to the AI agent’s first speech chunk/byte being played back on a client’s device.
Ideally, this process should happen in <1s, better if closer to 500-700ms (responding too quickly can be an issue as well). Voice AI applications must closely watch this metric to ensure their applications stay responsive & usable.
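As a rough sketch of how this metric might be tracked, the snippet below computes voice-to-voice latency from two timestamps and compares it against the figures mentioned above. The function names and the labels returned are hypothetical; how the two timestamps are captured depends on your client and is assumed here.

```python
ONE_SECOND_MS = 1000          # upper bound mentioned above (<1s)
TARGET_BAND_MS = (500, 700)   # preferred range mentioned above


def voice_to_voice_ms(user_speech_end_ms: int, first_agent_audio_ms: int) -> int:
    """Latency between the end of user speech and the first agent audio
    played on the client, with both timestamps given in milliseconds."""
    if first_agent_audio_ms < user_speech_end_ms:
        raise ValueError("agent audio cannot start before user speech ends")
    return first_agent_audio_ms - user_speech_end_ms


def describe_latency(latency_ms: int) -> str:
    """Label a latency value relative to the 1s bound and the 500-700ms band."""
    low, high = TARGET_BAND_MS
    if latency_ms >= ONE_SECOND_MS:
        return "over budget"
    if low <= latency_ms <= high:
        return "in target band"
    return "under 1s"


if __name__ == "__main__":
    latency = voice_to_voice_ms(12_000, 12_650)
    print(latency, describe_latency(latency))
```

The sketch does not flag very fast responses, since no specific lower threshold is given for when a response is too quick.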