[badge/API][yes/green]
[badge/Search][yes/green]
[badge/UI][yes/green]
Transcription engines, also known as speech-to-text (STT) engines, take recorded speech audio and output the words that were spoken. Depending on the engine's capabilities, the output is either a simple sequence of words or a transcription lattice that expresses multiple candidates for each word, each with its own confidence.
Engine input
Audio-processing engines can be stream processing engines or, if processing is stateless, segment processing engines. A transcription engine is typically stateful and operates in stream processing mode. See Engine processing modes to learn more.
All engines that process audio receive audio data with MIME type `audio/wav` (.mp3 and .mp4 are not natively supported). If your engine needs a format other than `audio/wav`, you will need to transcode the incoming WAV data to the appropriate target format using a tool such as ffmpeg.
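The transcoding step can be sketched as follows. This is an illustrative Python sketch, not part of any aiWARE SDK: the helper name, target codec, and file paths are assumptions, and it simply shells out to ffmpeg when the tool and input file are present.

```python
import os
import shutil
import subprocess

def build_ffmpeg_command(src_wav, dst_path, codec, sample_rate=16000):
    """Build an ffmpeg invocation that transcodes incoming WAV data.

    The codec, sample rate, and file names are illustrative; substitute
    whatever format your engine actually requires.
    """
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it exists
        "-i", src_wav,      # input: the audio/wav payload the engine receives
        "-acodec", codec,   # e.g. "flac" or "pcm_s16le"
        "-ar", str(sample_rate),
        dst_path,
    ]

cmd = build_ffmpeg_command("input.wav", "input.flac", codec="flac")
# Only run the command if ffmpeg is installed and the input actually exists.
if shutil.which("ffmpeg") and os.path.exists("input.wav"):
    subprocess.run(cmd, check=True)
```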
Engine output
Transcribed words can be reported in engine output by using the words array within the series array.
The official JSON schema for the transcript validation contract is available here.
Example
Here is an example of the simplest type of transcript output:
```json
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "this"
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "words": [
        {
          "word": "is"
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "words": [
        {
          "word": "a"
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "words": [
        {
          "word": "sentence"
        }
      ]
    }
  ]
}
```
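Engine code rarely writes this JSON by hand. Here is a minimal Python sketch of producing the structure above from a list of timed words; the helper name `words_to_series` is my own, not part of any SDK.

```python
import json

def words_to_series(timed_words):
    """Convert (word, startTimeMs, stopTimeMs) tuples into series entries."""
    return [
        {"startTimeMs": start, "stopTimeMs": stop, "words": [{"word": word}]}
        for word, start, stop in timed_words
    ]

output = {
    # schemaId and validationContracts values copied from the example above
    "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
    "validationContracts": ["transcript"],
    "series": words_to_series([
        ("this", 0, 300),
        ("is", 300, 500),
        ("a", 500, 800),
        ("sentence", 800, 1200),
    ]),
}
print(json.dumps(output, indent=2))
```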
Adding confidence scores
In addition to the basic array of words, the confidence key can be used to indicate how confident the engine is in a given result.
```json
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "this",
          "confidence": 0.2
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "words": [
        {
          "word": "is",
          "confidence": 0.1
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "words": [
        {
          "word": "a",
          "confidence": 0.1
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "words": [
        {
          "word": "sentence",
          "confidence": 0.1
        }
      ]
    }
  ]
}
```
Adding transcription lattices
A "transcription lattice" expresses multiple possibilities for the words that were spoken, with a confidence assigned to each. If an engine can output this more nuanced information, aiWARE will index every possibility, allowing users to search across all of them at once.
To output a lattice, write multiple entries to the words array and add the following keys to each entry:
| Key | Explanation |
| --- | ----------- |
| `confidence` | How confident the engine is in this result, as a percentage (0-100) |
| `bestPath` | The best (most likely or most confident) path through the lattice can be constructed by including only the words where `bestPath` is `true`. Usually, this is the word in the `words` array with the highest confidence. |
| `utteranceLength` | In some cases, one utterance might span multiple word slots. For example, if two possibilities for a phrase were "throne" and "their own", the "throne" entry would have an `utteranceLength` of 2, while the "their" and "own" entries would each have an `utteranceLength` of 1. See the example below. |
When reporting lattices, use the following rules:

- For every `words` array with multiple entries, a `confidence` value must be included on each entry.
- For every `words` array with multiple entries, one and only one of the entries must contain the `bestPath` key with a value of `true`.
- For every `words` array with multiple entries, `utteranceLength` keys must be added to each entry.
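These rules can be checked mechanically before an engine writes its output. The following Python check is a hypothetical sketch (the function name and error messages are illustrative, not part of the validation contract):

```python
def validate_lattice_slot(words):
    """Check one words array against the three lattice rules above.

    Returns a list of rule-violation messages (empty when the slot is valid).
    """
    errors = []
    if len(words) < 2:
        return errors  # the rules only apply to multi-entry words arrays
    if not all("confidence" in w for w in words):
        errors.append("every entry must carry a confidence value")
    if sum(1 for w in words if w.get("bestPath") is True) != 1:
        errors.append("exactly one entry must have bestPath: true")
    if not all("utteranceLength" in w for w in words):
        errors.append("every entry must carry an utteranceLength")
    return errors

slot = [
    {"word": "veritone", "confidence": 0.7, "bestPath": True, "utteranceLength": 2},
    {"word": "baritone", "confidence": 0.1, "utteranceLength": 2},
]
print(validate_lattice_slot(slot))  # an empty list: all three rules hold
```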
```json
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 200,
      "words": [
        {
          "word": "veritone",
          "confidence": 0.7,
          "bestPath": true,
          "utteranceLength": 2
        },
        {
          "word": "baritone",
          "confidence": 0.1,
          "utteranceLength": 2
        },
        {
          "word": "very",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "berry",
          "confidence": 0.1,
          "utteranceLength": 1
        }
      ]
    },
    {
      "startTimeMs": 200,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "veritone",
          "confidence": 0.7,
          "bestPath": true,
          "utteranceLength": 1
        },
        {
          "word": "baritone",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "tone",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "phone",
          "confidence": 0.1,
          "utteranceLength": 1
        }
      ]
    }
  ]
}
```
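A consumer can reconstruct the most likely transcript from a lattice by following the `bestPath` entries and using `utteranceLength` to skip slots already covered by a multi-slot utterance. The sketch below is one possible reading of those semantics (the function name is illustrative), assuming `utteranceLength` counts the number of consecutive word slots an utterance spans:

```python
def best_path_transcript(series):
    """Join the bestPath words, skipping slots that a preceding
    multi-slot utterance (utteranceLength > 1) already covers."""
    transcript = []
    skip = 0
    for slot in series:
        if skip > 0:          # this slot is covered by an earlier utterance
            skip -= 1
            continue
        words = slot["words"]
        # Single-entry slots have no lattice; take the word directly.
        best = words[0] if len(words) == 1 else next(
            w for w in words if w.get("bestPath")
        )
        transcript.append(best["word"])
        skip = best.get("utteranceLength", 1) - 1
    return " ".join(transcript)

# A trimmed version of the lattice example: "veritone" (utteranceLength 2)
# consumes both slots, so the best-path transcript is the single word.
lattice_series = [
    {"startTimeMs": 0, "stopTimeMs": 200, "words": [
        {"word": "veritone", "confidence": 0.7, "bestPath": True, "utteranceLength": 2},
        {"word": "very", "confidence": 0.1, "utteranceLength": 1},
    ]},
    {"startTimeMs": 200, "stopTimeMs": 300, "words": [
        {"word": "tone", "confidence": 0.1, "bestPath": True, "utteranceLength": 1},
        {"word": "phone", "confidence": 0.1, "utteranceLength": 1},
    ]},
]
print(best_path_transcript(lattice_series))  # "veritone"
```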
Translating transcripts
Some translation engines take the output of transcription engines as their input. To learn how those engines are built, see Translating transcripts.