[API][yes]
[Search][no]
[UI][no]
Transcripts are one of the five input formats that translation engines can support. A transcript is the output of a transcription (speech-to-text) engine.
[Warning] To use a transcript translation engine, you must chain the output of a transcription engine to the input of the translation engine within a single job. The platform does not currently handle this routing for you.
Engine input
Transcript translation engines should be implemented as segment processing engines. Each segment is a .aion snippet containing recognized text that conforms to the transcript validation contract.
Example input
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": [
    "transcript"
  ],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "language": "en",
      "words": [
        {
          "word": "this"
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "language": "en",
      "words": [
        {
          "word": "is"
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "language": "en",
      "words": [
        {
          "word": "a"
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "language": "en",
      "words": [
        {
          "word": "sentence"
        }
      ]
    },
    {
      "startTimeMs": 1200,
      "stopTimeMs": 1201,
      "language": "en",
      "words": [
        {
          "word": "."
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3000,
      "language": "en",
      "words": [
        {
          "word": "now"
        }
      ]
    },
    {
      "startTimeMs": 3000,
      "stopTimeMs": 3200,
      "language": "en",
      "words": [
        {
          "word": "another"
        }
      ]
    }
  ]
}
[Note] The language value may or may not be present on the input. If it is absent, the engine may either attempt to detect the source language or return an error.
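As a minimal sketch of handling that note (the function name and default-language parameter are illustrative, not part of the platform API), an engine might check for a declared source language before deciding to detect or fail:

```python
def source_language(segment, default=None):
    """Return the first declared language in the segment's series.

    Falls back to the supplied default (e.g. from engine configuration)
    when no series entry declares a language; a None result means the
    engine must either run language detection or return an error.
    """
    for entry in segment.get("series", []):
        lang = entry.get("language")
        if lang:
            return lang
    return default

segment = {
    "validationContracts": ["transcript"],
    "series": [
        {"startTimeMs": 0, "stopTimeMs": 300, "words": [{"word": "this"}]}
    ]
}

source_language(segment)        # no language declared anywhere -> None
source_language(segment, "en")  # fall back to a configured default
```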
Engine output
Engine output is very similar to the engine input: it conforms to the same transcript validation contract and mirrors the series array and startTimeMs/stopTimeMs values. However, transcript translation is more complicated than other translation types because transcripts carry word-by-word timing, while single-word translations are far less accurate than phrase or sentence translations. Therefore, to group the translated words properly while still maintaining the time-based series structure, the series array has to be modified slightly.
The official transcript validation contract JSON schema is available here.
The per-phrase translation algorithm
First, combine the input transcript into sections, splitting wherever there is a "." word or a break in the timing (i.e., the startTimeMs of a word is more than one second later than the stopTimeMs of the previous word).
| Time (ms) | Content |
|---|---|
| 0-1201 | "this is a sentence." |
| 2500-3200 | "now another" |
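The sectioning step can be sketched in Python, assuming one word per series entry as in the example input (function and variable names here are illustrative, not part of the platform):

```python
def split_into_sections(series, gap_ms=1000):
    """Group word-level series entries into translatable sections.

    A section ends after a "." word, or when the next word starts more
    than gap_ms after the previous word stopped.  Each section is a
    (min_start, max_stop, words) tuple.
    """
    def close(entries):
        start = min(e["startTimeMs"] for e in entries)
        stop = max(e["stopTimeMs"] for e in entries)
        return (start, stop, [e["words"][0]["word"] for e in entries])

    sections, current, last_stop = [], [], None
    for entry in series:
        # Break on silences longer than gap_ms
        if current and entry["startTimeMs"] - last_stop > gap_ms:
            sections.append(close(current))
            current = []
        current.append(entry)
        last_stop = entry["stopTimeMs"]
        # Break on an explicit period
        if entry["words"][0]["word"] == ".":
            sections.append(close(current))
            current = []
    if current:
        sections.append(close(current))
    return sections
```

Applied to the example input above, this produces the two sections shown in the table.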
Then for each section:
- Note the minimum startTimeMs and maximum stopTimeMs of all words in the section.
- Submit the section for translation as one phrase.
- Break the resulting translation into words based on spaces or characters (depending on what constitutes a word in that language).
- Reconstruct the series array with one series object per word (similar to how the input had one object per word), but set the startTimeMs and stopTimeMs of each word to the minimum start and maximum stop times of the entire section (so the words appear to overlap). Keep all words of the sentence in the proper order in the output array.
Here is some (Python-ish) pseudo-code that implements this algorithm:
destination_language = "es"
result_series = []
last_word_stop_time = 0
current_phrase = ""
min_start_time = None
max_stop_time = None

function translate_and_append(start, stop, phrase):
    translated_phrase = your_translator(phrase, destination_language)
    translated_words = break_into_words(translated_phrase)  # Might be different per language
    for word in translated_words:
        result_series.append({
            "startTimeMs": start,
            "stopTimeMs": stop,
            "language": destination_language,
            "words": [{"word": word}]
        })
    # Reset temporary data
    current_phrase = ""
    min_start_time = None
    max_stop_time = None

for object in input.series:
    # Flush the phrase at silences in the transcript longer than 1 second
    if current_phrase != "" and object.startTimeMs - last_word_stop_time > 1000:
        translate_and_append(min_start_time, max_stop_time, current_phrase)
    current_phrase += object.words[0].word + " "  # Word separator may differ per language
    if min_start_time is None or object.startTimeMs < min_start_time:
        min_start_time = object.startTimeMs
    if max_stop_time is None or object.stopTimeMs > max_stop_time:
        max_stop_time = object.stopTimeMs
    last_word_stop_time = object.stopTimeMs
    # Flush the phrase at an explicit period in the transcript
    if object.words[0].word == ".":
        translate_and_append(min_start_time, max_stop_time, current_phrase)

# Flush any remaining phrase buffer
if current_phrase != "":
    translate_and_append(min_start_time, max_stop_time, current_phrase)

return {
    "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
    "validationContracts": ["transcript"],
    "series": result_series
}
[Note] For simplicity, the pseudo-code above assumes that there is only one word in each object (i.e. object.words.length == 1). For transcripts with complex lattices (not common), object.words.length might be greater than 1. In those cases, you will want to select the words where object.words[n].bestPath is true. There will only be one of these in each words array.
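The lattice case from the note above can be handled with a small helper (this is our sketch, following the field names in the note, not platform-provided code):

```python
def best_word(words):
    """Select the word to translate from a series entry's words array.

    Simple transcripts have a single word per entry.  Lattice transcripts
    may have several candidates, exactly one of which has bestPath true;
    that is the one to use.
    """
    if len(words) == 1:
        return words[0]["word"]
    return next(w["word"] for w in words if w.get("bestPath"))

best_word([{"word": "hello"}])  # "hello"
best_word([
    {"word": "their"},
    {"word": "there", "bestPath": True},
])                              # "there"
```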
Example output
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "esto"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "es"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "una"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "frase"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "."
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3200,
      "language": "es",
      "words": [
        {
          "word": "ahora"
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3200,
      "language": "es",
      "words": [
        {
          "word": "otro"
        }
      ]
    }
  ]
}