Transcript translation engines

000004167
[API][yes]
[Search][no]
[UI][no]

Transcripts are one of the five input formats that translation engines can support. Transcripts are the engine output of a transcription (speech-to-text) engine.

[Warning] To use a transcript translation engine, you must chain the output of a transcription engine to the input of the translation engine within a single job. The platform does not currently handle this routing for you.

Engine input

Transcript translation engines should be implemented as segment processing engines. Each segment will be a .aion snippet containing recognized text (conforming to the transcript validation contract).

Example input

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": [
    "transcript"
  ],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "language": "en",
      "words": [
        {
          "word": "this"
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "language": "en",
      "words": [
        {
          "word": "is"
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "language": "en",
      "words": [
        {
          "word": "a"
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "language": "en",
      "words": [
        {
          "word": "sentence"
        }
      ]
    },
    {
      "startTimeMs": 1200,
      "stopTimeMs": 1201,
      "language": "en",
      "words": [
        {
          "word": "."
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3000,
      "language": "en",
      "words": [
        {
          "word": "now"
        }
      ]
    },
    {
      "startTimeMs": 3000,
      "stopTimeMs": 3200,
      "language": "en",
      "words": [
        {
          "word": "another"
        }
      ]
    }
  ]
}
[Note] The language value may or may not be present on the input. If it is not present, the engine may decide whether to try to guess the source language or return an error.
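
As an illustration of the note above, here is a minimal sketch of one way an engine might settle on a source language before translating. The function name resolve_source_language and its fallback parameter are hypothetical and not part of the platform; the fallback stands in for whatever guess the engine chooses to make.

```python
from collections import Counter


def resolve_source_language(series, fallback=None):
    """Pick a source language for a transcript series.

    Uses the most common "language" value found on the series objects.
    If no object carries a language, returns `fallback` when one is given
    (the "guess" option) or raises ValueError (the "return an error" option).
    """
    counts = Counter(obj["language"] for obj in series if obj.get("language"))
    if counts:
        return counts.most_common(1)[0][0]
    if fallback is not None:
        return fallback
    raise ValueError("source language is missing from the input transcript")
```

Whether to guess or to fail is a design decision for the engine; failing early is usually easier to debug than translating from the wrong language.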

Engine output

Engine output is very similar to the engine input, conforming to the same transcript validation contract and mirroring the series array and startTimeMs/stopTimeMs values. However, transcript translation is a little more complicated than other translation types because transcripts have word-by-word timing, but single-word translations are not nearly as accurate as phrase or sentence translations. Therefore, in order to properly group the translated words and still maintain the time-based series structure, the series array has to be modified slightly.

The official transcript validation contract json-schema is available here.

The per-phrase translation algorithm

First, group the input transcript into sections, starting a new section after each "." word or wherever there is a break in the timing (i.e. the startTimeMs of one word is more than 1 second later than the stopTimeMs of the previous word).

Time (ms)    Content
0-1201       "this is a sentence."
2500-3200    "now another"
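
The grouping step can be sketched on its own as follows. This is a minimal illustration, assuming one word per series object; split_into_sections, describe, and max_gap_ms are hypothetical names chosen for this example.

```python
def split_into_sections(series, max_gap_ms=1000):
    """Group word-level series objects into sections.

    A new section starts after a "." word, or when a word starts more than
    max_gap_ms after the previous word stopped.
    """
    sections = []
    current = []
    for obj in series:
        if current and obj["startTimeMs"] - current[-1]["stopTimeMs"] > max_gap_ms:
            sections.append(current)
            current = []
        current.append(obj)
        if obj["words"][0]["word"] == ".":
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


def describe(section):
    """Return (start, stop, text) for one section."""
    start = min(obj["startTimeMs"] for obj in section)
    stop = max(obj["stopTimeMs"] for obj in section)
    words = [obj["words"][0]["word"] for obj in section]
    # Join with spaces but keep the period attached to the last word
    return start, stop, " ".join(words).replace(" .", ".")


timed_words = [(0, 300, "this"), (300, 500, "is"), (500, 800, "a"),
               (800, 1200, "sentence"), (1200, 1201, "."),
               (2500, 3000, "now"), (3000, 3200, "another")]
series = [{"startTimeMs": a, "stopTimeMs": b, "words": [{"word": w}]}
          for a, b, w in timed_words]
summary = [describe(s) for s in split_into_sections(series)]
# summary == [(0, 1201, "this is a sentence."), (2500, 3200, "now another")]
```

Run against the example input, this yields the same two sections listed above.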

Then for each section:

  1. Note the minimum startTimeMs and maximum stopTimeMs of all words in the section.
  2. Submit the section for translation as one phrase.
  3. Break the resulting translation into words based on spaces or characters (depending on what constitutes a word in that language).
  4. Reconstruct the series array by having one series object per word (similar to how the input had one object per word) but set the startTimeMs and stopTimeMs for each word equal to the minimum start and maximum stop times of the entire section (so they look like they overlap). Make sure to keep all words in the sentence in the proper order in the output array.
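
Step 3 depends on what counts as a word in the destination language. The following is a rough sketch only: the set of character-based languages is an assumption made for illustration, and the helper takes an extra language argument that the article's pseudo-code does not pass. A production engine would more likely rely on a proper tokenizer for each language.

```python
import re

# Assumption: these languages are written without spaces between words, so
# each non-space character is treated as one "word". Illustrative only.
CHARACTER_BASED_LANGUAGES = {"zh", "ja"}


def break_into_words(phrase, language):
    """Split a translated phrase into words for the given language code."""
    if language.split("-")[0] in CHARACTER_BASED_LANGUAGES:
        return [ch for ch in phrase if not ch.isspace()]
    # Space-delimited languages: runs of word characters become words and
    # each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", phrase)
```

For example, break_into_words("esto es una frase.", "es") gives ["esto", "es", "una", "frase", "."], which keeps the period as a separate word in the same way the transcript input does.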

Here is some (Python-ish) pseudo-code that implements this algorithm:

def translate_transcript(input_doc, your_translator, break_into_words,
                         destination_language="es"):
  """Translate a word-level transcript phrase by phrase.

  your_translator(phrase, language) and break_into_words(phrase) are
  supplied by the engine; word splitting might be different per language.
  """
  result_series = []
  last_word_stop_time = None
  phrase_words = []
  min_start_time = None
  max_stop_time = None

  def translate_and_append():
    nonlocal phrase_words, min_start_time, max_stop_time
    if not phrase_words:
      return  # Nothing buffered, nothing to translate
    # Join with spaces, keeping the period attached to the last word
    phrase = " ".join(phrase_words).replace(" .", ".")
    translated_phrase = your_translator(phrase, destination_language)
    for word in break_into_words(translated_phrase):
      result_series.append({
        "startTimeMs": min_start_time,
        "stopTimeMs": max_stop_time,
        "language": destination_language,
        "words": [{"word": word}]
      })
    # Reset temporary data
    phrase_words = []
    min_start_time = None
    max_stop_time = None

  for obj in input_doc["series"]:
    word = obj["words"][0]["word"]

    # Silences in the transcript longer than 1 second end the current phrase
    if (last_word_stop_time is not None
        and obj["startTimeMs"] - last_word_stop_time > 1000):
      translate_and_append()

    # Add the current word to the phrase buffer and widen the time range
    phrase_words.append(word)
    if min_start_time is None or obj["startTimeMs"] < min_start_time:
      min_start_time = obj["startTimeMs"]
    if max_stop_time is None or obj["stopTimeMs"] > max_stop_time:
      max_stop_time = obj["stopTimeMs"]
    last_word_stop_time = obj["stopTimeMs"]

    # Explicit period in the transcript
    if word == ".":
      translate_and_append()

  # Clear out any remaining phrase buffer
  translate_and_append()

  return {
    "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
    "validationContracts": ["transcript"],
    "series": result_series
  }
[Note] For simplicity, the pseudo-code above assumes that there is only one word in each object (i.e. object.words.length == 1). For transcripts with complex lattices (not common), object.words.length might be greater than 1. In those cases, you will want to select the words where object.words[n].bestPath is true. There will only be one of these in each words array.
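
For the lattice case described in the note, word selection might look like the sketch below. The helper name best_word is hypothetical, and falling back to the first word when no bestPath flag is present is an assumption of this example rather than documented platform behavior.

```python
def best_word(obj):
    """Return the text of the best-path word in one series object."""
    words = obj["words"]
    if len(words) == 1:
        return words[0]["word"]
    for candidate in words:
        if candidate.get("bestPath"):
            return candidate["word"]
    # Assumption: if nothing is flagged, fall back to the first alternative
    return words[0]["word"]
```

An engine could call best_word(obj) wherever the simplified code reads the first entry of the words array.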

Example output

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "esto"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "es"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "una"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "frase"
        }
      ]
    },
    {
      "startTimeMs": 0,
      "stopTimeMs": 1201,
      "language": "es",
      "words": [
        {
          "word": "."
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3200,
      "language": "es",
      "words": [
        {
          "word": "ahora"
        }
      ]
    },
    {
      "startTimeMs": 2500,
      "stopTimeMs": 3200,
      "language": "es",
      "words": [
        {
          "word": "otro"
        }
      ]
    }
  ]
}
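
Because every word in a translated section carries the same startTimeMs and stopTimeMs, a consumer can recover the phrases by grouping consecutive series objects that share a time range. This is a minimal sketch, assuming the output was produced by the per-phrase algorithm and has one word per object; phrases_from_output is a hypothetical helper name.

```python
from itertools import groupby


def phrases_from_output(series):
    """Rebuild (start, stop, phrase) tuples from a translated series."""
    def time_range(obj):
        return obj["startTimeMs"], obj["stopTimeMs"]

    phrases = []
    # groupby only merges adjacent items, which preserves the word order
    for (start, stop), group in groupby(series, key=time_range):
        text = " ".join(obj["words"][0]["word"] for obj in group)
        phrases.append((start, stop, text))
    return phrases
```

Applied to the example output, this returns one tuple for the 0-1201 range and one for the 2500-3200 range. Words are joined with single spaces, so punctuation appears as a separate token.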
Additional Technical Documentation Information
Properties
5/6/2024 11:17 PM
5/6/2024 11:22 PM
5/6/2024 11:22 PM