
Transcription engines

API: yes
Search: yes
UI: yes

Transcription engines, also known as speech-to-text or automatic speech recognition (ASR) engines, take recorded speech audio and output the words that were said. Depending on the engine's capabilities, the output is either a simple sequence of words or a "lattice of confidence" that expresses multiple options for which words were spoken.

Engine input

Audio-processing engines can be stream processing engines or, if processing is stateless, segment processing engines. A transcription engine is typically stateful and operates in stream processing mode. See Engine processing modes to learn more.

All engines that process audio receive audio data with the MIME type "audio/wav"; .mp3 and .mp4 are not natively supported. If your engine needs a format other than audio/wav, you will need to transcode the incoming WAV data to the appropriate target format with a tool such as ffmpeg.
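
As a rough illustration of the transcoding step, the sketch below shells out to ffmpeg from Python. The helper names (build_ffmpeg_command, transcode), the FLAC target, and the mono 16 kHz settings are assumptions made for the example, not requirements of aiWARE; use whatever target format your engine actually needs.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that converts WAV input to the format implied by dst."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # incoming audio/wav file
        "-ac", "1",                # downmix to mono (assumed engine requirement)
        "-ar", str(sample_rate),   # resample (assumed engine requirement)
        str(dst),                  # ffmpeg infers the output format from the extension
    ]


def transcode(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True, capture_output=True)
```

Calling transcode(Path("chunk.wav"), Path("chunk.flac")) requires the ffmpeg binary to be on the PATH of the engine's container.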

Engine output

Transcribed words can be reported in engine output by using the words array within the series array.

The official transcript validation contract json-schema is available here.

Example

Here is an example of the simplest type of transcript output:

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "this"
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "words": [
        {
          "word": "is"
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "words": [
        {
          "word": "a"
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "words": [
        {
          "word": "sentence"
        }
      ]
    }
  ]
}
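
An engine would normally generate this structure from its recognizer results rather than write it by hand. The following is a minimal sketch, assuming the recognizer yields (start, stop, word) tuples in milliseconds; build_transcript and the tuple layout are hypothetical names chosen for the example.

```python
import json

SCHEMA_ID = "https://docs.veritone.com/schemas/vtn-standard/transcript.json"


def build_transcript(timed_words):
    """Turn (start_ms, stop_ms, word) tuples into a transcript output dict."""
    series = [
        {"startTimeMs": start, "stopTimeMs": stop, "words": [{"word": word}]}
        for start, stop, word in timed_words
    ]
    return {
        "schemaId": SCHEMA_ID,
        "validationContracts": ["transcript"],
        "series": series,
    }


# Hypothetical recognizer results matching the example above.
recognized = [(0, 300, "this"), (300, 500, "is"), (500, 800, "a"), (800, 1200, "sentence")]
output = build_transcript(recognized)
print(json.dumps(output, indent=2))
```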

Adding confidence scores

In addition to the basic array of words, the confidence key can be used to indicate how confident the engine is in a given result. The examples in this article express confidence as a fraction between 0 and 1 (that is, 0% to 100%).

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "this",
          "confidence": 0.2
        }
      ]
    },
    {
      "startTimeMs": 300,
      "stopTimeMs": 500,
      "words": [
        {
          "word": "is",
          "confidence": 0.1
        }
      ]
    },
    {
      "startTimeMs": 500,
      "stopTimeMs": 800,
      "words": [
        {
          "word": "a",
          "confidence": 0.1
        }
      ]
    },
    {
      "startTimeMs": 800,
      "stopTimeMs": 1200,
      "words": [
        {
          "word": "sentence",
          "confidence": 0.1
        }
      ]
    }
  ]
}
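
A small helper can guard against reporting a confidence on the wrong scale. This is a sketch only: it assumes the 0 to 1 range used in the example values above, and series_entry is a hypothetical name.

```python
def series_entry(start_ms: int, stop_ms: int, word: str, confidence: float) -> dict:
    """Build one series item with a single word and its confidence."""
    # Assumption: confidence is a fraction in [0, 1], as in the example values.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {confidence}")
    return {
        "startTimeMs": start_ms,
        "stopTimeMs": stop_ms,
        "words": [{"word": word, "confidence": confidence}],
    }


entry = series_entry(0, 300, "this", 0.2)
```

Passing a percentage such as 20 raises ValueError instead of silently producing an out-of-range score.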

Adding transcription lattices

A "transcription lattice" expresses multiple possibilities for the words that were spoken, each with its own confidence. If an engine can output this kind of nuanced information, aiWARE indexes every possibility, so a user's search matches against all of them at once.

To output a lattice of confidence, you can write multiple entries to the words array and add the following keys to each entry:

Key: Explanation
confidence: How confident the engine is in this result. The examples in this article use a fraction between 0 and 1.
bestPath: The best (most likely or most confident) path through the lattice can be constructed by including only the words where bestPath is true. Usually, this is the word in the words array with the highest confidence.
utteranceLength: In some cases, one utterance might span multiple word slots. For example, if two possibilities for a phrase were "throne" or "their own", the "throne" entry would have an utteranceLength of 2, while the "their" and "own" entries would each have an utteranceLength of 1. See the example below.

When reporting lattices, use the following rules:

  • For every words array with multiple entries, a confidence value must be included on each entry.
  • For every words array with multiple entries, one and only one of the entries must contain the bestPath key with a value of true.
  • For every words array with multiple entries, an utteranceLength key must be added to each entry.

For example:

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/transcript.json",
  "validationContracts": ["transcript"],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 200,
      "words": [
        {
          "word": "veritone",
          "confidence": 0.7,
          "bestPath": true,
          "utteranceLength": 2
        },
        {
          "word": "baritone",
          "confidence": 0.1,
          "utteranceLength": 2
        },
        {
          "word": "very",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "berry",
          "confidence": 0.1,
          "utteranceLength": 1
        }
      ]
    },
    {
      "startTimeMs": 200,
      "stopTimeMs": 300,
      "words": [
        {
          "word": "veritone",
          "confidence": 0.7,
          "bestPath": true,
          "utteranceLength": 1
        },
        {
          "word": "baritone",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "tone",
          "confidence": 0.1,
          "utteranceLength": 1
        },
        {
          "word": "phone",
          "confidence": 0.1,
          "utteranceLength": 1
        }
      ]
    }
  ]
}
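
The three lattice rules can be checked mechanically before the output is returned. The sketch below is not the official validation contract; lattice_errors is a hypothetical helper that only tests the rules listed above, and the sample data reuses the "throne" / "their own" case.

```python
def lattice_errors(transcript: dict) -> list[str]:
    """Return one message per violated lattice rule; an empty list means no violations."""
    errors = []
    for i, slot in enumerate(transcript.get("series", [])):
        words = slot.get("words", [])
        if len(words) < 2:
            continue  # the rules apply only to words arrays with multiple entries
        if any("confidence" not in w for w in words):
            errors.append(f"series[{i}]: every entry needs a confidence")
        if sum(1 for w in words if w.get("bestPath") is True) != 1:
            errors.append(f"series[{i}]: exactly one entry must have bestPath true")
        if any("utteranceLength" not in w for w in words):
            errors.append(f"series[{i}]: every entry needs an utteranceLength")
    return errors


good = {"series": [{"startTimeMs": 0, "stopTimeMs": 200, "words": [
    {"word": "throne", "confidence": 0.6, "bestPath": True, "utteranceLength": 2},
    {"word": "their", "confidence": 0.4, "utteranceLength": 1},
]}]}
bad = {"series": [{"startTimeMs": 0, "stopTimeMs": 200, "words": [
    {"word": "throne", "confidence": 0.6, "utteranceLength": 2},
    {"word": "their", "confidence": 0.4, "utteranceLength": 1},
]}]}

print(lattice_errors(good))
print(lattice_errors(bad))
```

Here good passes all three checks, while bad is reported because no entry in its words array is marked as the best path.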

Translating transcripts

Some translation engines take the output of a transcription engine as their input. To learn how those engines are built, see translating transcripts.

5/6/2024 11:06 PM
5/6/2024 11:10 PM
5/6/2024 11:10 PM