[API][yes]
[Search][yes]
[UI][yes]
A speaker detection engine (also called a diarization or speaker separation engine) partitions an audio stream into segments based on who is speaking and when. Unlike a speaker recognition engine, a speaker detection engine only determines when the speaker changes and, possibly, which segments belong to the same person; it does not attempt to identify who that person is.
Engine input
Audio-processing engines can run as stream processing engines or, if processing is stateless, as segment processing engines. See Engine processing modes to learn more.
All engines that process audio receive audio data with MIME type "audio/wav" (.mp3 and .mp4 are not natively supported). If your engine needs a format other than audio/wav, you must transcode the incoming WAV data to your target format yourself, using a tool such as ffmpeg.
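For example, the transcoding step can be driven from your engine process by invoking ffmpeg on the incoming WAV payload. A minimal sketch, assuming ffmpeg is on the PATH; the helper name is illustrative, and ffmpeg infers the output codec from the destination file's extension:

```python
import subprocess

def build_transcode_cmd(src_wav: str, dst_path: str) -> list:
    """Build an ffmpeg command that converts a WAV file to the format
    implied by dst_path's extension (e.g. .mp3, .flac)."""
    # -y overwrites an existing output file without prompting
    return ["ffmpeg", "-y", "-i", src_wav, dst_path]

# To actually run the conversion:
# subprocess.run(build_transcode_cmd("input.wav", "output.mp3"), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the invocation easy to log and test without requiring ffmpeg at test time.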
Engine output
Within the time-based series array in the engine's output, each speaker detection record (that is, each series entry) should contain an object of type speaker.
Example
Here is an example of the simplest type of speaker detection output:
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": [
    "speaker"
  ],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 10000,
      "object": {
        "type": "speaker",
        "label": "Tim"
      }
    }
  ]
}
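An output document in this shape can also be assembled programmatically before serialization. A minimal Python sketch; the field names follow the example above, but the `speaker_segment` helper itself is hypothetical, not part of any Veritone SDK:

```python
import json

def speaker_segment(start_ms: int, stop_ms: int, label: str) -> dict:
    """One series entry: a speaker detected between start_ms and stop_ms."""
    return {
        "startTimeMs": start_ms,
        "stopTimeMs": stop_ms,
        "object": {"type": "speaker", "label": label},
    }

output = {
    "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
    "validationContracts": ["speaker"],
    "series": [speaker_segment(0, 10000, "Tim")],
}

print(json.dumps(output, indent=2))
```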
Legacy format
There is a legacy format that is still supported but has been deprecated.
{
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 10000,
      "speakerId": "<label>"
    }
  ]
}
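Output in the legacy shape can be migrated mechanically: each entry's speakerId value becomes the label of a speaker object. A hedged sketch; the converter function is illustrative, not part of any Veritone SDK:

```python
def upgrade_legacy_series(legacy: dict) -> dict:
    """Convert deprecated speakerId entries to the current speaker-object form."""
    return {
        "series": [
            {
                "startTimeMs": entry["startTimeMs"],
                "stopTimeMs": entry["stopTimeMs"],
                "object": {"type": "speaker", "label": entry["speakerId"]},
            }
            for entry in legacy["series"]
        ]
    }

legacy = {"series": [{"startTimeMs": 0, "stopTimeMs": 10000, "speakerId": "Tim"}]}
upgraded = upgrade_legacy_series(legacy)
```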