[badge/API][yes]
[badge/Search][partial]
[badge/UI][no]
Speaker recognition (also called speaker identification) engines identify when speakers change and who those speakers are in a piece of audio.
They expand upon the capabilities of speaker detection engines by identifying the individual whose voice was detected in addition to specifying where in time the person started and stopped talking.
Training and libraries
Since speaker recognition engines identify entities, they are required to be trainable via libraries.
Engine input
Audio-processing engines can use stream processing or, if processing is stateless, segment processing. See Engine processing modes to learn more.
[Note] All engines that process audio will receive audio data with MIME type "audio/wav" (.mp3 and .mp4 are not natively supported). If your engine needs a format other than audio/wav, you will need to transcode the incoming wav data to the appropriate target format using a tool such as ffmpeg.
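If your engine requires, say, MP3 input, the transcoding step can be sketched by shelling out to ffmpeg. This is an illustrative sketch, not part of any aiWARE SDK; the function name and file paths are hypothetical, and ffmpeg must be on the PATH to actually run the command.

```python
import subprocess

def build_transcode_command(src_wav: str, dst_path: str) -> list[str]:
    """Build an ffmpeg command that transcodes incoming wav data to the
    target format implied by the destination extension (e.g. .mp3).
    Returned as an argv list suitable for subprocess.run."""
    # -y overwrites the destination if it already exists
    return ["ffmpeg", "-y", "-i", src_wav, dst_path]

# To actually perform the transcode (requires ffmpeg installed):
# subprocess.run(build_transcode_command("input.wav", "input.mp3"), check=True)
```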
Engine output
Within the time-based series array in the engine's output, each speaker recognition record (that is, each series entry) should contain an object of type speaker. Because each speaker maps back to an entity in a library, each object should include the entityId of that original entity, along with the libraryId where it can be found.
Example
Here is an example of the simplest type of speaker recognition output:
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": [
    "speaker"
  ],
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 10000,
      "object": {
        "type": "speaker",
        "label": "Tim",
        "entityId": "<entity ID>",
        "libraryId": "<library ID>"
      }
    }
  ]
}
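Producing a record like the one above from engine code can be sketched as follows. The helper name is hypothetical; the field names simply mirror the example output.

```python
import json

def make_speaker_record(start_ms, stop_ms, label, entity_id, library_id):
    """Build one series entry of type "speaker", mapping the detected
    voice back to an entity in a library."""
    return {
        "startTimeMs": start_ms,
        "stopTimeMs": stop_ms,
        "object": {
            "type": "speaker",
            "label": label,
            "entityId": entity_id,
            "libraryId": library_id,
        },
    }

output = {
    "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
    "validationContracts": ["speaker"],
    "series": [make_speaker_record(0, 10000, "Tim", "<entity ID>", "<library ID>")],
}
print(json.dumps(output, indent=2))
```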
Legacy format
A legacy format is still supported but has been deprecated.
{
  "series": [
    {
      "startTimeMs": 0,
      "stopTimeMs": 10000,
      "speakerId": "<libraryId>:<entityId> (if it maps to an entity, otherwise just a label)"
    }
  ]
}
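If you need to migrate legacy records to the current form, splitting speakerId on the first colon recovers the library and entity IDs. A minimal sketch, assuming the value follows the convention above (the function name is hypothetical):

```python
def upgrade_legacy_record(record: dict) -> dict:
    """Convert a deprecated speakerId record into the current
    object-of-type-speaker form. A "<libraryId>:<entityId>" value maps
    back to an entity; anything else is kept as a plain label."""
    speaker_id = record["speakerId"]
    obj = {"type": "speaker"}
    if ":" in speaker_id:
        # Split only on the first colon so entity IDs containing
        # colons survive intact.
        library_id, entity_id = speaker_id.split(":", 1)
        obj["libraryId"] = library_id
        obj["entityId"] = entity_id
    else:
        obj["label"] = speaker_id
    return {
        "startTimeMs": record["startTimeMs"],
        "stopTimeMs": record["stopTimeMs"],
        "object": obj,
    }
```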