Title	Extracted text translation engines

URL Name	000004108

Audience	Public

Product (Internal List) aiWare - aiWare

Body

[API][yes]

[Search][no]

[UI][partial]

Extracted text is one of the five input formats that translation engines can support. It is the output of a text extraction engine. When a translation engine supports this input type, aiWARE can parse and normalize the contents of various document formats (such as .doc, .docx, .pdf, and .xls) into one consolidated format. As a result, you can focus on a single input format: vtn-standard (conforming to the text validation contract).

[Warning] To use extracted text translation engines, chain the output of a text extraction engine to the input of the translation engine in one job. The platform doesn't currently handle this routing for you.

Engine manifest

Extracted text translation engines should specify the following parameters in their build manifest:

Parameter	Value
`engineMode`	`"chunk"`
`preferredInputFormat`	`"application/json"`
`supportedInputFormats`	`["application/json"]`
`outputFormats`	`["application/json"]`

Here is a minimal example manifest.json that could apply to a translation engine:

{
  "engineId": "<your engine ID from Veritone Developer>",
  "preferredInputFormat": "application/json",
  "supportedInputFormats": [
    "application/json"
  ],
  "outputFormats": [
    "application/json"
  ],
  "engineMode": "chunk",
  "clusterSize": "small"
}
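
As a quick sanity check before building, the parameters in the table above can be verified programmatically. The following is a minimal sketch, not part of any Veritone tooling: the function name `check_manifest` and the `REQUIRED` mapping are hypothetical and simply restate the table, and the comparison is a strict equality check.

```python
import json

# Expected values, copied from the parameter table above.
REQUIRED = {
    "engineMode": "chunk",
    "preferredInputFormat": "application/json",
    "supportedInputFormats": ["application/json"],
    "outputFormats": ["application/json"],
}


def check_manifest(manifest: dict) -> list:
    """Return a list of problems found; an empty list means the manifest matches the table."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in manifest:
            problems.append(f"missing parameter: {key}")
        elif manifest[key] != expected:
            problems.append(f"{key}: expected {expected!r}, found {manifest[key]!r}")
    return problems


if __name__ == "__main__":
    # The minimal example manifest shown above.
    example = json.loads("""
    {
      "engineId": "<your engine ID from Veritone Developer>",
      "preferredInputFormat": "application/json",
      "supportedInputFormats": ["application/json"],
      "outputFormats": ["application/json"],
      "engineMode": "chunk",
      "clusterSize": "small"
    }
    """)
    print(check_manifest(example))  # prints []
```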

Engine input

Extracted text translation engines should be implemented as segment processing engines. Each segment will be a .aion snippet containing extracted text (conforming to the text validation contract).

[Note] The input format is very similar to the input format for recognized text translation engines, so these two types are often supported together in the same engine.

Example input

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": [
    "text"
  ],
  "language": "en-US",
  "object": [
    {
      "type": "text",
      "text": "The quick brown fox jumped over the lazy dog.",
      "page": 1,
      "paragraph": 1,
      "sentence": 1
    }, {
      "type": "text",
      "text": "That freaked out the dog.",
      "page": 1,
      "paragraph": 1,
      "sentence": 2
    }
  ]
}

The language value is optional on the input. If it is not present, the engine may either attempt to detect the source language or return an error.
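
One way to handle a missing language value is sketched below. The helper `resolve_source_language` and the `detect` callback are hypothetical names, not part of any aiWARE SDK. `detect` stands in for whatever language-identification method the engine chooses to use, if any.

```python
from typing import Callable, Optional


def resolve_source_language(
    segment: dict,
    detect: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Return the source language for one extracted text segment.

    Uses the segment's `language` value when present. Otherwise falls back to
    the hypothetical `detect` callback, and raises ValueError if no language
    can be determined.
    """
    language = segment.get("language")
    if language:
        return language

    if detect is not None:
        # Join all text entries so the detector sees as much context as possible.
        sample = " ".join(obj.get("text", "") for obj in segment.get("object", []))
        guessed = detect(sample)
        if guessed:
            return guessed

    raise ValueError("source language is missing from the input and could not be detected")
```

An engine that does not attempt detection would call the helper without `detect` and report the resulting error for the segment.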

Engine output

Engine output is very similar to engine input, conforming to the same text validation contract and mirroring the object array and page/paragraph/sentence indices. The only things that usually change are the language code and the values in the text keys.

See the text validation contract JSON schema.

Example output

{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": ["text"],
  "language": "fr",
  "object": [
    {
      "type": "text",
      "text": "Le rapide renard brun sauta par dessus le chien paresseux.",
      "page": 1,
      "paragraph": 1,
      "sentence": 1
    },
    {
      "type": "text",
      "text": "Cela a effrayé le chien.",
      "page": 1,
      "paragraph": 1,
      "sentence": 2
    }
  ]
}

[Note] The page, paragraph, and sentence values are optional on the input. Any of these values that are present on the input should be returned unchanged on the output.
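
The mirroring between input and output can be sketched as follows. This is an illustration only: `translate_segment` and the `translate_text` callback are hypothetical names, and the actual translation backend is left to the engine.

```python
import copy
from typing import Callable


def translate_segment(
    segment: dict,
    target_language: str,
    translate_text: Callable[[str], str],
) -> dict:
    """Build the output for one input segment.

    Every entry in `object` is copied as-is, so optional keys such as page,
    paragraph, and sentence come back unchanged when present. Only the `text`
    values and the top-level `language` are replaced.
    """
    output = copy.deepcopy(segment)  # leave the caller's input untouched
    output["language"] = target_language
    for obj in output.get("object", []):
        if obj.get("type") == "text" and "text" in obj:
            obj["text"] = translate_text(obj["text"])
    return output


if __name__ == "__main__":
    segment = {
        "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
        "validationContracts": ["text"],
        "language": "en-US",
        "object": [
            {"type": "text", "text": "That freaked out the dog.",
             "page": 1, "paragraph": 1, "sentence": 2},
        ],
    }
    # str.upper is a placeholder for a real translation call.
    print(translate_segment(segment, "fr", str.upper))
```

Copying the whole segment and overwriting only the text and language keeps any other keys intact without the engine having to enumerate them.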