[API][yes] [Search][yes] [UI][yes]
Text extraction engines extract textual information from documents and express it in a structured format. Their output data structure is similar to that of text recognition engines. The difference is in the input: text recognition engines recognize text in unstructured files like images, while text extraction engines extract text content from semi-structured files like PDFs or Microsoft Word documents.
Engine output
Most text extraction engine output should be stored in the non-time-based object array defined by the engine output standard. Each string of text is represented as an object of type text.
See the text validation contract json-schema.
Ordering indexes
Each object may include any or all of the page/paragraph/sentence indexes:
page: represents a physical page in a page-aware document type like PDF or docx.
paragraph: represents a section of content like a literary paragraph or a line number in less literary document formats.
sentence: represents an individual expression of thought like a literary sentence or a grouped string of text.
Each of the page/paragraph/sentence indexes must start at 1. If page is provided, the indexes for paragraph and sentence must be reset to 1 each time the page index is incremented.
If paragraph is provided, the index for sentence must be reset to 1 each time the paragraph index is incremented.
All indexes are optional, but including at least one is highly recommended so that order is explicitly preserved. The output can also include a language code and confidence scores.
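The index rules above can be checked mechanically. The following is a minimal sketch in Python, not part of any official tooling: the function name check_ordering is hypothetical, and it assumes the objects are listed in document order and that the first object's indexes must all be 1.

```python
# Hypothetical helper, not part of the engine output standard.
INDEX_KEYS = ("page", "paragraph", "sentence")  # outermost to innermost


def check_ordering(objects):
    """Return a list of rule violations (empty if none were found).

    Assumes `objects` is the "object" array in document order.
    """
    errors = []
    prev = None  # indexes of the last valid object seen
    for i, obj in enumerate(objects):
        cur = {k: obj[k] for k in INDEX_KEYS if k in obj}

        # Every provided index must be a positive integer.
        bad = [k for k, v in cur.items() if not isinstance(v, int) or v < 1]
        for k in bad:
            errors.append(f"object {i}: {k} must be an integer >= 1")
        if bad:
            continue

        for level, key in enumerate(INDEX_KEYS):
            if key not in cur:
                continue
            # An inner index must restart at 1 on the first object and
            # whenever any enclosing index has been incremented.
            outer_advanced = prev is None or any(
                outer in cur and outer in prev and cur[outer] > prev[outer]
                for outer in INDEX_KEYS[:level]
            )
            if outer_advanced and cur[key] != 1:
                errors.append(f"object {i}: {key} must restart at 1")
        prev = cur
    return errors
```

An empty list means no violation was detected. The sketch does not check that indexes increase without gaps, since the rules above do not state that requirement.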
Example
The following simple example shows what the output of a text extraction engine could look like:
{
  "schemaId": "https://docs.veritone.com/schemas/vtn-standard/master.json",
  "validationContracts": ["text"],
  "object": [
    {
      "type": "text",
      "text": "Le rapide renard brun sauta par dessus le chien paresseux.",
      "page": 1,
      "paragraph": 1,
      "sentence": 1
    },
    {
      "type": "text",
      "text": "Cela a effrayé le chien.",
      "page": 1,
      "paragraph": 1,
      "sentence": 2
    }
  ]
}
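An engine could assemble output of this shape as sketched below. This is an illustration only: the function name build_output and the nested-list input format (pages containing paragraphs containing sentence strings) are assumptions, and a real engine would get this structure from its own document parser.

```python
import json

SCHEMA_ID = "https://docs.veritone.com/schemas/vtn-standard/master.json"


def build_output(pages):
    """Build engine output from nested lists (hypothetical input format).

    `pages` is a list of pages; each page is a list of paragraphs;
    each paragraph is a list of sentence strings.
    """
    objects = []
    # enumerate(..., start=1) restarts each inner index at 1 whenever
    # the enclosing page or paragraph changes.
    for page_no, paragraphs in enumerate(pages, start=1):
        for para_no, sentences in enumerate(paragraphs, start=1):
            for sent_no, text in enumerate(sentences, start=1):
                objects.append({
                    "type": "text",
                    "text": text,
                    "page": page_no,
                    "paragraph": para_no,
                    "sentence": sent_no,
                })
    return {
        "schemaId": SCHEMA_ID,
        "validationContracts": ["text"],
        "object": objects,
    }


if __name__ == "__main__":
    pages = [[[
        "Le rapide renard brun sauta par dessus le chien paresseux.",
        "Cela a effrayé le chien.",
    ]]]
    # ensure_ascii=False keeps accented characters readable in the output.
    print(json.dumps(build_output(pages), ensure_ascii=False, indent=2))
```

Deriving the indexes from the nesting, rather than tracking counters by hand, keeps the reset-to-1 behavior correct by construction.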