File system
Article Number: 000004268
Product: aiWare
In AI Processing, a file system is the input/output (I/O) mechanism that engines use to read data and write results. An engine can also use the file system as an execution-time cache as needed. For each discrete engine process, all of the information held in the database, plus any additional debug data, is persisted in the file system. The file system supports the planned execution path (DAG) and the various system error cases as the single source of persisted I/O data used by engines and controllers. It holds the data generated by completed or in-progress tasks, which enables failed tasks to be reprocessed. The file system is a Network File System (NFS) version 4 share mounted via a Docker NFS deployment.

Folder structure

/cache
    /mapping.txt
    /<#> of cache instance
        /last two digits of org
            /orgId
                /job
                    /last 2 digits of job in lowercase
                        /job id
                            /definition.json (optional)
                            /status.json
                            /task
                                /taskid
                                    /io (adapters, chunk)
                                        /io-id
                                            /status.json
                                            /000000/ (thousands)
                                                001.data
                                                001.ctrl
                                                001.info (optional)
                                    /output.engine_id.log
                                    /error.engine_id.log
                                    /task_payload.json
                                    /metadata.json
                                    /status.json
                                /assets (SI Output)
                                    /playback
                                    /processing
                                /original
        /_failed_job
            /last 2 digits of job in lowercase
                /job id
                    /jobs and their children that have failed tasks are copied here
        /source
            /<source_id> (transient; may be removed)
        /engine
            /<engineid>
            /instance
                /<engine_instance_id>
        /library

Cache folder

This is where engines read and write I/O data as part of the Job DAG execution. The controller defines the folder tree from the root cache folder to the I/O folder based on the Job DAG. The Engine Toolkit reads and writes from this space on behalf of the engines to get the data required to execute a task and to write results for subsequent tasks.

The cache folder is generally organized in this way:

Org > Job > Task > I/O

Each I/O folder contains the chunks of data an engine is reading or writing. For each chunk, two files are saved to the I/O folder: a .data file and a .ctrl file. To follow file system best practices, large folders are subdivided into multiple directories by the last two digits of the ID, and I/O folders are subdivided into thousands buckets, so chunk files such as 001.data and 001.ctrl live inside a bucket folder like 000000/.
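The bucketing scheme above can be sketched as a path builder. This is a minimal illustration, not the aiWare implementation: the function name and the exact bucketing rules (last two characters of the org and job IDs, a six-digit thousands bucket for chunk indices) are assumptions based on the folder tree shown earlier.

```python
import os

def chunk_dir(root, org_id, job_id, task_id, io_id, chunk_index):
    """Build the directory holding a chunk's files, following the layout
    sketched above: orgs are bucketed by the last two characters of the
    org ID, jobs by the last two characters of the job ID (lowercased),
    and chunks by a thousands bucket (assumed six digits, e.g. 000000
    for indices below 1000)."""
    org_bucket = org_id[-2:]
    job_bucket = job_id[-2:].lower()
    thousands = f"{(chunk_index // 1000) * 1000:06d}"
    return os.path.join(root, org_bucket, org_id, "job", job_bucket,
                        job_id, "task", task_id, "io", io_id, thousands)

# Chunk 42 of a hypothetical I/O object lands in the 000000 bucket:
print(chunk_dir("/cache/1", "org7421", "job9f3a", "task1", "io1", 42))
```

Bucketing by ID suffix keeps any single directory from accumulating an unbounded number of entries, which is the file system best practice the article refers to.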

Each task in the DAG is associated with three sets of file system folders:

  • Input folders where tasks get the data to process.
  • Output folders where chunks are stored that the tasks generate.
  • Child input folders that are used by the next task in the DAG.

To process a job DAG, each task requires 0-N inputs, either from previous tasks or from an external source. A task can have more than one I/O input folder. Similarly, each task can have multiple output folders, which are required if a task generates multiple data sets (e.g., both frames and audio files).

From an implementation perspective, an I/O object represents one input or one output folder in the file system. The same I/O object type is used for both input and output folders because the designation depends on the needs of the consumer; however, the files in the folder carry the corresponding extension, .in or .out.

File naming conventions

As Engines read and write from the respective I/O folders, they follow this naming convention:

[Index level 0] . [Index level N] _ [Timestamp] _ [Parent ID] . [IO Type] . [Base] . [Modifier] . [Attempt Count]

[Index level N]
    A numeric counter that starts at 001 and increases with each output chunk or stream chunk. To construct a tiered index for the output (e.g., 001.001_[rest of the filename]), the engine uses multiple index fields separated by '.'.

[Timestamp]
    (Non-stream files only.) The timestamp the engine provides when the chunk is produced. If no time is provided, the default is the current time.

[Parent ID]
    Each engine generates a GUID for its instance when it starts and registers with a controller. All the files that the engine instance generates use this GUID.

[IO Type]
    There are two I/O types: out and in. When an engine produces a chunk, it generates a chunk of I/O type out inside an output I/O object. If child I/O objects exist, chunks of type in are also generated as hard links to the files of the out chunks.

[Base]
    There are three types:
    data - the contents of a chunk.
    ctrl - the control file.
    info - logs for that chunk.

[Modifier]
    (Optional) Applies to chunks of I/O type in. When the modifier is missing, the chunk is always available.
    P - the chunk is being processed.
    DONE - the processing is finished.
    ERROR - the chunk can't be processed due to errors.

[Attempt Count]
    (Optional) Applies to chunks of I/O type in with a modifier present. It specifies the current or last attempt that was made to process the chunk, starting at 0.
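The naming convention above can be made concrete with a small filename builder. This is an illustrative sketch only: the function name is hypothetical, and the assumption that every index level is zero-padded to three digits is inferred from the 001 examples in the article.

```python
def chunk_filename(indices, timestamp, parent_id, io_type, base,
                   modifier=None, attempt=None):
    """Assemble a chunk filename following the convention:
    [index].[index]_[timestamp]_[parent].[io_type].[base][.modifier][.attempt]
    Tiered index levels are joined with '.'; modifier and attempt count
    apply only to chunks of I/O type 'in'."""
    name = ".".join(f"{i:03d}" for i in indices)
    name += f"_{timestamp}_{parent_id}.{io_type}.{base}"
    if modifier is not None:
        name += f".{modifier}"
        if attempt is not None:
            name += f".{attempt}"
    return name

# An 'out' data chunk with a single index level:
print(chunk_filename([1], "1722370000", "9b1c-guid", "out", "data"))
# An 'in' control file marked DONE on a given attempt:
print(chunk_filename([1, 2], "1722370000", "9b1c-guid", "in", "ctrl", "DONE", 3))
```

Because the state (P, DONE, ERROR) and attempt count live in the filename itself, a consumer can determine a chunk's processing status with a directory listing alone, without opening any files.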

Roles and definitions

These items define the files in the system and the information they contain.

mapping.txt / cache.json

[io_id]/00000/001.01_[timestamp]_[parentId].out.data
    The data file of a chunk with index 001.01, a timestamp, and a parent ID.

[io_id]/00000/001.01_[timestamp]_[parentId].out.ctrl.DONE.3
    The control file of a chunk with index 001.01, a timestamp, and a parent ID. The extension indicates the chunk was processed successfully after 3 attempts. This JSON file contains the CRC32 checksum of the data file and, if any was provided when the chunk was finalized, user metadata.
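Since the control file is described as a JSON document carrying the CRC32 checksum of the data file plus optional user metadata, writing one might look like the sketch below. The function name and the JSON field names ('crc32', 'metadata') are assumptions; the article does not document the actual schema.

```python
import json
import zlib

def write_ctrl_file(data_path, ctrl_path, metadata=None):
    """Write a minimal control file for a finalized chunk: a JSON document
    containing the CRC32 checksum of the .data file and any user metadata
    supplied at finalization. Field names here are placeholders."""
    crc = 0
    with open(data_path, "rb") as f:
        # Stream the data file in blocks so large chunks don't load into memory.
        for block in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(block, crc)
    doc = {"crc32": crc & 0xFFFFFFFF}
    if metadata:
        doc["metadata"] = metadata
    with open(ctrl_path, "w") as f:
        json.dump(doc, f)
```

A consumer can re-run the same CRC32 over the .data file and compare it with the value in the control file to detect corruption before processing the chunk.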
Last updated: 7/30/2024 8:39 PM