The controller node is a component of the aiWARE platform that registers hosts and engine instances, manages all work, and communicates with the database layer.
The controller balances load across engines, recalculating it as needed based on existing and anticipated resources, until all tasks are assigned to engines.
The following topics are covered below:
- The roles and responsibilities of controllers
- Controller process flow
- Engine process flow
- Controller communication and authentication
- Task processing, engine allocation, and performance
Controller roles and responsibilities
Controllers are stateless API services, typically deployed behind a load balancer. Within a cluster, a single controller node is promoted to serve as primary controller and the aiWARE processing supervisor, and is responsible for specific critical functions including initial startup, core sync, usage reporting, engine loading, and removing stopped hosts.
The controller's principal function is to assign tasks to engines and adapters in an optimal manner to meet SLAs and minimize costs.
A controller manages several key functions:
- Control the starting and stopping of engines
- Provide data to the primary controller to properly scale engine hardware
- Provide stats data to the database to properly forecast engine and hardware demand
- Route data from one task to the next
- Control the assignment of engines to tasks
- Manage failures and retries
- Log data for analysis
- Expose billing metrics
- Communicate with external services
Controller process flow
Database connection
Upon launch, the controller establishes a read and write connection to the database. Engine servers (created at aiWARE processing launch) run a local aiWARE agent. This aiWARE agent connects to a controller and provides status updates on the engines running, as well as the resource capacity and current usage (memory, CPU, disk).
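The status update described above can be sketched as a simple payload builder. Note that the field names and structure here are illustrative assumptions, not the actual aiWARE agent schema:

```python
import json

# Hypothetical sketch of the status report an aiWARE agent might send to a
# controller. Field names (hostId, capacity, usage) are assumptions for
# illustration only.
def build_status_report(host_id, engines, capacity, usage):
    """Assemble a host status report covering running engines and resources."""
    return {
        "hostId": host_id,
        "engines": engines,    # names of engine instances running on this host
        "capacity": capacity,  # total resources available on the host
        "usage": usage,        # current resource consumption
    }

report = build_status_report(
    host_id="host-01",
    engines=["transcription-engine"],
    capacity={"memoryMb": 16384, "cpuCores": 8, "diskGb": 200},
    usage={"memoryMb": 4096, "cpuCores": 2, "diskGb": 35},
)
print(json.dumps(report, indent=2))
```

A report like this gives the controller the memory, CPU, and disk figures it needs to decide where new engine instances can be placed.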
Engine check and assignment
The controller checks the database to determine which engines it can run on aiWARE processing, what the base configuration of each engine should be, and how many startup engines will be available in total.
As each engine instance comes online, it makes HTTP requests to the controller. It registers itself with the controller, which in turn stores this information in the database.
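A minimal sketch of that registration request follows. The endpoint path and payload fields are assumptions made for illustration; the real controller API may differ:

```python
import json
import urllib.request

# Hedged sketch: how an engine instance might register itself with the
# controller over HTTP. "/engine/register" and the payload fields are
# hypothetical, not the documented aiWARE endpoint.
def build_registration(engine_id, instance_id, host_id):
    """Describe this engine instance so the controller can record it."""
    return {
        "engineId": engine_id,
        "instanceId": instance_id,
        "hostId": host_id,
    }

def register(controller_url, payload):
    """POST the registration; the controller stores it in the database."""
    req = urllib.request.Request(
        f"{controller_url}/engine/register",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

payload = build_registration("engine-ocr", "inst-42", "host-01")
```

Once stored, this record is what lets the controller route work to the instance in the job-processing phase.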
Job processing
Once all the startup servers have launched and registered in the database, and the startup engines are running, the engines query the controller for work. The controller receives these requests and assigns specific processing tasks to each engine as long as there is work to do. Engines report status and progress back to the controller via HTTP POST, and this data is logged in database tables. For more information about job tasks, see task processing, engine allocation, and performance.
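The request-work / report-progress cycle can be sketched as a simple loop. The `Controller` class here is an in-memory stand-in for the HTTP API, used only to show the shape of the exchange:

```python
# Minimal sketch of the engine-side work loop: poll the controller for work,
# process the assigned units, and report progress. The Controller class is a
# stand-in assumption, not the real aiWARE API.
class Controller:
    def __init__(self, tasks):
        self.tasks = list(tasks)   # queue of (task_id, units) pairs
        self.progress = []         # progress reports received from engines

    def get_work(self):
        """Hand the next task to an engine that asks for work."""
        return self.tasks.pop(0) if self.tasks else None

    def report(self, task_id, done, total):
        # In aiWARE this arrives via HTTP POST and is logged to the database.
        self.progress.append((task_id, done, total))

def run_engine(controller):
    """Keep asking for work until the controller has nothing left."""
    while (work := controller.get_work()) is not None:
        task_id, units = work
        for _ in range(units):
            pass                   # process one unit of work here
        controller.report(task_id, units, units)

ctrl = Controller([("task-a", 3), ("task-b", 2)])
run_engine(ctrl)
print(ctrl.progress)  # [('task-a', 3, 3), ('task-b', 2, 2)]
```

The pull model shown here is what the text describes: engines ask for work rather than having it pushed, so the controller can throttle and reprioritize at every request.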
Engine process flow
How engines are initiated
The controller sends a message to the engine agent running on the server, which in turn launches an engine instance with a Docker run command.
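As a rough illustration, the agent's launch step amounts to composing a `docker run` command like the one below. The image name, environment variables, and flags are assumptions, not the actual aiWARE invocation:

```python
import shlex

# Sketch of the Docker run command an agent might issue when the controller
# asks it to launch an engine instance. Image name and env vars are
# illustrative assumptions.
def docker_run_command(image, task_id, controller_url):
    """Build a detached docker run command for one engine instance."""
    return (
        "docker run -d "
        f"-e TASK_ID={shlex.quote(task_id)} "
        f"-e CONTROLLER_URL={shlex.quote(controller_url)} "
        f"{shlex.quote(image)}"
    )

cmd = docker_run_command(
    "example/engine-ocr:latest", "task-123", "http://controller:9000"
)
print(cmd)
```

Passing the task ID and controller address into the container lets the new engine instance register itself and start requesting work immediately.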
The controller assigns the engine instance a task ID and the number of units to process. A unit is a file in the task input folder. For example, the task input folder may contain 10,000 files (units), but the controller might instruct the engine instance to process only 100 of them before checking back in for more work. This lets the controller reassign the engine instance to other, higher-priority tasks rather than blocking it on this task until all 10,000 files are done.
The engine agent reads the directory of files remaining to be processed in the task ID input folder, randomizes the list, and selects a configurable number of units of work from it. If it successfully opens a file for work, it marks the file as being processed.
When the engine agent has finished work on a file, it marks the file as completed and writes the engine output to the output folder.
Engine status
When the work is complete, the engine agent notifies the controller that all assigned work is done.