Struct GeneralConfig

Struct Documentation

struct switchml::GeneralConfig

Struct that groups general configuration options that must always be configured.

Public Members

uint16_t rank

A unique identifier for a worker node, similar to an MPI rank.

uint16_t num_workers

The number of worker nodes in the system.

uint16_t num_worker_threads

The number of worker threads to launch on each node.

uint32_t max_outstanding_packets

The maximum number of pending packets for this worker (not per worker thread).

This number is divided between the worker threads: each worker thread first sends an initial burst of up to this number divided by num_worker_threads, and then sends new packets only as packets are received, until all packets have been sent.

For example, if this is set to 256 and num_worker_threads is set to 8, each worker thread will send an initial burst of up to 32 packets.
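The arithmetic above can be sketched as follows. This is only an illustration: PerThreadOutstandingPackets is a hypothetical helper, not part of the SwitchML API, and the use of plain integer division (dropping any remainder) is an assumption, since the documentation does not say how an uneven split is handled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of SwitchML): the share of the worker-wide
// packet budget that a single worker thread may have in flight.
// Assumption: integer division, so any remainder is simply dropped.
inline uint32_t PerThreadOutstandingPackets(uint32_t max_outstanding_packets,
                                            uint16_t num_worker_threads) {
    assert(num_worker_threads > 0);  // avoid division by zero
    return max_outstanding_packets / num_worker_threads;
}
```

With max_outstanding_packets set to 256 and num_worker_threads set to 8, this helper returns 32, matching the example above.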

uint64_t packet_numel

The number of elements in a packet.

std::string backend

Which backend the SwitchML client should use. Choose from [‘dummy’, ‘dpdk’, ‘rdma’]. Make sure that the backend you choose has been compiled.

std::string scheduler

Which scheduler to use to dispatch jobs to worker threads. Choose from [‘fifo’].

std::string prepostprocessor

Which prepostprocessor to use to load the data into the network and unload it from the network. Choose from [‘bypass’, ‘cpu_exponent_quantizer’].
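Because backend, scheduler, and prepostprocessor are plain strings, a typo is only detected at run time. A minimal sketch of an up-front check against the values listed above might look like this. IsOneOf and OptionsLookValid are hypothetical helpers, not SwitchML functions, and the check cannot tell whether the chosen backend was actually compiled in.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if `value` appears in `allowed`.
inline bool IsOneOf(const std::string& value,
                    const std::vector<std::string>& allowed) {
    return std::find(allowed.begin(), allowed.end(), value) != allowed.end();
}

// Hypothetical check of the three string options against the choices
// listed in this documentation. It does not verify that the selected
// backend was compiled into the client library.
inline bool OptionsLookValid(const std::string& backend,
                             const std::string& scheduler,
                             const std::string& prepostprocessor) {
    return IsOneOf(backend, {"dummy", "dpdk", "rdma"}) &&
           IsOneOf(scheduler, {"fifo"}) &&
           IsOneOf(prepostprocessor, {"bypass", "cpu_exponent_quantizer"});
}
```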

bool instant_job_completion

If set to true, all jobs are completed instantly regardless of the job type. This is used for debugging, to disable all backend communication. The backend is still used for setup and cleanup.

std::string controller_ip_str

The IP address of the machine that’s running the controller program. Note: This is not the same as the IP address that is passed to the switch_ip argument when starting the controller.

uint16_t controller_port

The port that the controller program is using. This is the value that you passed to the port argument when starting the controller.

double timeout

How long, in milliseconds, to wait before considering a packet lost.

Each worker thread creates a copy of this value when it starts working on a job slice. From that point on, the timeout value can be increased as a backoff mechanism if the number of timeouts exceeds a threshold.

uint64_t timeout_threshold

How many timeouts should occur before we double the timeout time?

uint64_t timeout_threshold_increment

By how much should we increment the threshold each time it is exceeded? (This sets the bar higher to avoid doubling the timeout value too often.)
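The interaction between timeout, timeout_threshold, and timeout_threshold_increment can be sketched as below. This is an illustration under stated assumptions, not the SwitchML implementation: TimeoutBackoff is a hypothetical type, and the sketch assumes that the timeout counter is cumulative within a job slice and is never reset, which the documentation does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread backoff state (not a SwitchML type). The field
// names mirror the GeneralConfig members they would be copied from.
struct TimeoutBackoff {
    double timeout_ms;             // local copy of GeneralConfig::timeout
    uint64_t threshold;            // copy of timeout_threshold
    uint64_t threshold_increment;  // copy of timeout_threshold_increment
    uint64_t timeouts_seen;        // assumption: cumulative, never reset

    TimeoutBackoff(double timeout, uint64_t thresh, uint64_t increment)
        : timeout_ms(timeout), threshold(thresh),
          threshold_increment(increment), timeouts_seen(0) {}

    // Record one timeout. Returns true if the timeout value was doubled.
    bool OnTimeout() {
        ++timeouts_seen;
        if (timeouts_seen > threshold) {
            timeout_ms *= 2.0;                 // back off
            threshold += threshold_increment;  // raise the bar for next time
            return true;
        }
        return false;
    }
};
```

Under these assumptions, starting from a timeout of 10 ms with a threshold of 2 and an increment of 3, the third timeout doubles the value to 20 ms and raises the threshold to 5, and the sixth timeout doubles it again to 40 ms.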