Notes on computations distribution#

The computations can be distributed in several ways. This section gives a more in-depth understanding of the integrate-slurm and integrate-mp commands.

1. How integrate-slurm works#

From the configuration file (section [dataset]), the user provides a list of datasets. Each dataset consists of a collection of frames (from hundreds to tens of thousands).

The command integrate-slurm distributes the integration of the datasets over n_workers workers, each using cores_per_worker threads (these parameters are set in the [computations distribution] section of the configuration file).
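
For illustration only, the relevant part of the configuration file might look like the sketch below (the values are arbitrary examples; only n_workers and cores_per_worker are taken from this documentation, and the '#' comment syntax is assumed):

    [computations distribution]
    # Number of workers the current dataset is split over (example value)
    n_workers = 10
    # Number of threads used by each worker (example value)
    cores_per_worker = 4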

The datasets are integrated one after the other. Each dataset is split among the workers: each worker processes a part of the current dataset.

Advantages

  • Each dataset can be integrated fast, which means that you get feedback on a given dataset quickly.

  • Scales well with the number of workers.

Drawbacks

  • Needs resources that are often already occupied

  • Can be brittle when using many workers (although it works fine with 40 workers)

  • Only runs on SLURM (not on the local machine) and needs a connection to a SLURM front-end

2. How integrate-mp works#

The integrate-mp program was created to address the shortcomings of integrate-slurm:

  • Possibility to run on the local machine

  • Fully utilize the GPU processing power by assigning several workers to each GPU (not possible natively on SLURM)

  • Use SLURM partitions that have more availability

The drawback is that, at any given time, each dataset is processed by exactly one worker. This means that integrating a single dataset is not extremely fast (hence not suited for near-real-time processing). This program is instead intended for bulk processing of hundreds of datasets.

2.1 - Using integrate-mp on the local machine (partition = local)#

To distribute the azimuthal integration on the local machine, choose partition = local in [computations distribution]. A total of n_workers workers will be launched, each with cores_per_worker CPUs.
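
As a hedged sketch (illustrative values; the '#' comment syntax is assumed), the corresponding section could be:

    [computations distribution]
    # Run the workers on the local machine instead of submitting SLURM jobs
    partition = local
    # Total number of workers launched on this machine
    n_workers = 8
    # Number of CPUs given to each worker
    cores_per_worker = 4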

The GPUs are used if possible. If a machine has multiple GPUs, they are shared evenly among the workers.

There can (and should) be multiple workers per GPU. The rationale is that a single azimuthal integrator is not enough to exploit the full computing power of a GPU.

How to tune the number of workers/cores for maximum performance?

As a rule of thumb:

  • 4 workers per GPU (8 for powerful GPUs like V100, A40, A100)

  • 4 threads per worker for fast LZ4 decompression (2 are enough on fast processors without hyperthreading, e.g. AMD EPYC 7543)

  • workers * threads should not exceed the machine’s total number of (virtual) CPUs, and ideally should stay around half of it
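
For example (hypothetical hardware): on a machine with one A100 GPU and 64 virtual CPUs, this rule of thumb gives 8 workers (powerful GPU) with 4 threads each, i.e. 8 * 4 = 32 threads, which is half of the 64 virtual CPUs.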

2.2 - Using integrate-mp on multiple machines#

The integrate-mp command can also be used to distribute computations over several machines, using SLURM. In this case, set partition = gpu or partition = p9gpu in [computations distribution] (instead of local).

When using this mode, n_workers has a different meaning: it refers to the number of SLURM jobs, i.e. roughly how many nodes (or half-nodes) the computations are distributed on.

You can think of them as “big workers”: each of the n_workers SLURM jobs spawns multiple workers. By default:

  • 8 workers per SLURM job (“big worker”)

  • 4 cores per worker

  • Each “big worker” (SLURM job) uses one GPU

For example, using partition = p9gpu and n_workers = 4 means that 4 SLURM jobs are submitted (asking for one GPU each), each spawning 8 workers, so there will be 32 workers in total.
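
A hedged sketch of the corresponding configuration (the '#' comment syntax is assumed, and only partition and n_workers are shown):

    [computations distribution]
    # Submit SLURM jobs instead of running on the local machine
    partition = p9gpu
    # Number of SLURM jobs ("big workers"); with the default of 8 workers each, 32 workers in total
    n_workers = 4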

Warning

Remember: when using this mode, n_workers is the number of SLURM jobs, i.e. roughly the number of half-nodes requested. With the defaults, n_workers = 4 means 32 workers in total.