# Notes on computations distribution

The computations can be distributed in several ways. This section gives a more in-depth understanding of the `integrate-slurm` and `integrate-mp` commands.

## 1. How `integrate-slurm` works

From the configuration file (section `[dataset]`), the user provides a list of datasets. Each dataset consists of a collection of frames (hundreds to tens of thousands).

The command `integrate-slurm` distributes the integration of these datasets over `n_workers` workers, each using `cores_per_worker` threads (these parameters are tuned in the `[computations distribution]` section of the configuration file).

The datasets are integrated one after the other. Each dataset is split among the workers: each worker processes a part of the current dataset.

Advantages

- Each dataset can be integrated fast, meaning you get quick feedback on a given dataset.
- Scales well with the number of workers.

Drawbacks

- Needs resources, which are often occupied.
- Brittle when using many workers (although it works fine with 40 workers).
- Only runs on SLURM (not on the local machine), and needs to connect to a SLURM front-end.

## 2. How `integrate-mp` works

The `integrate-mp` program was created to address the shortcomings of `integrate-slurm`:

- Possibility to run on the local machine
- Fully utilize the GPU processing power by assigning several workers to each GPU (not possible natively on SLURM)
- Use SLURM partitions that have more availability

The drawback is that **at a given time, each dataset is processed by exactly one worker**. This means that integrating one dataset is not extremely fast (hence not suited for near-real-time processing). This program is rather intended for bulk processing of hundreds of datasets.

### 2.1 - Using `integrate-mp` on the local machine (`partition = local`)

To distribute the azimuthal integration on the local machine, choose `partition = local` in `[computations distribution]`. A total of `n_workers` workers will be launched, each having `cores_per_worker` CPUs.

The GPUs are used if possible. If a machine has multiple GPUs, they will be shared evenly among the workers. There can (and should) be multiple workers per GPU: the rationale is that one azimuthal integrator is not enough to exploit the full GPU computing power.

**How to tune the number of workers/cores for maximum performance?**

As a rule of thumb:

- 4 workers per GPU (8 for powerful GPUs like V100, A40, A100)
- 4 threads per worker for fast LZ4 decompression (2 are enough for a fast processor without hyperthreading, e.g. AMD EPYC 7543)
- `workers * threads` should not exceed the machine's total number of (virtual) CPUs, and ideally should be about half of them (an example configuration following these rules is given at the end of this page)

### 2.2 - Using `integrate-mp` on multiple machines

The `integrate-mp` command can also be used to distribute computations on several machines, using SLURM. In this case, `partition = gpu` or `partition = p9gpu` should be used in `[computations distribution]` (not `local`).

**When using this mode, `n_workers` has a different meaning.** Here, `n_workers` refers to the number of SLURM jobs. It roughly means how many nodes (or half-nodes) we want to distribute computations on. You can think of these jobs as "big workers": **each of the `n_workers` jobs spawns multiple workers**.

By default:

- 8 workers per SLURM job
- 4 cores per worker
- Each "big worker" (SLURM job) uses one GPU

For example, using `partition = p9gpu` and `n_workers = 4` means that 4 SLURM jobs are submitted (asking for one GPU each), each spawning 8 workers, so there will be 32 workers in total.
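As an illustration, the corresponding configuration could look like the snippet below. This is only a sketch: the section and key names (`[computations distribution]`, `partition`, `n_workers`) are the ones used in this document, and any other keys your setup may require are omitted.

```ini
[computations distribution]
# Multi-machine mode: submit jobs to the p9gpu SLURM partition.
partition = p9gpu
# Number of SLURM jobs ("big workers"), one GPU each.
# Each job spawns 8 workers by default, i.e. 32 workers in total here.
n_workers = 4
```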
```{Warning}
Remember: when using this mode, `n_workers` is the number of SLURM jobs, i.e. roughly the number of half-nodes requested. `n_workers = 4` means 32 workers in total.
```
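For comparison, below is a sketch of a local-machine configuration (section 2.1) following the rules of thumb above. The machine is an assumption made for the sake of the example: one GPU and 32 virtual CPUs.

```ini
[computations distribution]
# Local mode: run everything on this machine, no SLURM involved.
partition = local
# Rule of thumb: 4 workers per GPU (assumed machine: 1 GPU).
n_workers = 4
# 4 threads per worker for fast LZ4 decompression.
cores_per_worker = 4
```

Sanity check: `workers * threads` = 4 × 4 = 16, i.e. half of the assumed 32 virtual CPUs, which matches the last rule of thumb.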