Processing: concepts and classes

Nabu can be used as a library, either for its simple processing utilities or to build a full pipeline on top of it. In either case, it is useful to know how things are organized internally.

When it comes to processing, nabu organizes its entities (classes, really) into three “class types”, listed here from a bottom-up perspective:

  • Processing

  • Pipeline

  • Reconstructor

This page explains what each type is about. In short, processing classes are assembled to form a pipeline, and pipeline(s) are configured/assembled by a Reconstructor.

NB: Nabu can also be explored through another conceptual classification: its modules, which roughly reflect “steps” in a processing pipeline (i/o, pre-processing, reconstruction, post-processing).

Processing objects

Processing entities are functions or classes acting primarily on arrays (numpy arrays, or extensions like pycuda/pyopencl arrays).

These functions/classes should

  1. Be as straightforward to use as possible

  2. Primarily act on arrays

  3. Have a restricted scope (“do one thing and do it well”)

This is an opinionated design decision of Nabu.

The rationale for points (1) and (2) is fast prototyping (for example from an IPython console, notebook or script), which is probably one factor in the success of tomopy. Numpy arrays are the ubiquitous data container in python scientific libraries.

Point (3) makes the code more robust: it is easier to read, and easier to cover with unit tests.

These “atomic” building blocks are meant to be chained together to form a series of processing steps (a pipeline).

Examples of such “processing classes” are FlatField, PaganinPhaseRetrieval, Backprojector, …
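
As an illustration, a processing function in this spirit could look like the following sketch. This is a made-up example rather than actual nabu code, but it shows the three principles above: simple usage, arrays in / arrays out, and a restricted scope.

import numpy as np

def flatfield_normalize(radios, flat, dark):
    # Made-up example (not the actual nabu API): flat-field normalization
    # acting directly on numpy arrays, doing one well-defined thing.
    return (radios - dark) / (flat - dark)

radios = 1.0 + np.random.rand(100, 64, 64)  # stack of radios: (n_angles, n_z, n_x)
flat = 2.0 * np.ones((64, 64))
dark = np.zeros((64, 64))
normalized = flatfield_normalize(radios, flat, dark)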

Pipelines

A “Nabu pipeline” is, quite naturally, an assembly of “processing objects”. For example, a simple pipeline can be obtained by chaining the objects DataReader, FlatField and FBP. This is what would correspond to a python script implementing procedural steps one after the other. However, as needs usually vary from one beamline to another, “Nabu pipelines” have to be flexible.

The processing objects should be usable in different ways without editing the code. Therefore, the pipeline should be made configurable through an external user configuration (e.g. a configuration file). The “Nabu pipeline” must then ingest this user configuration and use it to configure its internal processing objects.
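
For example, such a user configuration file could look like the snippet below. This is an illustrative sketch: the section and parameter names are close to, but not necessarily identical to, the ones actually accepted by nabu.

[dataset]
location = /path/to/my_dataset.nx

[preproc]
flatfield = 1

[phase]
method = paganin
delta_beta = 100.0

[reconstruction]
method = FBP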

To sum up, Nabu pipelines are made of the following ingredients:

  • Processing building blocks (e.g. FlatField)

  • Information on how to use (configure) these building blocks

  • Information on the dataset

The pipeline therefore has to extract the user configuration, translate it into actual processing class parameters, and parse the dataset.

FullFieldPipeline and FullRadiosPipeline are examples of such pipelines.
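
Concretely, part of a pipeline's job is to translate the user-facing options into parameters of the processing classes. The following sketch gives the idea; the configuration keys are made up, and the exact PaganinPhaseRetrieval import path and signature should be checked against the installed nabu version.

from nabu.preproc.phase import PaganinPhaseRetrieval  # import path assumed from the current nabu layout

def configure_phase_retrieval(user_config, radio_shape):
    # Translate user-facing options into a configured processing object.
    # The configuration keys ("phase", "method", ...) are illustrative.
    phase_options = user_config.get("phase", {})
    if phase_options.get("method", "none").lower() != "paganin":
        return None
    return PaganinPhaseRetrieval(
        radio_shape,
        distance=phase_options["distance"],
        energy=phase_options["energy"],
        delta_beta=phase_options["delta_beta"],
        pixel_size=phase_options["pixel_size"],
    )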

Things could stop here, but nabu makes the following design decision: a pipeline is used to process a (sub)volume that fits in memory.
This might seem at odds with one of nabu's primary purposes, which is handling large amounts of data. The following section shows how this limitation is handled.

Reconstructor

Reconstructors are the final, ready-to-use objects to perform a full volume reconstruction. A Reconstructor creates/configures/manages Pipeline objects, much as a Pipeline assembles Processing objects together.

We may wonder why such objects are needed in the first place. After all, the “Pipeline objects” described above could be made to handle data that does not fit in memory. The short answer is work distribution.

When it comes to distributing the work (reconstructing sub-volumes), there are two possible approaches:

  1. Distribute the work within the Pipeline object.

  2. Distribute the work outside the Pipeline object.

Approach (1) means that each Pipeline class must implement the workload distribution logic. This distribution logic depends on at least the following factors:

  • How data is handled (groups of vertical images, horizontal slabs, etc.)

  • What is the target: local machine, task scheduler (SLURM), etc.

This means that each Pipeline class must implement at least four distribution logics (one for each combination of data handling and execution target).

Instead, we follow approach (2): a Pipeline object is bound to a certain chunk/group size, computed so that the subvolume fits in memory. We therefore need a “Pipelines manager”, which in our case is called Reconstructor, to handle the logic of distributing the work.

The Reconstructor is responsible for determining how a volume will be reconstructed by one or several Pipeline objects. It notably has to estimate the available resources (host/GPU memory, number of CPU cores, etc), and possibly distribute the workload.

Understanding the class types through a simple example

Suppose you want to build a very simple processing pipeline consisting of the following steps:

  • Read data

  • Perform flat-field normalization

  • Transpose the volume (to get sinograms)

  • Perform FBP reconstruction

  • Save the resulting image

As there are five steps, the pipeline will be obtained by chaining five “building blocks”: Reader, FlatField, Transpose, FBP, Writer - each of them can be a custom function or a built-in nabu class/function.

Our simple pipeline - let’s call it SimplePipeline - consists of assembling the five aforementioned building blocks. This SimplePipeline class (or function) will have to implement some logic, e.g. pass the result of Reader to FlatField.
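
A minimal sketch of such a class is given below. It is not actual nabu code: the building blocks (reader, flatfield, fbp, writer) are assumed to be processing objects configured beforehand, and their method names are made up for the example.

class SimplePipeline:
    # Illustrative sketch, not actual nabu code.
    def __init__(self, size, reader, flatfield, fbp, writer):
        self.size = size  # number of images processed per call
        self.reader = reader
        self.flatfield = flatfield
        self.fbp = fbp
        self.writer = writer

    def process(self, subset):
        start, end = subset
        radios = self.reader.read(start, end)        # read a subset of the data
        radios = self.flatfield.normalize(radios)    # flat-field normalization
        sinos = radios.transpose(1, 0, 2)            # (n_angles, n_z, n_x) -> (n_z, n_angles, n_x)
        recs = [self.fbp.reconstruct(sino) for sino in sinos]  # FBP on each sinogram
        self.writer.write(recs, start)               # save the resulting images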

One problem we face almost immediately is that our SimplePipeline cannot “ingest” (process) a whole dataset in a single pass. Usually, the data volume is too big to be transposed in one step.
NB: for simpler “image processing pipelines”, where no transposition is needed, this would not be a problem. The pipeline would process one image at a time (or several simultaneously to hide disk latency) in a loop. But because of the very nature of tomography reconstruction (one output voxel needs information from all the input radios), things are more complicated.

Therefore, SimplePipeline must be able to process a subset (sub-volume) of the dataset. Let’s assume that this pipeline processes groups of radios; it would then be called as follows:

pipeline = SimplePipeline(size=100, ...) # process by group of 100 images

pipeline.process(subset=(0, 100))
pipeline.process(subset=(100, 200))
# ...

The Reconstructor classes in nabu are simply classes encapsulating the above logic. They automatically compute the subset size from the machine’s available memory, and perform the successive calls to pipeline.process().
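
To make this concrete, here is a hedged sketch of this logic. It is not the actual nabu implementation: the memory estimation, the safety factor and the pipeline_factory argument are illustrative choices made for the example.

import psutil  # used here only to query the available host memory

def reconstruct_volume(pipeline_factory, n_images, image_size_bytes, safety_factor=0.8):
    # Illustrative sketch of a Reconstructor-like loop, not actual nabu code.
    # 'pipeline_factory' is assumed to build a pipeline bound to a given group size,
    # e.g. lambda size: SimplePipeline(size, reader, flatfield, fbp, writer)
    available_mem = psutil.virtual_memory().available * safety_factor
    # Compute how many images fit in memory at once
    group_size = max(1, int(available_mem // image_size_bytes))
    pipeline = pipeline_factory(group_size)
    # Successive calls on subsets of the dataset, as in the snippet above
    for start in range(0, n_images, group_size):
        pipeline.process(subset=(start, min(start + group_size, n_images)))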