Image processing performance optimization

The idea of this tutorial is to provide some optimization for image processing. In this example we decompress a CBF image from a Pilatus 6M, take the log-scale and calculate the historgram of pixel intensities, before plotting. This very simple opertion takes >300ms which makes it unpractical for live-display of images coming from a detector.

In [1]:
%pylab nbagg
Populating the interactive namespace from numpy and matplotlib
In [2]:
import time, os
start_time = time.time()
import silx
print("Using silx version ",silx.version)
from silx.resources import ExternalResources
downloader = ExternalResources("pyFAI", "http://www.silx.org/pub/pyFAI/testimages", "PYFAI_DATA")
fname = downloader.getfile("powder_200_2_0001.cbf")
print("filename", os.path.basename(fname))
import fabio
nbins = 1000
Using silx version  0.7.0-dev0
filename powder_200_2_0001.cbf
/users/kieffer/VirtualEnvs/py3/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
In [3]:
%%time

#Display an image and the histogram of values (in log scale)
img = fabio.open(fname).data
log_img = numpy.arcsinh(img) # arcsinh is well behaved log-like function
his, edges = numpy.histogram(log_img, nbins)
CPU times: user 252 ms, sys: 32 ms, total: 284 ms
Wall time: 280 ms
In [4]:
fig, ax = subplots(1,2,)
center = (edges[1:] + edges[:-1])/2.0 # this is the center of the bins
ax[1].imshow(log_img, cmap="inferno")
ax[0].plot(center,his)
Data type cannot be displayed: application/javascript
Out[4]:
[<matplotlib.lines.Line2D at 0x7fdf03bb7cf8>]

Optimized GPU code

This code prepares a GPU to be able to perform the same calculation.

There are 3 operations: * Decompress the CBF image * Calculate the log scale (arcsinh values) * Calculate the histogram

Both can be performed on the GPU without additional memory transfer

In [5]:
#switch this to "CPU" to have a fail safe
devicetype = "GPU"
from silx.opencl.codec.byte_offset import ByteOffset
from silx.opencl.image import ImageProcessing
import pyopencl, pyopencl.array, pyopencl.elementwise
cbf = fabio.cbfimage.CbfImage()
bo = ByteOffset(os.path.getsize(fname), img.size, devicetype=devicetype)
ash = pyopencl.elementwise.ElementwiseKernel(bo.ctx,
            arguments="float* data, float* res",
            operation="res[i] = asinh(data[i])",
            name='arcsinh_kernel')
ip = ImageProcessing(template=img, ctx=bo.ctx)
res = pyopencl.array.empty(bo.queue, img.shape, dtype=float32)
In [6]:
%%time
raw = cbf.read(fname, only_raw=True)
dec = bo(raw, as_float=True)
ash(dec, res)
his, edges =  ip.histogram(res, nbins, copy=False)
log_img = res.get()
CPU times: user 16 ms, sys: 16 ms, total: 32 ms
Wall time: 28.8 ms
In [7]:
fig, ax = subplots(1,2,)
center = (edges[1:] + edges[:-1])/2.0 # this is the center of the bins
ax[1].imshow(log_img, cmap="inferno")
ax[0].plot(center,his)
Data type cannot be displayed: application/javascript
Out[7]:
[<matplotlib.lines.Line2D at 0x7fdebd059390>]

The results are the same and the processing time is 6x faster. Hence one can envisage real-time visualization of images comming from detectors.

Investigation of the timings

In [8]:
ip.set_profiling(True)
bo.set_profiling(True)
ip.reset_log()
bo.reset_log()
raw = cbf.read(fname, only_raw=True)
dec = bo(raw, as_float=True)
ash(dec, res)
his, edges =  ip.histogram(res, nbins, copy=False)
log_img = res.get()
import os
print(os.linesep.join(bo.log_profile()))
print(os.linesep.join(ip.log_profile()))

Profiling info for OpenCL ByteOffset
                                   copy raw H -> D:  0.972ms
                                       memset mask:     0.091ms
                                    memset counter:     0.005ms
                                   mark exceptions:     0.092ms
                               copy counter D -> H:  0.002ms
                                  treat_exceptions:     0.027ms
                                       double scan:     0.200ms
                                      copy_results:     0.141ms
________________________________________________________________________________
                              Total execution time:     1.530ms

Profiling info for OpenCL ImageProcessing
                                        copy D->D :  0.104ms
                                    max_min_stage1:     0.069ms
                                    max_min_stage2:     0.009ms
                                         histogram:     0.098ms
________________________________________________________________________________
                              Total execution time:     0.281ms

Conclusion

This notebook explains how to optimize some heavy numerical processing to 10x speed-up to achieve real-time image processing using GPU.