Create HDF5 content from a schema#

The hierarchical content of an HDF5 file can be represented as a Python dict instance.

The function silx.io.dictdump.dicttonx() has the following convention for keys and values to represent HDF5 concepts:

  • Group: the key is any item name and the value is a dict.

  • Dataset: the key is any item name and the value is anything except a dict (e.g. number, string, list, numpy array).

  • Attribute: the key starts with @ or contains @ (e.g. "@<attr-name>" or "<item-name>@<attr-name>"), or is a tuple(<item-name>, <attr-name>); the value is e.g. a number, string, list or numpy array.

  • Soft link: the key starts with > and the value is a str path inside the same file.

  • External link: the key starts with > and the value is a str of the form "<file-path>::<data-path>".

  • Virtual dataset: the key starts with > and the value is a str, list[str], silx.io.url.DataUrl, list[silx.io.url.DataUrl], h5py.VirtualLayout or a dict with dictdump_schema="vds_v1".

  • External binary dataset: the key starts with > and the value is a str, list[str], silx.io.url.DataUrl, list[silx.io.url.DataUrl] or a dict with dictdump_schema="external_binary_link_v1".

The silx.io.dictdump.dicttoh5() function does not parse special key characters or value schemas; instead, it requires explicit data types for the different HDF5 concepts:

  • Group: the value is a dict.

  • Dataset: the value is anything except a dict (e.g. number, string, list, numpy array).

  • Attribute: the key is a tuple(<item-name>, <attr-name>) and the value is e.g. a number, string, list or numpy array.

  • Soft link: the value is an h5py.SoftLink.

  • External link: the value is an h5py.ExternalLink.

  • Virtual dataset: the value is an h5py.VirtualLayout.

  • External binary dataset: the value is a silx.io.dictdumplink.ExternalBinaryLink.

Common Usage#

This example uses a schema describing groups, datasets, attributes, soft links, external links and virtual datasets.

import os
import tempfile

import h5py
import numpy

from silx.io.dictdump import dicttonx

tmpdir = tempfile.mkdtemp()  # any writable directory

x = numpy.arange(110) / 50
y = numpy.random.uniform(size=110)

data = {
    "@NX_class": "NXroot",  # HDF5 attribute
    "@default": "entry",
    "entry": {
        "@NX_class": "NXentry",
        "@default": "process",
        "process": {
            "@NX_class": "NXprocess",
            "@default": "plot2d",
            "description": "Dark-current subtraction",
            "software_name": "MyReductionPipeline",
            "version": "1.0",
            "parameters": {
                "@NX_class": "NXparameters",
                "dark_current_level": 42.0,
                "threshold": 100,
            },
            "data": {
                "@NX_class": "NXcollection",
                ">x": "./raw_data.h5::/1.1/instrument/positioners/samy",  # HDF5 external link
                "y": y,
            },
            "plot1d": {
                ">y": "../data/y",  # HDF5 soft link
                ">x": "../data/x",
                "@signal": "y",
                "@axes": "x",
                "@NX_class": "NXdata",
                "title": "Dark-current subtracted",
            },
            "plot2d": {
                ">y": {  # HDF5 virtual dataset
                    "dictdump_schema": "vds_v1",
                    "shape": (10, 11),
                    "dtype": float,
                    "sources": [
                        {"data_path": "../data/y", "shape": (110,), "dtype": float},
                    ],
                },
                "@signal": "y",
                "@NX_class": "NXdata",
                "title": "Dark-current subtracted",
            },
        },
    },
}

raw_filename = os.path.join(tmpdir, "raw_data.h5")
processed_filename = os.path.join(tmpdir, "processed_data.h5")

with h5py.File(processed_filename, "a") as h5file:
    dicttonx(
        treedict=data,
        h5file=h5file,
        h5path="/",
        update_mode="replace",
        add_nx_class=True,
    )

with h5py.File(raw_filename, "w") as h5file:
    h5file["/1.1/instrument/positioners/samy"] = x

Attributes#

Attributes of groups and datasets can be defined with a key of the form "<item-name>@<attr-name>", or alternatively with "@<attr-name>" to set the attribute on the enclosing group itself. This example mixes both notations:

data = {
    "@NX_class": "NXroot",
    "entry@NX_class": "NXentry",
    "entry": {
        "distance": [0, 1, 2],
        "distance@units": "mm",
    },
}

Virtual Datasets#

Virtual datasets allow merging, slicing and reshaping other datasets in the same file or external files.

This example uses a list of URLs that are concatenated into one 3D dataset while selecting an image ROI of [20:30, 40:50]:

">images_roi": [
  "data0.h5?path=/group/dataset&slice=:,20:30,40:50",
  "data1.h5?path=/group/dataset&slice=:,20:30,40:50",
  "data2.h5?path=/group/dataset&slice=:,20:30,40:50"
]

Warning

When defining a virtual dataset with a list of URLs, the source files are opened and inspected. In addition, there is no flexibility in how the sources are merged. See Merging URLs for details on how data is merged (preserve shape vs. stack vs. concatenate behavior).

Here is an equivalent schema that does not open the source files but requires all sources to have the same shape and dtype:

">images_roi": {
  "dictdump_schema": "vds_urls_v1",
  "source_shape": (5, 50, 60),
  "source_dtype": "uint16",
  "sources": [
      "data0.h5?path=/group/dataset&slice=:,20:30,40:50",
      "data1.h5?path=/group/dataset&slice=:,20:30,40:50",
      "data2.h5?path=/group/dataset&slice=:,20:30,40:50"
    ],
}

Here is an equivalent schema that does not open the source files and allows defining the way the sources are merged together:

">images_roi": {
  "dictdump_schema": "vds_v1",
  "dtype": "uint16",
  "shape": (15, 10, 10),
  "sources": [
    {
      "data_path": "/group/dataset",
      "dtype": "uint16",
      "file_path": "data0.h5",
      "shape": (5, 50, 60),
      "source_index": (
        slice(None, None, None),
        slice(20, 30, None),
        slice(40, 50, None)
      ),
      "target_index": slice(0, 5, None)
    },
    {
      "data_path": "/group/dataset",
      "dtype": "uint16",
      "file_path": "data1.h5",
      "shape": (5, 50, 60),
      "source_index": (
        slice(None, None, None),
        slice(20, 30, None),
        slice(40, 50, None)
      ),
      "target_index": slice(5, 10, None)
    },
    {
      "data_path": "/group/dataset",
      "dtype": "uint16",
      "file_path": "data2.h5",
      "shape": (5, 50, 60),
      "source_index": (
        slice(None, None, None),
        slice(20, 30, None),
        slice(40, 50, None)
      ),
      "target_index": slice(10, 15, None)
    }
  ]
}

External Binary Data#

External binary data can be concatenated as a dataset.

This example uses a list of TIFF files to be concatenated in one 3D dataset:

">images": ["data0.tiff", "data1.tiff", "data2.tiff", "data3.tiff", "data4.tiff"]

Warning

When defining an external binary dataset with a list of filenames, the source files are opened and inspected. In addition, the HDF5 dataset stores absolute file names, so moving the data will break the link. See Merging URLs for details on how data is merged (preserve shape vs. stack vs. concatenate behavior).

Here is an equivalent schema that can be used for any binary data which is contiguous and uncompressed:

">images": {
    "dictdump_schema": "external_binary_link_v1",
    "dtype": numpy.uint16,
    "shape": (5, 50, 60),
    "sources": [
        {"file_path": "data0.tiff", "offset": 196, "size": 6000},
        {"file_path": "data1.tiff", "offset": 196, "size": 6000},
        {"file_path": "data2.tiff", "offset": 196, "size": 6000},
        {"file_path": "data3.tiff", "offset": 196, "size": 6000},
        {"file_path": "data4.tiff", "offset": 196, "size": 6000},
    ],
}
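Under the hood this maps to HDF5's external dataset storage, which h5py exposes via the external argument of create_dataset(). The following sketch does not use silx; it reads a hypothetical raw binary file produced with numpy (absolute paths avoid the working-directory-relative resolution of external files):

```python
import os
import tempfile

import h5py
import numpy

tmpdir = tempfile.mkdtemp()

# A hypothetical raw binary file: contiguous, uncompressed uint16 values.
arr = numpy.arange(24, dtype="uint16").reshape(2, 3, 4)
binary_path = os.path.join(tmpdir, "data.bin")
arr.tofile(binary_path)

h5_path = os.path.join(tmpdir, "external.h5")
with h5py.File(h5_path, "w") as h5file:
    # One (file name, offset in bytes, size in bytes) tuple per segment.
    h5file.create_dataset(
        "images",
        shape=arr.shape,
        dtype=arr.dtype,
        external=[(binary_path, 0, arr.nbytes)],
    )

with h5py.File(h5_path, "r") as h5file:
    readback = h5file["images"][()]
print(readback.shape)
```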

Merging URLs#

One or several URLs can be merged into a single virtual dataset or external binary dataset.

When providing a single URL as a string (rather than as a one-element list), the merged dataset has the same shape as the single source.

When providing a list of URLs, even a list with only one element, the sources are stacked when the source rank is ndim < 3 and concatenated along the first axis when ndim >= 3.

Examples for Nt sources:

  • source shape=() : VDS shape (Nt,)

  • source shape=(N0,) : VDS shape (Nt,N0)

  • source shape=(N0,N1) : VDS shape (Nt,N0,N1)

  • source shape=(N0,N1,N2) : VDS shape (Nt*N0,N1,N2)

  • source shape=(N0,N1,N2,N3) : VDS shape (Nt*N0,N1,N2,N3)

  • source shape=(N0,N1,N2,N3,N4) : VDS shape (Nt*N0,N1,N2,N3,N4)

Warning

Since the sources are merged in a single dataset their shapes must be consistent.
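The shape arithmetic above follows numpy.stack for ndim < 3 and numpy.concatenate along the first axis for ndim >= 3. A quick numpy-only illustration of the rule (not involving silx):

```python
import numpy

n_sources = 3

# ndim < 3: sources are stacked, adding a new leading axis.
flat = [numpy.zeros((4,)) for _ in range(n_sources)]
stacked_shape = numpy.stack(flat).shape        # (3, 4)

# ndim >= 3: sources are concatenated along the first axis.
cubes = [numpy.zeros((2, 5, 6)) for _ in range(n_sources)]
concat_shape = numpy.concatenate(cubes).shape  # (6, 5, 6)

print(stacked_shape, concat_shape)
```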