Motivations
As defined in the VTKHDF format specification, all partitions of a given partitioned dataset store their data in a common HDF5 dataset (H5D), and the reader uses offsets to locate the data for each partition. The same mechanism is used for temporal data, even when the dataset is not partitioned. This choice was made to increase reading performance: the reader implements these reads efficiently, using a cache for partitioned data.
However, there has been increasing demand for writing VTKHDF data across multiple files: one file per time step, per partition, or per leaf dataset of a composite assembly. This proposal addresses that demand, suggesting a unified approach to splitting VTKHDF partitioned datasets across multiple files.
Technical proposal
The specification of the format will not change, and the VTKHDF reader will not be impacted. To make this possible, we leverage two native HDF5 features:
- External links: a group in an HDF5 file can be linked to a group in a different file, making it easy to distribute data between files. The VTKHDF writer can leverage this so that every leaf of a composite dataset is stored in a different file, while the main file only stores the assembly and links to these individual files.
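To illustrate the mechanism itself (not the writer's actual code), here is a minimal h5py sketch of an external link; the file names, group names, and point values are placeholders:

```python
import h5py

# Write a "leaf" file holding the actual data for one block.
with h5py.File("block0.vtkhdf", "w") as leaf:
    leaf.create_dataset("Points", data=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

# The main file stores only the assembly structure; the block group
# is an external link that HDF5 resolves at read time.
with h5py.File("main.vtkhdf", "w") as main:
    main["Assembly/Block0"] = h5py.ExternalLink("block0.vtkhdf", "/")

# Reading through the main file transparently follows the link.
with h5py.File("main.vtkhdf", "r") as main:
    pts = main["Assembly/Block0/Points"][:]
```

The main file stays lightweight because it never copies the block's arrays; it only records the target file name and path.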
- Virtual datasets: a given dataset can be made of multiple virtual parts, each of which can be a dataset in an external file. Using this, we can build valid VTKHDF files representing individual partitions, plus a main file that assembles its data using virtual datasets referencing the individual partition files.
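Again as a sketch of the underlying HDF5 feature rather than the writer itself, the following h5py example maps one contiguous dataset onto two partition files (names and sizes are illustrative):

```python
import h5py
import numpy as np

# Each partition file is a valid standalone dataset.
for i in range(2):
    with h5py.File(f"part{i}.h5", "w") as f:
        f.create_dataset("data", data=np.arange(3, dtype="i8") + 3 * i)

# The main file exposes a single 6-element dataset whose storage is
# mapped onto the two partition files, without copying any values.
layout = h5py.VirtualLayout(shape=(6,), dtype="i8")
for i in range(2):
    layout[3 * i : 3 * (i + 1)] = h5py.VirtualSource(
        f"part{i}.h5", "data", shape=(3,)
    )

with h5py.File("whole.h5", "w") as f:
    f.create_virtual_dataset("data", layout, fillvalue=-1)

# Reading the virtual dataset reads through to part0.h5 and part1.h5.
with h5py.File("whole.h5", "r") as f:
    combined = f["data"][:]
```

This is what lets the main file keep the single-H5D layout the reader already expects, while the bytes live in the partition files.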
vtkHDFWriter will expose three new options. Every combination of them should be valid.
- UseExternalComposite (only for PartitionedDatasetCollection): when active, write each non-composite block of the assembly in a separate .vtkhdf file. Each block file will be readable either individually in the VTKHDF format, or through the "master/main" VTKHDF file that references them. The main file itself only contains metadata and references to the block files, and should be lightweight.
- UseExternalTimesteps (only for temporal datasets): when active, write each time step in a separate non-temporal VTKHDF file, as well as a main file referencing them.
- UseExternalPartitions (for partitioned datasets, or temporal datasets with more than one partition per time step): when active, write partitions in individual files.
Caveats
- Static meshes and arrays are not supported, because each individual part must remain readable in a standalone way.
- The writer uses relative paths for external links, so the main file cannot be read if the VTKHDF files it references are moved from their expected locations.
Do you have any feedback on this proposal?