VTKHDF proposal: Support splitting datasets in multiple files

Motivations

As defined in the VTKHDF format specification, all partitions of a given partitioned dataset store their data in a common HDF5 dataset (H5D), and the reader uses offsets to know where to read the data for a given partition. The same mechanism is used for temporal data, even when the dataset is not partitioned. This choice was made to increase reading performance: the reader implements these reads efficiently, using a cache for partitioned data.

However, there has been increasing demand for writing VTKHDF data to multiple files: one per time step, per partition, or per partitioned dataset of a composite assembly. This proposal addresses that need, suggesting a unified approach to splitting VTKHDF partitioned datasets across multiple files.

Technical proposal

The specification of the format will not change, and the VTKHDF reader will not be impacted by this change. To make this possible, we leverage two native HDF5 features:

  • External links: a group in an HDF5 file can be linked to a group in a different file, making it easy to distribute data between files. The VTKHDF writer can leverage this so that every leaf of a composite dataset is stored in a different file, while the main file only stores the assembly and links to the individual files (a minimal sketch follows this list).
  • Virtual datasets: a given dataset can be assembled from multiple virtual parts, which can be datasets in external files. Using this, we can build valid VTKHDF files representing individual partitions, plus a main file whose datasets are virtual and reference the data in the individual partition files.
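
For reference, here is a minimal h5py sketch of the external-link mechanism described above. The file name “block0.vtkhdf” and the linked “/VTKHDF” group are only illustrative, not part of the specification.

```python
# Minimal sketch of an HDF5 external link (illustrative names only):
# the main file exposes a group whose contents actually live in another file.
import h5py

with h5py.File("main.vtkhdf", "w") as f:
    # "Block0" in the main file points to the "/VTKHDF" group of block0.vtkhdf.
    f["Block0"] = h5py.ExternalLink("block0.vtkhdf", "/VTKHDF")
```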

vtkHDFWriter will expose three new options; every combination of them should be valid. A hypothetical usage sketch follows the list.

  • UseExternalComposite (only for PartitionedDatasetCollection): when active, write each non-composite block of the assembly to a separate .vtkhdf file. Each block file will be readable either individually as a VTKHDF file or through the “master/main” VTKHDF file that references them. The main file itself only contains metadata and references to the block files, and should be lightweight.

  • UseExternalTimesteps (only for temporal datasets): when active, write each time step to a separate, non-temporal VTKHDF file, as well as a main file referencing them all.

  • UseExternalPartitions (for partitioned datasets, or temporal datasets that have more than one partition per time step): when active, write each partition to an individual file.
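
As a hypothetical usage sketch only: the option names below come from this proposal, and the Set… methods assume VTK’s usual conventions for boolean writer properties; none of this API exists yet as written.

```python
# Hypothetical sketch: option names are taken from the proposal above and the
# Set... accessors assume VTK's usual boolean-property conventions.
from vtkmodules.vtkCommonDataModel import vtkPartitionedDataSetCollection
from vtkmodules.vtkIOHDF import vtkHDFWriter

pdc = vtkPartitionedDataSetCollection()  # placeholder: fill with real blocks/partitions

writer = vtkHDFWriter()
writer.SetInputData(pdc)
writer.SetFileName("simulation.vtkhdf")
writer.SetUseExternalComposite(True)   # one file per non-composite block
writer.SetUseExternalTimesteps(True)   # one file per time step
writer.SetUseExternalPartitions(True)  # one file per partition
writer.Write()
```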

Caveats

  • Static meshes or arrays are not supported, because we want individual parts to be readable on their own.
  • The writer uses relative paths for external links, so the main file cannot be read if the VTKHDF files it references are not at the expected location relative to it.

Do you have any feedback on this proposal?

tagging @mwestphal @lgivord @hakostra @cory.quammen

Unless I’m mistaken, this proposal is fully backward compatible with the previous specification, so it should not impact performance for people who rely on offsets and do not use separate files.

Of course, I mean that performance could only be impacted when these new options are used; otherwise, there would not be any regression.

Looks like a good enhancement.

Are you aware of the virtual dataset (VDS) functionality in the HDF5 library?

Ref HDF5: HDF5 Virtual Dataset (VDS) Documentation

Ref h5py: Virtual Datasets (VDS) — h5py 3.11.0 documentation

Using VDS, one can achieve the same thing without any change to the reader or the format at all (although the writer that writes the “master” file must of course know about it). It essentially allows you to specify that the first 50 points in a dataset map to a dataset in one specific file, the next 50 points to another dataset in another file, and so on. The functionality uses the well-known hyperslab definitions of HDF5, so you can slice in any dimension as you like.
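
To illustrate the mapping described here, a rough h5py sketch; the file names, the “Points” dataset, and the sizes are invented for the example.

```python
# Rough sketch: the first 50 points of the "master" dataset come from part_0.h5,
# the next 50 from part_1.h5 (all names and sizes are invented).
import h5py
import numpy as np

# Write two standalone "partition" files.
for i in range(2):
    with h5py.File(f"part_{i}.h5", "w") as f:
        f.create_dataset("Points", data=np.random.rand(50, 3))

# Build the master file: a virtual dataset stitches the two parts together.
layout = h5py.VirtualLayout(shape=(100, 3), dtype="f8")
for i in range(2):
    layout[i * 50:(i + 1) * 50, :] = h5py.VirtualSource(f"part_{i}.h5", "Points", shape=(50, 3))

with h5py.File("master.h5", "w") as f:
    f.create_virtual_dataset("Points", layout)
```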

Hi Håkon, thanks for pointing that out! I overlooked that HDF5 feature, but it looks like it solves our problem. I’ll see what it can do in practice. We may not need to change the VTKHDF specification to provide these new options for the writer, which is great news.

Hello, after experimenting with the Virtual Dataset feature, it seems to correspond to our needs, and is completely transparent to the reader. Thus, the specification will not need to change to support these new features. The original post has been edited to reflect that. Thanks again for your feedback!

Glad to be of any use. I (we) have been using this feature since it was introduced in HDF5 1.10 and it really works great and can solve many problems like this. I think the updated idea/specification is good!

Hey guys,

Thank you @Louis_Gombert for bringing your thoughts here.

In the introduction, a little clarification:

I admit that I don’t understand how the mechanism in place today guarantees a given level of reading performance: much depends on the size of the simulation data and the choice of chunk size.

Re-reads (even with offsets) are avoided by activating the cache mechanism.
It seems to me that the cached data is used directly, so as not to double the memory cost.
The additional cost is therefore zero when you have a single partition on the server.

But this is no longer quite true when you activate the reader’s Merge mode.
Indeed, the (successive) application of the vtkAppendDataSet filter involves creating a new mesh with its data, resulting in a (quasi) duplication of the data.
Furthermore, it has been noted that adding or removing a value field (on the nodes or cells) while loading again triggers the costly application of the vtkAppendDataSet filter. The same happens if the mesh does not evolve over the time steps. This is something that will require intervention in the future (use the force… thanks to the wonderful virtual arrays).

First feedback:

I have just become aware of your proposal, which has the advantage of not impacting the operation of the current reader.

I find the proposal very interesting with the additional information from @hakostra.

However, it seems to me that in an HPC context (a large simulation with a large number of partitions) the write performance for one simulation time will be limited by the number of files, which is fixed, if I understand correctly, by the number of blocks in my MBDS hierarchical description.

Unless I’m mistaken, it seems to me that this proposal does not allow me to:

  • write in parallel on different files to output a first simulation time (1);
  • then complete these same files by writing the following simulation times (2).

(1) which a priori makes it possible to improve write output times
(2) which allows you to limit the number of files, which is consistent with the desire to limit inodes on a supercomputer

Finally, I find the UseExternalTimesteps proposal very satisfying, @Louis_Gombert… all the more so if we can associate with it the global FieldData specific to the simulation, such as the cycle number, the simulation’s resolution time step, and other characteristics.

Hi @Jacques-Bernard

I admit that I don’t understand how the mechanism in place today guarantees a given level of reading performance: much depends on the size of the simulation data and the choice of chunk size.

Reading performance can indeed vary depending on the chosen chunk size. Virtual Datasets still use chunking, and should not impact performance for large simulations, where it matters the most.

Re-reads (even with offsets) are avoided by activating the cache mechanism.

Correct, vtkHDFReader uses a cache for partitioned data when merge mode is disabled.

write in parallel on different files to output a first simulation time (1);

This proposal allows writing the multiple files corresponding to different partitions in parallel on different processes. Only the “main” file needs to be written serially, aggregating data from the other files using virtual datasets.
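
A minimal mpi4py/h5py sketch of that pattern (file names, the “Points” dataset, and the sizes are invented): every rank writes its own partition file independently, and rank 0 alone then assembles the lightweight main file.

```python
# Sketch: each MPI rank writes its own partition file; rank 0 then writes a
# small "main" file whose virtual dataset references all partitions.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1000  # points per partition (invented)

with h5py.File(f"partition_{rank}.h5", "w") as f:
    f.create_dataset("Points", data=np.random.rand(n, 3))

comm.Barrier()  # wait until every partition file exists

if rank == 0:
    layout = h5py.VirtualLayout(shape=(size * n, 3), dtype="f8")
    for r in range(size):
        layout[r * n:(r + 1) * n, :] = h5py.VirtualSource(f"partition_{r}.h5", "Points", shape=(n, 3))
    with h5py.File("main.h5", "w") as f:
        f.create_virtual_dataset("Points", layout)
```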

then complete these same files by writing the following simulation times

When using the “UseExternalPartitions” option, the partitions of each vtkPartitionedDataSet are written to different files. With “UseExternalTimesteps”, you can separate the data of each time step into different files. So when you activate “UseExternalPartitions” but not “UseExternalTimesteps”, you can append data to the same partition file for each time step.
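
As a rough illustration of that append pattern (not the exact VTKHDF layout, and all names are invented), an HDF5 dataset with an unlimited dimension can simply grow by one time step on each write:

```python
# Rough sketch of appending a new time step to an existing per-partition file
# (invented names; not the exact VTKHDF layout, which uses flat arrays + offsets).
import h5py
import numpy as np

n_points = 100  # invented

with h5py.File("partition_0.h5", "a") as f:
    if "Temperature" not in f:
        # Unlimited first dimension so later time steps can be appended.
        f.create_dataset("Temperature", shape=(0, n_points),
                         maxshape=(None, n_points), dtype="f8")
    dset = f["Temperature"]
    dset.resize(dset.shape[0] + 1, axis=0)   # grow by one time step
    dset[-1] = np.random.rand(n_points)      # write this step's values
```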

Hi @Louis_Gombert

Reading performance can indeed vary depending on the chosen chunk size. Virtual Datasets still use chunking, and should not impact performance for large simulations, where it matters the most.

What I wanted to point out in the introduction is that, for me, there is no performance guarantee, because it depends on many parameters, including the chunk size set by the user according to their simulation.
Now, I completely agree with you that the use of virtual layouts will have almost no impact.

This proposal allows writing the multiple files corresponding to different partitions in parallel on different processes. Only the “main” file needs to be written serially, aggregating data from the other files using virtual datasets.

Yes, for one time step.
But if you want to do the same thing for the second time step, you will then have to create just as many new files… which is not acceptable from the point of view of the parallel file system (limiting inodes, ending the simulation with the largest possible files). That’s where it all goes wrong…

When using the “UseExternalPartitions” option, the partitions of each vtkPartitionedDataSet are written to different files. With “UseExternalTimesteps”, you can separate the data of each time step into different files. So when you activate “UseExternalPartitions” but not “UseExternalTimesteps”, you can append data to the same partition file for each time step.

…but in that case, the number of files will explode.

This document (https://www.alcf.anl.gov/sites/default/files/2022-07/HDF5-Foundation-parallel.pdf) gives an interesting insight into one way of doing HDF5 in an HPC framework with MPI-IO. Using a virtual layout allows you to split a table across several files.
Parallel write performance is obtained by adjusting the number of stripes and their size in the configuration of the parallel file system (Lustre in this case).
But, echoing my first point, the user must not get the values of these parameters wrong.
In my opinion, this is why obtaining performance with HDF5/MPI-IO is not the easiest thing… but that is not really your problem. :wink:
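
For reference, a minimal h5py sketch of that collective write pattern through the “mpio” driver (it requires an h5py build against parallel HDF5; the file and dataset names are invented, and Lustre striping is configured on the file system side, not in this code):

```python
# Minimal sketch of HDF5 + MPI-IO via h5py's 'mpio' driver (invented names;
# requires h5py built against parallel HDF5).
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1000  # values written by each rank (invented)

with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("Values", shape=(size * n,), dtype="f8")
    # Each rank writes its own contiguous slab of the shared dataset.
    dset[rank * n:(rank + 1) * n] = np.random.rand(n)
```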

Following this approach, I would therefore say that this proposal seems satisfactory (aside from the vtkAppendDataSet aspect).

If we go down that path, my question about externalizing the global fields with the simulation times via UseExternalTimesteps is no longer relevant.

One last quick question (I’m Mister Bahlsen!): do you use HDF5 with MPI-IO for loading, or do you use HDF5 in sequential mode (per server)?

Thank you @Louis_Gombert for these details.

The file format specification allows you to easily construct a writer, or your own application, that writes the output exactly as you want it. That is the big advantage of vtkhdf compared to the other file formats VTK uses. For instance, I have a parallel writer in Fortran that writes time-dependent PolyData from my MPI application into a single vtkhdf file. That was quite straightforward to implement from the file format specification.

In my opinion, the most important thing is that the file format itself is specified in a way that allows efficient reading and writing, independent of the actual implementation. Implementation details can always be changed easily, but the specification is more or less carved in stone once it is out. As far as I can see, that is the case.

The second most important thing is that the reader is performant and reads vtkhdf files in an efficient manner. As far as I can see, that is also the case here.

Then the official writer is the lowest priority here… Not that it is unimportant, but I think it is a lot of work to make the writer fit every HPC application and file system combination in the universe.

The VTKHDF reader currently does not use HDF5 MPI-IO, but VTK’s MPI implementation, which makes sense for a VTK filter and is easier to integrate with the rest of the VTK API.

Hi Håkon,

It is with our HPC experience in mind that we are looking with great interest at the VTKHDF implementation, which could become an open-source alternative to our existing proprietary solution; we are even supporting it.
The criteria of performance, storage volume, and above all scalability matter to us both for massively parallel writing and for reading (whether by the writing code, another code, or the analysis tool) in different forms of execution.
In fact, our evaluation covers both the VTKHDF implementation and HDF5 itself.

The devil is always in the details, which is why it is interesting to exchange ideas in order to adapt and evaluate in a different context.

Concerning the VTKHDF reader, and depending on how we use it, we notice significant performance losses in certain cases (indicated above); nothing blocking for the moment, even if it raises a worrying signal as we increase the number of cells and partitions.

I completely agree with you on the low priority of building an “intelligent” VTKHDF writer; nevertheless, the reflection/proposal presented here by Kitware lets us find solutions to the difficult points we had identified.
Thus, and I have only just realized it, ExternalLink makes it possible to share common information written in another file, such as the list of times or the values of global fields. :wink:

I completely understand, functionality above all.
It could nevertheless be interesting in an HPC context to have a reader using HDF5 with MPI-IO (this remains to be verified).