VTKHDF proposal: Support splitting datasets in multiple files

Hi all,

Thank you @Louis_Gombert for bringing your thoughts here.

In the introduction, a little clarification:

I admit I don’t see how the mechanism in place today guarantees any particular level of read performance: much depends on the size of the simulation data and on the chosen chunk size.

Re-reads (even at an offset) are avoided by activating the cache mechanism.
As I understand it, the cached data is used directly, so the memory cost is not doubled.
The overhead is therefore zero in the case where there is a single partition per server.
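To make the point above concrete, here is a minimal stdlib sketch of such a cache (an illustration of the idea, not the actual VTKHDF reader implementation; the `ReadCache` name and `(dataset, offset, count)` key are my assumptions): a repeated read with the same key returns the same buffer object, so nothing is re-read and nothing is duplicated in memory.

```python
# Hedged sketch, NOT the real reader: a read cache keyed by
# (dataset, offset, count). A cache hit returns the same buffer
# object, so re-reads are avoided and memory is not duplicated.
class ReadCache:
    def __init__(self, read_fn):
        self.read_fn = read_fn   # actual I/O, called only on a miss
        self.store = {}

    def read(self, dataset, offset, count):
        key = (dataset, offset, count)
        if key not in self.store:
            self.store[key] = self.read_fn(dataset, offset, count)
        return self.store[key]   # same object: zero extra memory cost


calls = []

def fake_io(dataset, offset, count):
    # Stand-in for an HDF5 hyperslab read.
    calls.append(dataset)
    return list(range(offset, offset + count))


cache = ReadCache(fake_io)
a = cache.read("points", 0, 8)
b = cache.read("points", 0, 8)   # re-read avoided: cache hit
assert a is b and len(calls) == 1
```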

But this is no longer true once the reader's Merge mode is activated.
Indeed, the (successive) application of the vtkAppendDataSet filter creates a new mesh with its own data, resulting in a (quasi) duplication of the data.
Furthermore, it has been observed that adding or removing a value field (on points or cells) at load time again triggers the costly application of the vtkAppendDataSet filter, and the same happens even when the mesh does not evolve across timesteps. This will require intervention in the future (use the force… thanks to the wonderful virtual arrays).
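The contrast between eager merging and virtual arrays can be sketched in plain Python (an illustration only; `VirtualConcat` is a hypothetical name, not a VTK class): appending copies every value into a new buffer, roughly doubling peak memory, whereas a virtual view maps a global index onto the existing partition buffers without copying.

```python
from array import array

# Eager merge, as an append-style filter does: every value is copied
# into a new buffer, (quasi) duplicating the memory footprint.
partitions = [array("d", range(1000)) for _ in range(4)]

merged = array("d")
for p in partitions:
    merged.extend(p)          # new buffer holds a copy of all values


class VirtualConcat:
    """Hypothetical 'virtual array': exposes the partitions through an
    index mapping, without copying the underlying buffers."""

    def __init__(self, parts):
        self.parts = parts
        self.offsets = []
        total = 0
        for p in parts:
            self.offsets.append(total)
            total += len(p)
        self.length = total

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        # Linear scan kept simple here; a real implementation would
        # bisect over the offsets.
        for part, off in zip(self.parts, self.offsets):
            if i < off + len(part):
                return part[i - off]
        raise IndexError(i)


view = VirtualConcat(partitions)
assert len(view) == len(merged)
assert view[1500] == merged[1500]   # same values, no duplication
```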

First feedback:

I have just read your proposal, which has the advantage of not impacting the behavior of the current reader.

I find the proposal very interesting with the additional information from @hakostra.

However, it seems to me that in an HPC context (a large simulation with many partitions), the write performance for one simulation time will be limited by the number of files, which is fixed, if I understand correctly, by the number of blocks in my hierarchical MBDS description.

Unless I’m mistaken, this proposal does not allow me to:

  • write in parallel to different files to output a first simulation time (1);
  • then append the following simulation times to these same files (2).

(1) which a priori should improve write times
(2) which limits the number of files, consistent with the need to limit inodes on a supercomputer
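The desired I/O pattern of (1) and (2) can be sketched with plain files standing in for HDF5 (everything here is illustrative: the `part_*.vtkhdf` names and the rank-to-file mapping are my assumptions, not part of the proposal): a fixed set of files is created at the first simulation time and subsequent times are appended, so the file count stays constant however many timesteps are written.

```python
import os
import tempfile

# Sketch of the write pattern, with plain text files standing in for
# HDF5 containers. Names and layout are illustrative assumptions.
def write_timestep(outdir, n_files, t, partitions):
    """Appends timestep t to a fixed set of n_files files; each file
    receives the partitions mapped to it (round-robin here)."""
    for f in range(n_files):
        path = os.path.join(outdir, f"part_{f}.vtkhdf")
        # "a" creates the file at t=0 and appends afterwards,
        # so no new files (inodes) appear at later timesteps.
        with open(path, "a") as fh:
            for rank, data in enumerate(partitions):
                if rank % n_files == f:
                    fh.write(f"t={t} rank={rank} values={data}\n")


outdir = tempfile.mkdtemp()
parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
for t in range(3):            # (1) one file set at t=0, (2) appended after
    write_timestep(outdir, n_files=2, t=t, partitions=parts)

files = sorted(os.listdir(outdir))
# 2 files total regardless of the number of timesteps
assert files == ["part_0.vtkhdf", "part_1.vtkhdf"]
with open(os.path.join(outdir, "part_0.vtkhdf")) as fh:
    lines = fh.read().splitlines()
assert len(lines) == 6        # 2 partitions x 3 timesteps
```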

Finally, I find @Louis_Gombert’s UseExternalTimesteps proposal very satisfactory… all the more so if we can associate with it the global FieldData specific to the simulation, such as the cycle number, the simulation time-step size, and other characteristics.
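For instance, the per-timestep global metadata I have in mind could look like the following (a sketch of the kind of data to carry, not a proposed VTKHDF schema; the key names are my assumptions):

```python
import json

# Illustrative per-timestep global FieldData: cycle number and
# time-step size recorded alongside each simulation time.
field_data = {
    "timesteps": [
        {"time": 0.0,  "cycle": 0, "dt": 1e-3},
        {"time": 1e-3, "cycle": 1, "dt": 1e-3},
        {"time": 3e-3, "cycle": 3, "dt": 2e-3},  # dt may vary per cycle
    ]
}

# Round-trips cleanly, so it could be stored as attributes or a
# small dataset next to the field arrays.
restored = json.loads(json.dumps(field_data))
assert restored["timesteps"][2]["cycle"] == 3
```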