Composite Data Sets for the VTKHDF format

This thread is dedicated to proposing a new design for the VTKHDF format that would describe vtkPartitionedDataSetCollections and vtkMultiBlockDataSets.

The general idea is to have an Assembly group in the file that describes the composite data hierarchy and to have its leaf nodes link to top-level groups that conform to the existing VTKHDF formats, using the symbolic linking mechanisms provided by HDF5 (see Chapter 4: HDF5 Groups and Links).
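To make the idea concrete, here is a minimal h5py sketch of what such a file could look like. The group names (Block0, Block1, GroupA, Mesh, Surface), the attribute values and the placement of the Index attribute are illustrative assumptions, not a final specification:

```python
# Minimal sketch (not the final spec): a hypothetical composite VTKHDF layout.
# Group names and attribute placement are assumptions for illustration only;
# the actual spec may also require fixed-length ASCII strings for attributes.
import h5py

with h5py.File("composite.vtkhdf", "w") as f:
    root = f.create_group("VTKHDF")
    root.attrs["Version"] = [2, 0]
    root.attrs["Type"] = "PartitionedDataSetCollection"

    # Each block is a regular group conforming to an existing VTKHDF format.
    block0 = root.create_group("Block0")
    block0.attrs["Type"] = "UnstructuredGrid"
    block0.attrs["Index"] = 1  # flat index in the composite structure

    block1 = root.create_group("Block1")
    block1.attrs["Type"] = "PolyData"
    block1.attrs["Index"] = 2

    # The Assembly group mirrors the vtkDataAssembly; its leaves are HDF5
    # soft links pointing back at the block groups.
    assembly = root.create_group("Assembly")
    group_a = assembly.create_group("GroupA")
    group_a["Mesh"] = h5py.SoftLink("/VTKHDF/Block0")
    group_a["Surface"] = h5py.SoftLink("/VTKHDF/Block1")
```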

Here is a diagram of what it might look like:

Some details:

  • The Index attributes inform the reader of the index of a given block in the flattened composite structure.
  • Every leaf in the assembly should describe a non-composite data object to avoid the overhead of recursion while reading the file.
  • The assembly structure only needs to be traversed once at the beginning of the reading procedure (and can potentially be read and broadcast only by the main process in a distributed context) to optimize file meta-data reading; a small reading sketch follows this list.
  • The block-wise reading implementation and the composite-level implementation can be managed independently of each other.
  • It would be feasible for each block to have its own time range and time steps in a transient context, with the full composite data set able to collect and expose a combined range and set of time values.
  • Reading performance would scale linearly with the number of blocks even in a distributed context.
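As a complement, here is a hedged sketch of the single traversal mentioned in the list, using the hypothetical layout from the snippet above; it walks the Assembly group once and builds a map from assembly path to flat Index:

```python
# Hedged sketch: traverse the Assembly group once and collect the flat Index
# of every leaf. Assumes the hypothetical layout from the previous snippet.
import h5py

def collect_leaves(group, prefix="", out=None):
    out = {} if out is None else out
    for name, child in group.items():  # soft links are resolved here
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(child, h5py.Group):
            if "Index" in child.attrs:
                out[path] = int(child.attrs["Index"])  # leaf block
            else:
                collect_leaves(child, path, out)       # intermediate node
    return out

with h5py.File("composite.vtkhdf", "r") as f:
    # e.g. {'GroupA/Mesh': 1, 'GroupA/Surface': 2}
    print(collect_leaves(f["VTKHDF/Assembly"]))
```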

Does anyone have any input? What are some of the thoughts in the community on the best way to do this?


@mwestphal @danlipsa @lgivord

FYI @hakostra


Hi, thanks for pinging me.

I am not a user of any of these formats right now. However, I think it’s great that they get implemented in the VTKHDF format, because the VTKHDF format will see a broader audience as more formats are implemented.

The structure seems logical and makes total sense.


Looks good. Some comments:

  • Do you need to duplicate Version in all blocks? I would rather add some code to propagate the Version from the parent and avoid the duplication.
  • Maybe choose a different name for Assembly to avoid confusion with vtkDataAssembly? Hierarchy maybe?
  • Are you going to have a different type for Multiblock and PartitionedDataSetCollection? Probably yes because the indexing scheme is different.

Hi @danlipsa,

Thanks for the comments!

  • Do you need to duplicate Version in all blocks? I would rather add some code to propagate the Version from the parent and avoid the duplication.

I think you are right, we probably would not need the duplicate versioning in every block. We will update the design.

  • Maybe choose a different name for Assembly to avoid confusion with vtkDataAssembly? Hierarchy maybe?

Our idea here was indeed to mirror the vtkDataAssembly, since it actually corresponds to the information that is going to be included in the object. If this is still a blocking point, however, we can rename the top-level block.

  • Are you going to have a different type for Multiblock and PartitionedDataSetCollection? Probably yes because the indexing scheme is different.

At this point, it is still relatively open. We can either have the root Type variable indicate the type one wishes as an output (either vtkPartitionedDataSetCollection or vtkMultiBlockDataSet) or we could accept a generic type name (such as Collection or Composite) and have the type be a property of the reader that we can switch dynamically. Do you have any preference for one or the other?

I do not think it’s a bad idea to have the version tag on all blocks. Do we completely rule out that the storage schemes for the PolyData, unstructured or image data will ever change in the future?

With the original proposal, one could imagine having a composite dataset of unstructured grids where each block has version 1.0, since that was the file format version that included the unstructured grid description, even though the entire file as a whole has version 3.0 (or whatever > 2.0). The per-block version would indicate not just the type (polydata, unstructured, image) but also the storage layout version.

I also do not see any gains in dropping it. We already have writers for the various formats that write the group with the data structures, including the version tag, and not writing it would be more of a complication than just always writing it. Same thing with reading: if you point any of today’s readers (polydata, unstructured, image) at a group in a composite, it will try to read the version tag. Implementing mechanisms to conditionally skip reading those few bytes would, in my opinion, not be worth the potential gains…


Julien Fausty:

  • Do you need to duplicate Version in all blocks? I would rather add some code to propagate the Version from the parent and avoid the duplication.

I think you are right, we probably would not need the duplicate versioning in every block. We will update the design.

  • Do you need to duplicate Version in all blocks? I would rather add some code to propagate the Version from the parent and avoid the duplication.

I do not think it’s a bad idea to have the version tag on all blocks. Do we completely rule out that the storage schemes for the PolyData, unstructured or image data will ever change in the future?

Indeed, the version can change. But the file will get the version of the current writer, and as such there is only one version for the whole file.

With the original proposal, one could imagine having a composite dataset of unstructured grids where each block has version 1.0, since that was the file format version that included the unstructured grid description.

Once the unstructured grid is read into memory, the version is not relevant anymore. That version only applies to the VTKHDF storage.

Even though the entire file as a whole has version 3.0 (or whatever > 2.0), the per-block version would indicate not just the type (polydata, unstructured, image) but also the storage layout version.

I also do not see any gains in dropping it.

Why would it be useful or needed to want to write a file where different blocks are written with different versions of the format?

I think reducing complexity is generally beneficial unless there is a clear reason why we want to make things more complex.

In that case I think you need more info:

  • What you have so far is enough to rebuild the partitioned dataset collection. As far as I understand you’ll always need 2 levels for that tree. Nodes at depth 1 are partitioned datasets (PD) (with several partitions) and nodes at depth 2 are partitions (blocks) linking to the actual data.
  • You’ll need zero or one data assemblies, which is a tree where nodes also have a name attribute, that organizes the PDs in the PD collection.

I prefer to have the type spelled out just because you would lose information if you read a multiblock from a PD collection with a data assembly; similarly, the node names will have to be created if you read a PD collection (with a data assembly) from a multiblock. This might be confusing. You can convert to the data you need using an additional conversion filter.

  • What you have so far is enough to rebuild the partitioned dataset collection. As far as I understand you’ll always need 2 levels for that tree. Nodes at depth 1 are partitioned datasets (PD) (with several partitions) and nodes at depth 2 are partitions (blocks) linking to the actual data.

Conceptually, given the partitioning already present in the PolyData and UnstructuredGrid formats, the blocks can already be treated as partitioned data sets. This is what is reflected in this relatively new development: https://gitlab.kitware.com/vtk/vtk/-/merge_requests/10355. The MergeParts option turns the output into a partitioned data set when deactivated. As such, those two levels are already baked into the format, are well separated into two mechanisms, and conceptually correspond to the division between vtkPartitionedDataSetCollection and vtkPartitionedDataSet.
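For reference, a hedged Python sketch of driving the reader with that option; it assumes the reader class is vtkHDFReader and that MergeParts is exposed through the usual VTK boolean Set/Get macros, so the exact method name should be checked against the MR:

```python
# Hedged sketch: reading a VTKHDF file as partitioned output with the
# MergeParts option from MR !10355. The SetMergeParts name is an assumption
# based on standard VTK Set/Get macros; verify against the reader's API.
from vtkmodules.vtkIOHDF import vtkHDFReader

reader = vtkHDFReader()
reader.SetFileName("data.vtkhdf")
reader.SetMergeParts(False)  # keep partitions separate instead of merging them
reader.Update()

print(reader.GetOutput().GetClassName())  # e.g. vtkPartitionedDataSet
```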

  • You’ll need zero or one data assemblies, which is a tree where nodes also have a name attribute, that organizes the PDs in the PD collection.

Yep, this is indeed what is represented in the diagram above by the special Assembly group. The names associated with each node of the tree are the names of the groups and blocks. This might be unclear in the representation, and any suggestions for making it clearer are welcome.

I prefer to have the type spelled out just because you would lose information if you read a multiblock from a PD collection with a data assembly; similarly, the node names will have to be created if you read a PD collection (with a data assembly) from a multiblock. This might be confusing. You can convert to the data you need using an additional conversion filter.

I am not sure I follow. There is a one-to-one correspondence between multi-block and PD collection, is there not? What information is being lost in the conversion? In the multi-block structure, each block still has a name, right?

I see. That’s great! What do you do for ImageData? Is that also considered a partitioned dataset?

Sounds good. What happens when you save a PD collection that does not have a data assembly? How do you come up with the names for the PDs? Based on index?

I never used names for multiblock. What is the API to set/get names of nodes in the tree?

@danlipsa, sorry for the delay!

What do you do for ImageData? Is that also considered a partitioned dataset?

For now, only the unstructured data can be output as a partitioned dataset given the format of the data. We could imagine rather easily extending this to image data by putting one partition per MPI rank.

What happens when you save a PD collection that does not have a data assembly? How do you come up with the names for the PDs? Based on index?

I would imagine that we would follow whatever convention the vtkPartitionedDataSetCollections have now in other contexts (writers, ParaView menus, etc.). Most likely a name based on the index of the block in the collection, indeed.

What is the API to set/get names of nodes in the tree?

  • someDataObjectTree->GetMetaData(index)->Set(vtkCompositeDataSet::NAME(), name)
  • someDataObjectTree->GetMetaData(index)->Get(vtkCompositeDataSet::NAME())
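For completeness, a small Python usage sketch of that API on a vtkPartitionedDataSetCollection (the block content and the name are placeholders):

```python
# Usage sketch of the name API above, from Python.
from vtkmodules.vtkCommonDataModel import (
    vtkCompositeDataSet,
    vtkPartitionedDataSet,
    vtkPartitionedDataSetCollection,
)

collection = vtkPartitionedDataSetCollection()
collection.SetPartitionedDataSet(0, vtkPartitionedDataSet())

# Attach a name to node 0 and read it back through the metadata object.
collection.GetMetaData(0).Set(vtkCompositeDataSet.NAME(), "collisions")
print(collection.GetMetaData(0).Get(vtkCompositeDataSet.NAME()))
```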

Great. Thanks for the explanations. The design looks good.

Dan


For your information, the merge request adding support for the composite VTKHDF file format is here: https://gitlab.kitware.com/vtk/vtk/-/merge_requests/10747


This is a discussion that came from the MR implementing this feature.

Q:
I have a question about the design of the hierarchy specification. Apparently, the hierarchy is fixed, and each Leaf of the Assembly points to a node that represents a valid VTKHDF root node. So each Leaf could point to a transient UnstructuredGrid.

In my own simulations, the hierarchy is not fixed. For each frame, I have a hierarchy, and in some frames that hierarchy includes some UnstructuredGrids and in others it doesn't include them. For example, if the simulation detects collisions, I will export an UnstructuredGrid representing data about the collisions; if not, I won't export that data for that particular frame.

Is it possible to represent that kind of varying assembly using the current specification?

A:
In fact, it's not really related to the design but to how we implement it.

Currently, if a leaf points to nothing it will be ignored, so your use case is expected to be supported, BUT I haven't tested it, so if that's not the case you can also create a valid empty UG as a workaround.


In the case I describe, for some timestamps I have data, but for others I don't. In the case of collisions, for example, I am exporting a hierarchy for each timestamp. This hierarchy has a node for collisions if in that frame I had collisions; otherwise the hierarchy doesn't have a node for the collisions.

In the design described here, the Assembly is fixed (it is read only once), so I will have to export the collisions node as part of the hierarchy. You say that if the leaf points to nothing, it will be ignored, but in my case the leaf will point to something. At frame 10 (for example) I have collisions, so I will have to fill in data for that timestamp. Then, at frame 20, the collisions are resolved and I don't have any anymore. What should I write in that case in the transient VTKHDF node representing the collisions?

Indeed, in this case it's more related to transient VTKHDF.

Your UnstructuredGrid, with 20 timesteps, will have a NumberOfPoints dataset with a size of 20. So for each timestep you can specify the number of points in this array (in your case, at frame 20, as there is no collision, it will be equal to 0); same for NumberOfCells. You can check this data for example:

HDFWriter_transient_cube.hdf.vtkhdf (40.4 KB)
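To make the suggestion concrete, here is a hedged h5py sketch of those per-timestep size arrays for a transient UnstructuredGrid block with three frames, where the last frame has no collisions; the values are illustrative and should be compared with the attached file and the documentation:

```python
# Hedged sketch: per-timestep geometry sizes for a transient UnstructuredGrid.
# Frame 2 has no collisions, so its counts are zero and nothing is read for it.
import h5py
import numpy as np

with h5py.File("collisions.vtkhdf", "w") as f:
    grid = f.create_group("VTKHDF")
    grid.attrs["Version"] = [2, 0]
    grid.attrs["Type"] = "UnstructuredGrid"

    # One entry per timestep (3 steps here).
    grid.create_dataset("NumberOfPoints", data=np.array([4, 4, 0]))
    grid.create_dataset("NumberOfCells", data=np.array([1, 1, 0]))
    grid.create_dataset("NumberOfConnectivityIds", data=np.array([4, 4, 0]))
```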

I'll test this approach.

Thanks for your work and answers

@lgivord I was checking the approach you proposed, to set the number of points for a particular timestep to 0, but I am not sure how to do it. In the file you sent, you set the NumberOfPoints and NumberOfCells for each timestep, but I can't find the documentation for that.

Hi @Juan_Jose_Casafranca,

In order to save memory space, the logic for transient data and partitioned data is quite the same, which is why the documentation for transient data says:

The general idea is to take the static formats described above and use them as a base to append all the time dependent data. As such, a file holding static data has a very similar structure to a file holding dynamic data.

So in your case, if we take the unstructured grid, the documentation says for partitioned data:

We describe the split into partitions using HDF5 datasets NumberOfConnectivityIds, NumberOfPoints and NumberOfCells. Let n be the number of partitions which usually correspond to the number of the MPI ranks. NumberOfConnectivityIds has size n where NumberOfConnectivityIds[i] represents the size of the Connectivity array for partition i

We can mirror that here: instead of having n ranks, we have n timesteps, and so for the unstructured grid we need to define several arrays of size n:

  • NumberOfConnectivityIds
  • NumberOfPoints
  • NumberOfCells

However, that's not sufficient, which is why we also have another group named Steps.
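As a hedged continuation of the sketch above, a minimal Steps group could look as follows; the attribute and dataset names (NSteps, Values, PointOffsets, CellOffsets, ConnectivityIdOffsets) are taken from the transient documentation, but the exact layout and dataset shapes should be verified against the spec:

```python
# Hedged sketch: a minimal Steps group for the transient UnstructuredGrid
# above. Names follow the transient VTKHDF documentation; verify the exact
# layout (and dataset shapes) against the spec before relying on it.
import h5py
import numpy as np

with h5py.File("collisions.vtkhdf", "a") as f:
    steps = f["VTKHDF"].create_group("Steps")
    steps.attrs["NSteps"] = 3
    steps.create_dataset("Values", data=np.array([0.0, 1.0, 2.0]))  # time values

    # Per-step offsets into the flat Points / Offsets / Connectivity arrays.
    steps.create_dataset("PointOffsets", data=np.array([0, 4, 8]))
    steps.create_dataset("CellOffsets", data=np.array([0, 1, 2]))
    steps.create_dataset("ConnectivityIdOffsets", data=np.array([0, 4, 8]))
```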

Finally, after writing this post, it's clear to me that if someone is only interested in transient data, the documentation isn't really easy to understand; we may want to improve it.

EDIT: I opened an issue here: https://gitlab.kitware.com/vtk/vtk/-/issues/19242

I understand the logic. However, there are a couple of things I don't completely get:

The Offsets in the UG needs to have number of cells + 1 entries. Is this still the case for transient data? In that case, if I have 2 triangles, I would have Offsets = [0, 3, 6] for example. If this is transient, it could be Offsets = [0, 3, 6, 0, 3, 6], and Steps/CellOffsets in that case should be CellOffsets = [0, 3], right?

There is no transient data for the cell type? What happens in case I have a different number of cells? Is the reader going to read them in order? So, for example, if the first step has 3 cells and the second one 4, will it read entries from 0 to 3 in the Types dataset for the first frame and from 3 to 7 for the second step?