HyperTreeGrid support in VTKHDF

The overlappingAMR data structure was recently added to the VTKHDF specification; now it is time for its cousin, the HyperTreeGrid (HTG for short), to be specified as well.

HTG is a compact, memory-efficient tree-based AMR data structure that has been part of VTK for more than a decade. Many HTG-specific algorithms have been created, using a cursor mechanism coupled with tree iteration to traverse it. HTGs can be stored in the VTK XML format using the .htg extension; the parallel .phtg implementation is currently in a non-functioning state.

This post proposes a specification for HTG in the up-and-coming VTKHDF format, which will allow efficient distributed writing and reading, temporal support, and composite structures, stored in a single file or split across multiple files.

The general structure of the file would look like this:

GROUP "VTKHDF"
    ATTRIBUTE "Version"      // Updated to (2,4)
    ATTRIBUTE "Type"         // "HyperTreeGrid" in this case

    // HTG-specific properties
    ATTRIBUTE "BranchFactor"            // 2 or 3, cell division factor
    ATTRIBUTE "Dimensions"              // 3-Vector
    ATTRIBUTE "InterfaceInterceptsName" // String referencing a cell data array
    ATTRIBUTE "InterfaceNormalsName"    // String referencing a cell data array
    ATTRIBUTE "TransposedRootIndexing"  // Bool, true if the indexing mode of the grid is inverted

    // Grid point coordinates
    DATASET "XCoordinates"
    DATASET "YCoordinates"
    DATASET "ZCoordinates"

    // HTG Specific fields
    DATASET "NumberOfTrees"             // One entry for each distributed part
    DATASET "DescriptorsSize"           // One entry for each distributed part
    DATASET "DepthPerTree"              // size = sum(NumberOfTrees), maximum depth for each tree
    DATASET "Descriptors"               // Packed bit array, size = sum(DescriptorsSize)
    DATASET "NumberOfCellsPerDepth"     // size = sum(DepthPerTree), number of cells for each depth of each tree
    DATASET "TreeIds"                   // size = sum(NumberOfTrees)
    DATASET "Mask"                      // Packed bit array, size = sum(NumberOfCellsPerDepth)

    GROUP "FieldData"
        ...
    GROUP "CellData"
        ...

If you’re familiar with the .htg v2 format, this is mostly a mapping of its content, with the addition of a few offset fields to support efficient distributed reading.
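As a concrete illustration, a minimal writer for this layout could look like the h5py sketch below. This is only a sketch under assumptions: the file name, coordinate values, and attribute dtypes are made up for a hypothetical 2×2 grid of unrefined trees, and the exact HDF5 types the eventual reader will expect are not settled by this proposal.

```python
import h5py
import numpy as np

# Hypothetical example: a single (non-distributed) part holding a
# 2x2 grid of root trees, none of which is refined.
with h5py.File("htg.vtkhdf", "w") as fh:
    root = fh.create_group("VTKHDF")
    root.attrs["Version"] = np.array([2, 4], dtype=np.int64)
    root.attrs["Type"] = np.bytes_("HyperTreeGrid")
    root.attrs["BranchFactor"] = np.int64(2)
    root.attrs["Dimensions"] = np.array([3, 3, 1], dtype=np.int64)  # points per axis
    root.attrs["TransposedRootIndexing"] = np.int8(0)

    # Grid point coordinates, one array per axis
    root.create_dataset("XCoordinates", data=np.array([0.0, 1.0, 2.0]))
    root.create_dataset("YCoordinates", data=np.array([0.0, 1.0, 2.0]))
    root.create_dataset("ZCoordinates", data=np.array([0.0]))

    # HTG-specific fields: 4 unrefined trees, so no descriptor bits
    root.create_dataset("NumberOfTrees", data=np.array([4], dtype=np.int64))
    root.create_dataset("DescriptorsSize", data=np.array([0], dtype=np.int64))
    root.create_dataset("DepthPerTree", data=np.full(4, 1, dtype=np.int64))
    root.create_dataset("Descriptors", data=np.empty(0, dtype=np.uint8))
    root.create_dataset("NumberOfCellsPerDepth", data=np.full(4, 1, dtype=np.int64))
    root.create_dataset("TreeIds", data=np.arange(4, dtype=np.int64))
```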

Misc. Notes:

  • “NumberOfTrees” has one value per part of the distributed dataset; it is optional for non-distributed data or when every part has the same number of trees.
  • “DescriptorsSize” is required for distributed datasets and is used as an offset array into the “Descriptors” bit set. Without it, each rank would need to process all of the previous trees to compute its read offset into the descriptors.
  • “TreeIds” must contain every value from 0 to N-1, in any order. This allows distributing trees across parts.
  • “Mask” is optional, and omitted if no cell is masked.
  • “NumberOfCellsPerDepth” is renamed from “NumberOfVerticesPerDepth” in the XML format. It allows for faster reading when limiting the reading depth: the reader can compute the number of cells to skip for a given tree using this value together with “DepthPerTree”.
  • There is no “PointData” for the HTG structure, because we only consider cells.
  • For temporal data, offsets for NumberOfTrees, Descriptors, NumberOfCellsPerDepth and NumberOfCells will be required in the “Steps” group, similarly to the other types of VTKHDF datasets. This way, the reader can pick any time step without pre-computation.
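To make the offset logic above concrete, here is a small sketch (with made-up part sizes) of how each rank could derive its read offsets from “DescriptorsSize” and “NumberOfTrees” using exclusive prefix sums, without touching any other rank’s data:

```python
import numpy as np

# Hypothetical metadata for a dataset split into 3 parts (ranks)
descriptors_size = np.array([9, 9, 1])   # bits of "Descriptors" per part
number_of_trees  = np.array([2, 1, 1])   # trees per part

# Exclusive prefix sums give each rank its starting offset directly
descriptor_bit_offset = np.concatenate(([0], np.cumsum(descriptors_size)[:-1]))
tree_offset           = np.concatenate(([0], np.cumsum(number_of_trees)[:-1]))

print(descriptor_bit_offset)  # [ 0  9 18]
print(tree_offset)            # [0 2 3]
```

Rank `i` then reads `descriptors_size[i]` bits starting at `descriptor_bit_offset[i]`, which is exactly what “DescriptorsSize” makes possible without scanning the previous trees.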

Any comment or suggestion is welcome!

cc @mwestphal @Charles_Gueunet @hakostra @lgivord

Nice writeup.

Could you clarify whether distributed data is supported in this VTKHDF specification?

It totally is. The sentence you quoted is about the current XML PHTG data format, which I could not get to work properly.

There are two datasets in the proposal, Descriptors and Mask that are described as packed bit arrays.

My experience with packed bit arrays (in my interpretation, a datatype holding bool 0/1 values with 1 bit of storage per value, rounded up to the nearest byte in total length) is that different languages/frameworks handle them differently.

Would the following Python example give the desired encoding?

import h5py as h5
import numpy as np

bitlist = np.zeros(100, dtype=np.uint8)

with h5.File('bitfield.h5', 'w') as fh:
    bitfield = np.packbits(bitlist)
    fh.create_dataset('bitfield', data=bitfield)

100 bool elements in the “bitlist” (in which 0 = false, any other value = true) occupy 12.5 bytes and the dataset in the file therefore ends up being 13 elements of uint8 type.

Or is this not what you expect?

I know vtkBitArray is encoded such that each byte stores eight bool values, but not every programming language or framework has such a class/datatype natively. I think one of the great advantages of VTKHDF is that it is relatively easy to write custom writer functions/methods/classes/tools from applications in various languages (Python, C, C++, Fortran, …) without relying on the VTK library itself. Writing these packed bit arrays needs to be possible without too much manual encoding work in all of these languages.

Hey Håkon, thanks for your feedback. I did not share it, but that is exactly what I’m doing in my test writer:

    import numpy as np

    # "root" is the "VTKHDF" h5py group created earlier in the writer
    descriptors = (
          # Tree 0
          1,  # Depth 1
            0, 0, 1, 0, 0, 0, 0, 0, # Depth 2
          # Tree 1
          1, # Depth 1
            0, 1, 0, 0, 0, 0, 0, 0, # Depth 2
          # Tree 3 : refined
          1,
          # Tree 4 : not refined
    )
    packed_descriptors = np.packbits(descriptors)
    root.create_dataset("Descriptors", data=packed_descriptors)

After a quick survey of bit-packing techniques, using uint8 seems like a safe bet.
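On the read side, the counterpart would be `np.unpackbits`, and its `count` argument is what makes the exact-bit-count datasets in the proposal useful: without knowing how many bits are meaningful (from “DescriptorsSize” or “NumberOfCellsPerDepth”), the reader cannot distinguish real trailing zeros from pad bits. A small round-trip sketch:

```python
import numpy as np

bits = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1], dtype=np.uint8)  # 10 bits

packed = np.packbits(bits)                          # MSB-first, padded to 2 bytes
restored = np.unpackbits(packed, count=len(bits))   # count= drops the 6 pad bits

assert np.array_equal(bits, restored)
print(packed)  # [144  64]
```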

Seems like a reasonable approach. I asked because the HDF5 library has some mention of bitfields in the manual:

https://support.hdfgroup.org/documentation/hdf5/latest/_h5_t__u_g.html#subsubsec_datatype_other_bitfield

but I do not know exactly what features the library offers, and for instance which of them are exposed and available from important packages like h5py.

I think your suggestion of manually packing the bitfield is sensible because you can achieve it easily in any language that allows bit-manipulation of integers.
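To back that up, the manual packing is indeed only a few lines even without numpy. A pure-Python sketch (the helper name `pack_bits` is made up here) that matches numpy’s default MSB-first `packbits` convention:

```python
def pack_bits(bits):
    """Pack an iterable of 0/1 values into bytes, MSB-first,
    matching numpy's default np.packbits bit order."""
    out = bytearray()
    for i, b in enumerate(bits):
        if i % 8 == 0:
            out.append(0)          # start a new byte every 8 bits
        if b:
            out[-1] |= 1 << (7 - i % 8)  # set the bit, MSB first
    return bytes(out)

print(pack_bits([1, 0, 0, 1, 0, 0, 0, 0, 0, 1]).hex())  # "9040"
```

The same loop translates directly to C, Fortran, or any language with integer bit operations.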
