About inlined multiblock and partitioned dataset XML data format

mwestphal · January 19, 2023, 5:25pm

Currently, .vtm (multiblock), .vtpd (partitioned dataset) and .vtpc (partitioned dataset collection) only support file pointers.

For example:

<?xml version="1.0"?>
<VTKFile type="vtkPartitionedDataSet" version="1.0" byte_order="LittleEndian" header_type="UInt32" compressor="vtkZLibDataCompressor">
  <vtkPartitionedDataSet>
    <DataSet index="0" file="testxmlpartds/testxmlpartds_0.vti"/>
    <DataSet index="1" file="testxmlpartds/testxmlpartds_1.vti"/>
  </vtkPartitionedDataSet>
</VTKFile>

This .vtpd file does not contain actual data, only metadata about the partitioned dataset and a pointer to each partition, using the <DataSet file="relative/path/to/file.ext"/> XML syntax.

This is very useful in many use cases: it makes it easy to inspect individual files when needed, and it is well suited to scaling up, reusing shared files and distributed reading in an HPC context.
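
To illustrate how little such a file holds, here is a minimal Python sketch that lists the pointers using only the standard library. The function name `referenced_files` is made up for this example, and resolving the paths relative to the directory of the .vtpd file is an assumption here.

```python
import os
import xml.etree.ElementTree as ET


def referenced_files(meta_path):
    """List the files pointed to by <DataSet file="..."/> elements.

    Hypothetical helper: paths are assumed to be relative to the
    directory containing the .vtm/.vtpd/.vtpc file.
    """
    base_dir = os.path.dirname(os.path.abspath(meta_path))
    root = ET.parse(meta_path).getroot()
    # iter() walks the whole tree, so nested blocks are covered too.
    return [
        os.path.join(base_dir, ds.get("file"))
        for ds in root.iter("DataSet")
        if ds.get("file") is not None
    ]
```

Calling `referenced_files("testxmlpartds.vtpd")` on the example above would return two paths, one per partition. None of the referenced files is opened.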

However, there are two cases where it can cause issues:

- When sharing a file, one needs to (think about and then) create an archive containing all the needed files. We can see that this is often forgotten by beginners.
- When reading highly distributed files in serial, the overhead of opening and closing each file can add up to something really impactful.

One solution to consider is to allow inlining data directly in the .vtm/.vtpc/.vtpd file, like this:

<?xml version="1.0"?>
<VTKFile type="vtkPartitionedDataSet" version="1.0" byte_order="LittleEndian" header_type="UInt32" compressor="vtkZLibDataCompressor">
  <vtkPartitionedDataSet>
    <DataSet index="0" inlined="true">
      <ImageData WholeExtent="0 10 0 10 0 5" Origin="0 0 0" Spacing="1 1 1">
        <Piece Extent="0 10 0 10 0 5">
          <PointData Scalars="RTData">
            <DataArray type="Float32" Name="RTData" format="appended" RangeMin="-16.577068329" RangeMax="260" offset="0"/>
          </PointData>
          <CellData>
          </CellData>
        </Piece>
      </ImageData>
    </DataSet>
    <DataSet index="1" inlined="true">
    ...
    </DataSet>
  </vtkPartitionedDataSet>
  <AppendedData encoding="base64">
   _AQAAAACAAABYCwAAcAoAAA==...
  </AppendedData>
  <AppendedData encoding="base64">
   ...
  </AppendedData>
</VTKFile>

Note the appended binary data, in a separate XML block for each inlined dataset.

Of course, this file would not be optimized for distributed reading, but that is not the objective here.

One could argue that the same logic could be applied to .pvtx files, but these files are dedicated to distributed datasets and use a different syntax, parsed in a different part of the VTK XML code, so this seems not very useful and outside the scope of this proposal.

Please share your thoughts.

Jacques-Bernard · February 2, 2023, 12:19pm

The current approach requires creating as many files as there are meshes, plus the multiblock/partitioned dataset file itself.
Some simulations opt for many small meshes (many thousands), each representing a part of the simulation, so that they can be isolated easily during visualization and analysis thanks to the MultiBlock Inspector.
Let’s take the case of an HPC simulation which, after post-processing, produces a single results file that is nevertheless split by building element: walls, doors, windows, rooms…
In an HPC context, this explosion in the number of files, and therefore of inodes, is a problem.

We readily admit that when our codes (or those of our collaborators) write VTK XML, they do so in ASCII. This does not require linking with VTK or building a VTK representation before saving it.
Of course, we appreciate the binary rewrite offered through ParaView.
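
As an illustration of what writing ASCII VTK XML without linking to VTK can look like, here is a minimal Python sketch for a regular grid with one point-data array. The function name `write_ascii_vti` and its parameters are made up for this example, and the array name is written as-is, without XML escaping.

```python
def write_ascii_vti(path, dims, values, name="values",
                    origin=(0.0, 0.0, 0.0), spacing=(1.0, 1.0, 1.0)):
    """Write one scalar point array as an ASCII VTK XML ImageData file.

    dims is the number of points along x, y and z; values is a flat
    sequence with x varying fastest. Hypothetical helper, no VTK needed.
    """
    nx, ny, nz = dims
    if len(values) != nx * ny * nz:
        raise ValueError(f"expected {nx * ny * nz} values, got {len(values)}")
    extent = f"0 {nx - 1} 0 {ny - 1} 0 {nz - 1}"
    origin_s = " ".join(f"{c:g}" for c in origin)
    spacing_s = " ".join(f"{c:g}" for c in spacing)
    data = " ".join(repr(float(v)) for v in values)
    lines = [
        '<?xml version="1.0"?>',
        '<VTKFile type="ImageData" version="0.1" byte_order="LittleEndian">',
        f'  <ImageData WholeExtent="{extent}" Origin="{origin_s}" Spacing="{spacing_s}">',
        f'    <Piece Extent="{extent}">',
        f'      <PointData Scalars="{name}">',
        f'        <DataArray type="Float32" Name="{name}" format="ascii">',
        f'          {data}',
        '        </DataArray>',
        '      </PointData>',
        '    </Piece>',
        '  </ImageData>',
        '</VTKFile>',
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

For instance, `write_ascii_vti("block_0.vti", (2, 3, 1), [0, 1, 2, 3, 4, 5])` writes a 2 x 3 x 1 grid of points using plain string formatting only.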

This is why we would be very interested in being able to write the content of our simulation in a single VTK XML file in ASCII.

We would rather have imagined replacing the line:

    <DataSet index="0" file="testxmlpartds/testxmlpartds_0.vti"/>

with the contents of that file, and doing the same for the next entry:

    <DataSet index="1" file="testxmlpartds/testxmlpartds_1.vti"/>

Contrary to what you suggest, the AppendedData tag would then not appear at such a high level in the description.
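
To make the idea concrete, here is a rough Python sketch of that substitution for the ASCII case. It is only an illustration of a proposed layout: no existing VTK reader is claimed to understand the result. The function name `inline_ascii` is made up, the `inlined="true"` attribute is borrowed from the example in the first post, and paths are assumed to be relative to the directory of the composite file.

```python
import os
import xml.etree.ElementTree as ET


def inline_ascii(meta_path, out_path):
    """Replace each <DataSet file="..."/> pointer with the file's content.

    Hypothetical sketch of the proposed layout, ASCII/inline data only.
    """
    base_dir = os.path.dirname(os.path.abspath(meta_path))
    tree = ET.parse(meta_path)
    # Take a snapshot first, since the tree is modified inside the loop.
    for ds in list(tree.getroot().iter("DataSet")):
        rel = ds.attrib.pop("file", None)
        if rel is None:
            continue  # nothing to inline for this entry
        piece_root = ET.parse(os.path.join(base_dir, rel)).getroot()
        if piece_root.find("AppendedData") is not None:
            # Appended binary data lives outside the dataset element,
            # which is exactly the case this sketch does not cover.
            raise ValueError(f"{rel} uses appended data, cannot inline")
        # Move the children of <VTKFile> (e.g. <ImageData>) under <DataSet>.
        for child in piece_root:
            ds.append(child)
        ds.set("inlined", "true")
    tree.write(out_path, xml_declaration=True, encoding="utf-8")
```

With this layout, each mesh description stays under its own DataSet element, and no AppendedData block is involved.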

mwestphal · February 2, 2023, 1:43pm

What you describe is exactly how it would look in ASCII mode. The appended section is needed in binary mode.

Jacques-Bernard · February 2, 2023, 2:00pm

Couldn’t that stay within the description of each mesh (formerly in a .vti)?

mwestphal · February 2, 2023, 2:05pm

You can take a look at a binary .vti file: it already contains an appended section.

Jacques-Bernard · February 2, 2023, 2:11pm

Precisely… if we copy the content in place of the link given by a file name, it should not end up outside this section.

mwestphal · February 2, 2023, 2:16pm

I don't think we want to mix the binary appended part inside the XML part though.

Sebastien_Jourdain · February 3, 2023, 4:39pm

In general I would not try to monkey with the XML readers/writers but investigate if we could rather leverage other formats (Adios, fides, …) to achieve that merge goal while keeping it efficient.

Jacques-Bernard · February 6, 2023, 7:46am

Sébastien, if you know an ASCII file format other than VTK XML ASCII that is understood by VTK/PV, we could take a closer look.
Of course, this need for simplicity through a well-described ASCII format also concerns research codes written by numerical analysts or physicists, whether well established or trainees / PhD students. As for the inevitable loss of I/O efficiency, in these cases it is really not their primary concern. These are not codes or uses that are intended to go into “production”.

mwestphal · February 6, 2023, 7:59am

After giving it some thought, here is what I think about this proposal.

1. Trying to improve a format by considering ASCII first is not the right way to go. ASCII will always be slower than anything else and should not be the main driver of a mixed ASCII/binary format like VTKXML.
2. VTKXML is an old codebase and there have been lots of issues when adding specific use cases and such. At this point, we could almost treat VTKXML the way we treat vtkLegacy: only add fixes and avoid adding features.
3. While ASCII is nice for prototyping, at the end of the day we need another container that is more efficient, more standardized and less error-prone.

That is why I think this proposal is not a great idea and this specific use case (handling inline or out-of-file composite data) should be supported by vtkHDF.

Well, actually, it already is, as this feature is built into HDF5! However, this format is not complete yet and not all types of VTK datasets are supported. Here is the current state: https://vtk.org/doc/nightly/html/VTKHDFFileFormat.html

I know that this proposal is about ASCII but HDF5 is also well known by many communities, research and industrial. Investing money, time and effort in such a format will be beneficial for all in the long run.

Timothee_Chabat · February 7, 2023, 9:19am

My thought is that it depends on what the focus is. For me, the only advantage of inlined XML is readability: people can “easily” (more or less) develop their own reader/writer if they want to, and potentially read the file with their own eyes.

If the focus is on performance and ease of use, then I’d go with vtkHDF: a much better format, with a more recent and efficient code base.