Implement reader for data file containing an unknown amount of datasets

Krzmbrzl · December 22, 2023, 5:47pm

I’m currently working on refactoring the reader for Gaussian cube files in order to address https://gitlab.kitware.com/vtk/vtk/-/issues/18725 and a couple of other shortcomings that I have discovered while going through the current implementation in detail.

The problem I am currently facing is that inside a single cube file I can have multiple datasets - each sharing the same voxel positions. So in other words, the cube file defines a 3D voxel grid and for every voxel it can store N datapoints so that in total we end up with N distinct datasets that just happen to be sampled on the same grid.

Logically, it would make sense to me to have the reader declare that it creates a molecular structure on port zero and then from port one on, it provides the respective datasets. Then we would end up with N + 1 output ports for the reader.
However, to my understanding the reader (or any algorithm, really) is expected to declare the amount of input and output ports in its constructor - aka: before having had the chance of actually looking at the data that it is supposed to read. This makes the dynamic output port amount idea impossible (besides, it would probably pose some challenges in wiring things up in VTK’s execution pipeline).

So that leaves the question of what to do instead?
I have searched a bit and stumbled on the vtkPartitionedDataSet class which seems to be meant to group different data sets together, which sounds what I would need (I declare that my reader will generate such a composite data set and then adjust the number of components as needed while reading the data) until I carefully read the accompanying documentation which seems to indicate that this class is meant to group together data that logically belongs to the same data set but which was broken down into smaller pieces for some reason (e.g. parallelization). This is a relation of the individual data that I do not want to suggest for cube files.

Another alternative would be to just make the reader return a table object (similar to the CSV reader), but that just seems super wasteful as it would effectively always require end users to use an additional filter to convert the columns to gridded data before doing something useful with it.

Surely, I can’t be the first one to encounter a situation such as this and therefore I would like to know how this could be resolved in a way to stays in line with the rest of VTK.

dgobbi · December 22, 2023, 6:43pm

Take this will a grain of salt because I’m not familiar with the Gaussian CUBE file format:

If the datasets are “sharing the same voxel positions,” then why not use one dataset with multiple arrays? Each array could be assigned a unique name. To use the dataset, one could call GetOutput(1), make a shallow copy (let’s call it grid_1), and then call grid_1->SetActiveScalars(array_name).

dgobbi · December 22, 2023, 7:00pm

A correction: it would be

grid_1->GetPointData()->SetActiveScalars(array_name)

Also the filter vtkPassSelectedArrays could probably replace the ShallowCopy as a method of creating a new dataset with only the desired array.

Krzmbrzl · December 22, 2023, 7:10pm

So in other words that would mean creating a dataset that consists of a topology and geometry that defines the voxel grid that was used to sample the properties and then having multiple attribute data which are the different kinds of data that have been sampled?

I think that sounds indeed like the thing I was looking for. I’m still not quite used to thinking of a dataset as having sample points and sampled data as two separate entities…

Now I only have to figure out how to actually produce such a data type in my reader… do you happen to know of an existing reader that does something like this? Then I could peek into its implementation.

Let’s for now assume I have a regular voxel grid such that I will be able to store my data as image data using the vtkImageData class. How would I add multiple attributes to it? Presumably by using the SetActiveAttribute function and relying on it creating a new information object for entries that don’t exist yet and then somehow convincing it to hand me an associated data object? Is that the correct direction?

dgobbi · December 22, 2023, 7:43pm

For each attribute, you could build a vtkDataArray, and then call

array->SetName(attribute_name);
output->GetPointData()->AddArray(array);

You would only want to call SetActiveAttribute() as a last step, once you had added all of the arrays. This would be used to specify which attribute would be the “default” for downstream filters and mappers.

In other words, this wouldn’t be implemented like most of the VTK image readers. So no AllocateScalars() or SetPointDataActiveScalarInfo() that the other image readers use.

The classes it would be closest to are the vtkXMLImageDataReader and its superclass vtkXMLStructuredDataReader. Those are the two classes to read for inspiration.

dgobbi · December 22, 2023, 10:06pm

I didn’t respond to this before because I wasn’t sure what you meant at first:

There would still be only one “grid” data object, and only one “gridInfo” information object. That part of the code wouldn’t change. Just multiple arrays.

Krzmbrzl · December 23, 2023, 8:05pm

Thank you very much! I think I was able to piece my code together with the hints that you have provided.