New `ProcessIds` data attribute for distributed algorithms

charly_bollinger · May 5, 2023, 8:37am

Why

When dealing with massive datasets, you need to use distributed algorithms to parallelize post-processing. When doing so, the data is distributed across multiple machines and, as such, each process has no knowledge of the data on other processes. For some types of algorithms, neighborhood information is needed to correctly perform processing operations and the concept of “ghost” data appears.

Ghost data is duplicate data that exists on several processes simultaneously in order to provide neighboring information at partition interfaces when needed. One cell might have several ghost instances but there will always be one process that will conserve its ownership and be able to uniquely update its value on all other processes.

When applying changes in the pipeline, it would be useful to know which process owns which ghost entity so as to enable optimized asynchronous communication during data synchronization.

Use case

Let’s say we have some massive dataset in a pipeline needing ghost cells.
When there is a pipeline change, data associated with the ghost cells can be out of date.

Currently, the only solution is to regenerate ghost cells → that’s expensive.
With better knowledge of the distribution of ghost data, we could just update ghost cell data by making requests directly to the processes that own them → much cheaper.

How

The idea behind this feature request is to add a supplemental data attribute that would store the owner processIDs for each geometrical entity.

As such, the proposal is to add a PROCESSIDS data array reference in the vtkDataSetAttributes class dedicated to pointing to this special data array. Similarly to the SCALARS or VECTORS attributes, vtkDataArray* vtkDataSetAttributes::GetAttribute(int attributeType) would be able to return the array using the new attribute type added to the enum (link).

Filters generating arrays like GenerateIDs could be augmented with ProcessIds array generation and filters generating this array like vtkProcessIdScalars would be updated to use this new attribute. Filters needing this array could generate it dynamically.

Feedback

Any feedback about how this should be done would be appreciated. This is just an overall idea of the solution. Other approaches could be used if justified. Thank you for your feedback! =)

charly_bollinger · May 5, 2023, 8:45am

FYI @Jacques-Bernard

dcthomp · May 13, 2023, 1:51am

Does this really need its own vtkDataSetAttributes? Or should we add something to vtkFieldData that lets us mark arrays with a role?

The argument for adding a vtkDataSetAttributes instance is that there will be multiple arrays that all share the same number of tuples.

The argument for adding a role to vtkFieldData is that there will only be a single array with the same number of tuples; in that case, simply storing a map from a vtkStringToken instance to a set of arrays held by the map’s vtkFieldData parent would suffice (possibly with a second map for reverse lookups).

Questions I still have:

Will there be a need for both point PROCESSID and cell PROCESSID arrays? (i.e., should it really be POINT_PROCESSID and CELL_PROCESSID?)
What wiring will need to be done so filters pass these arrays properly? Obviously, filters that don’t touch the corresponding points/cells should just copy the data, but what facilities should we make available for filters that use vtkDataSetAttributes’s methods to copy tuples?

mwestphal · May 15, 2023, 8:36am

Not sure to follow, it does need to be its own AttributesType, yes, which already supports both point and cell unless I’m missing something.

What wiring will need to be done so filters pass these arrays properly?

None, its just an array like any other.

dcthomp · May 15, 2023, 6:38pm

I think I have misunderstood the proposal… I read it as wanting to add a new instance of vtkDataSetAttributes to each vtkDataObject rather than adding a new array role to the vtkDataSetAttributes class.

If that’s the case, my only hesitation is that now users will expect process IDs to be present or easily inferred. If you are proposing writing a general-purpose filter to create process IDs, that is great (although covering poly-data, unstructured grids, composite data, tables, etc. could be a challenge).

mwestphal · May 15, 2023, 7:07pm

if you are proposing writing a general-purpose filter to create process IDs

Yes, it already exists btw. But you have to know the name of the array to use it.

cory.quammen · May 15, 2023, 8:58pm

In my experience, attributes add a lot of pain to VTK. I thought we were aiming to move away from attributes, but this proposal adds another.

mwestphal · May 15, 2023, 9:42pm

We are moving away from SetScalars() and such, but not from attributes afaik.

Edit: nevermind, it is indeed the same concept. What would be the alternative though ?

jfausty · May 16, 2023, 7:36am

In my experience, attributes add a lot of pain to VTK. I thought we were aiming to move away from attributes, but this proposal adds another

I would agree in the context of VECTORS and SCALARS that the API isn’t great since at any given time there is more likely more than one array that could be set in these privileged slots. However, process ids should remain a unique array (for a vtkDataSetAttributes object). With this in mind I don’t find the API too painful.

Is there any other way of doing something similar that I am missing?

charly_bollinger · May 16, 2023, 7:45am

Yes, that’s the proposal! Sorry if I wasn’t clear enough

cory.quammen · May 16, 2023, 2:03pm

The only alternative implementation I’m aware of where an array has special status like we are talking about here are ghost points and cells. We don’t have a ghost attribute per se in a vtkDataSetAttributes but instead designate a fixed array name to indicate the array contains ghost information.

Honestly, I’m not sure if that is better or worse than the attributes approach. I guess my beef with attributes is limited to designating “the” scalars or “the” vectors in dataset, which used to be the way for filters to figure out which arrays they should use to process data. It’s a dated concept that we no longer have to use because we can specify which input arrays to process with SetInputArraysToProces(...).

I’m coming around to using attributes in this case I think the criteria for an attribute should be that there is indeed a unique array that provides the attribute, e.e. GLOBALIDS, PEDIGREEIDS, and in this proposal PROCESS*ID.

We should, however, aim to limit the reliance on Scalars, Vectors, Normals, TCoords, Tensors, and Tangents as attributes IMO. Fully removing them without breaking a bunch of applications may be impossible, however, but we can trend in that direction.

dcthomp · May 16, 2023, 2:09pm

I do think it is useful to mark arrays with roles, but am not sure whether the container should hold the role or the array itself.

For roles that must be unique, it makes sense for the container to ensure there is only one such array. However, for non-unique roles (texture coordinates, etc.), it seems more useful for the array itself to store and propagate this information.

Yohann_Bearzi · May 16, 2023, 5:22pm

@charly_bollinger In which context do you intend to use these process ids to synchronize data? Do you intend to make vtkGhostCellsGenerator use this array if present to move around cell and point data? Or do you intend to use it within a few filters that might require synchronization, and just write a utility synchronizing ghosts? (It should possibly also update the point positions if a filter moves points upstream, assuming that the topology hasn’t changed).

Usually, it doesn’t matter if all the data across ranks is not 100% similar, as long as the amount of required ghost layers to process something are similar across ranks. So if someone worries about not having to recompute ghosts all the time, he needs to compute the required number of ghosts so at the end of the pipeline, all non ghost data has the correct value. If you have a vtkGradient in the pipeline, you should have one layer. If you cascade 2 vtkGradient, you need 2 layers.

In theory there is a key information to propagate the number of ghosts to compute from the source (UPDATE_NUMBER_OF_GHOST_LEVELS()) that should propagate upstream to tell the source how many layers of ghosts are needed downstream. I don’t believe this key is really used, and possibly it would be nice to be able from vtkStreamingDemandDrivenPipeline to invoke vtkGhostCellsGenerator when the source hasn’t generated the requested ghosts. It would let the user not have to worry about how many ghost each filter requires.

Nevertheless, I think that a new assigned array or vtkDataSetAttributes that stores process ids won’t hurt. I’m all for it.

Yohann_Bearzi · May 16, 2023, 5:27pm

@charly_bollinger I have another remark. Ghosts can be computed within one process if it has vtkPartitionedDataSet. Ghosts will be generated across partitions. Do you think it is not worth holding this kind of information because we are within the same rank and won’t need MPI to synchronize, or would knowing that partition id (which could be global) help speed up the kind of filters you had in mind?

jfausty · May 17, 2023, 2:39pm

@Yohann_Bearzi

Do you intend to make vtkGhostCellsGenerator use this array if present to move around cell and point data? Or do you intend to use it within a few filters that might require synchronization, and just write a utility synchronizing ghosts? (It should possibly also update the point positions if a filter moves points upstream, assuming that the topology hasn’t changed).

The idea is definitely to have this array generated when generating ghosts so as to make data synchronization across processes more efficient when needed. So while perhaps not being a direct input to the vtkGhostCellGenerator (since if we don’t have ghosts, we don’t need this array and if we have ghosts, we don’t need to generate them) it will definitely be useful for filters in need of synchronization down the line. As for the exact software architecture to associate to the synchronization routine, we haven’t started designing it yet but any suggestions are welcome!

Ghosts can be computed within one process if it has vtkPartitionedDataSet. Ghosts will be generated across partitions. Do you think it is not worth holding this kind of information because we are within the same rank and won’t need MPI to synchronize, or would knowing that partition id (which could be global) help speed up the kind of filters you had in mind?

This is a great remark. I think that we should generally treat the partitions on-rank and the partitions off-rank the same way to avoid breaking the vtkPartionnedDataSet concept. To this end I would advocate for generating ghost information even between partitions on the same rank even if this might lead to worse performance. This should simplify the distributed algorithms at first. If we identify any bottlenecks, we should be able to take this information into account. The workaround of using an append filter on each rank (such as what is done in the vtkRedistributeDataSetFilter) is also a good practice for this problem as long as the on-rank partitioning doesn’t have any semantic meaning.

charly_bollinger · June 5, 2023, 7:58am

The new ProcessIds attribute for VtkDataSetAttributes has been merged into VTK master since this commit. A new filter has been added in order to fill in the array: vtkGenerateProcessIds.

Work on the vtkGhostCellsGenerator will follow. I will let you know.