Why
When dealing with massive datasets, you need to use distributed algorithms to parallelize post-processing. The data is then distributed across multiple machines, so each process has no knowledge of the data held by the other processes. Some types of algorithms need neighborhood information to process the data correctly, which is where the concept of “ghost” data comes in.
Ghost data is duplicate data that exists on several processes simultaneously in order to provide neighborhood information at partition interfaces when needed. One cell might have several ghost instances, but exactly one process always retains ownership of it, and only that owner can update the cell's value on all the other processes.
When applying changes in the pipeline, it would be useful to know which process owns which ghost entity so as to enable optimized asynchronous communication during data synchronization.
Use case
Let’s say we have some massive dataset in a pipeline needing ghost cells.
When there is a pipeline change, data associated with the ghost cells can be out of date.
Currently, the only solution is to regenerate ghost cells → that’s expensive.
With better knowledge of the distribution of ghost data, we could just update ghost cell data by making requests directly to the processes that own them → much cheaper.
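To make the cheaper path concrete, here is a minimal sketch of how a process could use a per-cell owner array to work out which processes it needs to ask for fresh values. It is plain C++ and does not use VTK; the function name GroupGhostCellsByOwner and the use of std::vector<int> for the process IDs are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, not part of VTK.
// processIds[i] holds the rank that owns local cell i. Cells owned by another
// rank are ghosts on this rank, so they are grouped by owner: each map entry
// is the list of local cell indices to request from that owner.
std::map<int, std::vector<std::size_t>> GroupGhostCellsByOwner(
  const std::vector<int>& processIds, int localRank)
{
  assert(localRank >= 0);

  std::map<int, std::vector<std::size_t>> requests;
  for (std::size_t cellId = 0; cellId < processIds.size(); ++cellId)
  {
    const int owner = processIds[cellId];
    if (owner != localRank)
    {
      // Ghost cell: its up-to-date value lives on the owner rank.
      requests[owner].push_back(cellId);
    }
  }
  return requests;
}
```

Each entry of the returned map corresponds to one point-to-point exchange with an owner, so a process only talks to the ranks it actually holds ghosts from. The actual communication layer is left out of this sketch.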
How
The idea behind this feature request is to add a supplemental data attribute that would store the owner process ID of each geometric entity.
As such, the proposal is to add a PROCESSIDS data array reference in the vtkDataSetAttributes class, dedicated to pointing to this special data array. Similarly to the SCALARS or VECTORS attributes, vtkDataArray* vtkDataSetAttributes::GetAttribute(int attributeType) would be able to return the array using the new attribute type added to the enum (link).
Filters generating arrays, like GenerateIDs, could be augmented with ProcessIds array generation, and filters generating this array, like vtkProcessIdScalars, would be updated to use this new attribute. Filters needing this array could generate it dynamically.
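The "dedicated attribute slot" idea can be sketched with a toy model. This is not the real vtkDataSetAttributes and does not link against VTK: ToyAttributes, ToyAttributeType and their members are hypothetical stand-ins, and the enum values are arbitrary. It only shows how an enum entry can point at one array among many, so that consumers ask for the process-ID array by type instead of by name.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for the attribute types; PROCESSIDS is the proposed addition.
enum ToyAttributeType
{
  SCALARS = 0,
  VECTORS,
  PROCESSIDS,
  NUM_ATTRIBUTES
};

// Toy container: arrays are represented by their names only.
class ToyAttributes
{
public:
  ToyAttributes() { this->ActiveIndex.fill(-1); }

  // Registers an array and returns its index.
  int AddArray(const std::string& name)
  {
    this->Names.push_back(name);
    return static_cast<int>(this->Names.size()) - 1;
  }

  // Marks the array at arrayIndex as the one to return for attributeType.
  void SetActiveAttribute(int arrayIndex, int attributeType)
  {
    this->ActiveIndex[static_cast<std::size_t>(attributeType)] = arrayIndex;
  }

  // Returns the array flagged for attributeType, or nullptr if none is set.
  const std::string* GetAttribute(int attributeType) const
  {
    const int idx = this->ActiveIndex[static_cast<std::size_t>(attributeType)];
    return idx < 0 ? nullptr : &this->Names[static_cast<std::size_t>(idx)];
  }

private:
  std::vector<std::string> Names;
  std::array<int, NUM_ATTRIBUTES> ActiveIndex;
};
```

With this shape, a filter that produces process IDs would flag its output array as PROCESSIDS, and a filter that consumes them would call GetAttribute(PROCESSIDS) and generate the array itself if it gets nothing back.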
Feedback
Any feedback about how this should be done would be appreciated. This is just an overall idea of the solution. Other approaches could be used if justified. Thank you for your feedback! =)