Faster method for GetPoint

I have figured out that doing this costs a lot of time when it is called many times:

double ptA[3], ptB[3];

pd->GetPoint(poly[i], ptA);
pd->GetPoint(poly[j], ptB);

Is there a more direct method without copying?

This does not work, because afterwards ptA and ptB point to the same memory:

double *ptA, *ptB;

ptA = pd->GetPoint(poly[i]);
ptB = pd->GetPoint(poly[j]);

I find this quite annoying, too. According to profiling on Windows, mesh processing filters often spend 10-30% of their time in GetPoint calls (more accurately, vtkAOSDataArrayTemplate<float>::GetTuple). Random memory access can be slow, but somehow I find it hard to believe that this really has to be this slow.

I think the situation became significantly worse (filters slowed down by about 5-15%) when support for alternative memory layouts was added (e.g., to allow storing point coordinates as x0, x1, x2, …, y0, y1, y2, …, z0, z1, z2, … instead of the default x0, y0, z0, x1, y1, z1, …). It would be nice if memory were accessed directly when the default layout is used. I see that the current solution is more elegant and general, but I’m not sure the cost is justified.
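For context, this is the kind of alternative layout in question. A rough, untested sketch of backing a vtkPoints with an SOA array via vtkSOADataArrayTemplate (the function name and sizes are just illustrative):

#include <vtkNew.h>
#include <vtkPoints.h>
#include <vtkSOADataArrayTemplate.h>

// Store coordinates as x0,x1,... y0,y1,... z0,z1,... instead of interleaved
// x0,y0,z0,... -- GetPoint() then has to gather one value from each block,
// so there is no contiguous (x, y, z) triplet it could return a pointer to.
void useSoaPoints(vtkPoints* points, vtkIdType numPoints)
{
  vtkNew<vtkSOADataArrayTemplate<float>> coords;
  coords->SetNumberOfComponents(3);
  coords->SetNumberOfTuples(numPoints);
  points->SetData(coords.GetPointer());
}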

@Ron84, to move things forward, we would probably need some hard data. For example, hack in a GetPointDirect method that returns a pointer into the array, then run the same algorithm with this method and with the original GetPoint method to see what performance improvement is achievable.
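For what it’s worth, such a hack could look roughly like this (untested sketch, for benchmarking only; it returns a direct pointer only when the points happen to be stored in the default AOS layout with double precision, and nullptr otherwise, so the caller must handle the fallback):

#include <vtkDoubleArray.h>
#include <vtkPoints.h>
#include <vtkPolyData.h>

// Hypothetical GetPointDirect: bypasses the abstraction by downcasting to
// the concrete AOS double array and returning a pointer into its storage.
inline const double* GetPointDirect(vtkPolyData* pd, vtkIdType id)
{
  vtkDoubleArray* da = vtkDoubleArray::SafeDownCast(pd->GetPoints()->GetData());
  return da ? da->GetPointer(3 * id) : nullptr;
}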

1 Like

VTK’s data model has a lot of convenience functions that are useful for prototyping algorithms to produce something that works quickly, but that convenience comes at the cost of performance. GetPoint is one of these “slow but convenient” functions.

The slowness of GetPoint comes from several places:

  • It abstracts access to an array with an arbitrary valuetype – a vtkPoints object is not restricted to use doubles for storage, but may use floats, ints, shorts, char, etc. Even bools will work, for some reason! There is an implied typecast from the storage type to the double type, and checking the valuetype each time GetPoint is called adds significant overhead.
  • In addition to the value type abstraction, the memory layout of the point array isn’t known until runtime, either. The tuples may not be stored contiguously in memory, and redetermining the memory layout for each point further increases cost.
  • These typecasts are performed via either a virtual call, a static dispatch, or both. So these cannot be inlined well by the compiler and will require some branches to execute, adding even more overhead.
  • As you’ve noticed, these require each point’s tuple to be copied. Even the methods that return a pointer must perform an internal copy – there’s no other way to return a double* when the storage is float. More overhead.
  • The internal double buffer is shared and reused by each call to GetPoint. This makes the method fragile and unsafe from threaded code, and can cause subtle bugs like the one you found.

For these reasons, I generally despise these functions in cases where performance matters. My view of them is that they’re great for prototyping but bad for production. Others may disagree, but that’s been my experience after doing a fair bit of perf work in VTK.

To get the most performance from an algorithm, the vtkArrayDispatch and vtk::DataArrayValueRange/vtk::DataArrayTupleRange utilities should be used. vtkArrayDispatch is used to figure out what valuetype + memory layout a vtkDataArray uses, while the range utilities abstract access to an array’s data at compile time. This approach eliminates the perf issues mentioned above:

  • The memory layout and valuetype are determined once, rather than repeatedly looking them up per-point.
  • Templates are used to avoid virtual calls, which allows inlining, type-safety, and direct memory access. There is a single static dispatch, and after that the memory accesses will be optimized by the compiler.
  • No copies necessary – accessing data through the range iterators will optimize to raw memory loads and stores.
  • Thread safe – no hidden shared state.

These use modern techniques and features of C++ that aren’t often seen in VTK, so there can definitely be a learning curve when starting to use them. However, they do provide a massive boost to performance – initial benchmarks revealed a 3x overall speedup when compared to the virtual + typecast equivalents, and algorithms that use them produce the same assembly in tight loops as the equivalent code using raw data pointers. This blog post gives an overview of how they work and some links to more detailed documentation.
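For a quick taste of the range API on its own, here is a minimal (untested) sketch. The ranges also accept a plain vtkDataArray*; that path is convenient but still goes through the virtual API per element, and the dispatch is what removes that overhead:

#include <algorithm>
#include <limits>

#include <vtkDataArrayRange.h>
#include <vtkPoints.h>

// Find the smallest x coordinate through the generic (non-dispatched) range.
double minX(vtkPoints* points)
{
  double result = std::numeric_limits<double>::max();
  for (const auto tuple : vtk::DataArrayTupleRange<3>(points->GetData()))
  {
    result = std::min(result, static_cast<double>(tuple[0]));
  }
  return result;
}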

4 Likes

As Allison said, if you want performance you need to use the newer dispatch methods coupled with templating etc. The abstractions take a little effort to get your head around, but the performance gains are significant.

There is also the matter of algorithm design, which is generally more important than tuning data access. In the old days most everything was serial. Again, rethinking basic approaches and redesigning parallel algorithms can have huge benefits. For example, I have routinely seen 10-100x gains (even 1000x in one case) using new algorithm designs and the great dispatch methods that Allison introduced. The good news is that this newer coding style is making its way into VTK, so there are more examples to look at.

To help improve performance, please identify the specific filters that are considered too slow. Then the community can prioritize a list and we can work through it to improve speeds.

3 Likes

Thanks for the very useful insights! It is good to know that there is already a better design available; it “just” has to be applied more widely.

vtkWindowedSincPolyDataFilter would be at the top of my priority list, as it is the performance bottleneck when we convert from image to 3D surface. Image-to-polydata conversion speed has improved dramatically with the flying edges filter, so now only smoothing slows things down.

Faster vtkImageReslice would help us a lot, too, but that filter already looks pretty optimized, so I’m not sure how much can be done.

1 Like

Most excellent, thank you. Please keep the list coming.

A comment and question about vtkImageReslice: it is a beast, the Swiss Army knife of imaging filters, with tons of options, permutations, etc. Are there one or two workflows that are especially important in your application? We can then either focus on those workflows, or create specialty filters addressing those specific workflows.

@AllisonVacanti Thanks for that detailed answer. I need it for that class: https://github.com/zippy84/vtkbool/blob/test/vtkPolyDataContactFilter.cxx. On very detailed meshes this part of the library takes about 95% of the execution time, which is not a good ratio. You gave a lot of information with your two links; I hope I can improve things with it.

Any help is welcome :slightly_smiling_face:

Andras, I checked out vtkWindowedSincPolyDataFilter this morning. There is a lot of room for improvement; one of the main issues is that memory is being allocated/deleted in small pieces. There are many loops that can be threaded (e.g., normalization, computing Chebyshev coefficients, local analysis of topology on a vertex-by-vertex basis, etc.). The algorithm itself is iterative, so there is no easy way around that. And yes, there are many virtual GetPoint() and related calls that could be replaced. The problem is finding the time to do this and tracking down representative datasets. If you can point me to / send me representative examples, it will help us prioritize efforts.
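For reference, threading such a loop with vtkSMPTools typically looks something like the following (a minimal sketch; the function name and data are purely illustrative):

#include <vector>

#include <vtkSMPTools.h>
#include <vtkType.h>

// Scale a per-point quantity in parallel. vtkSMPTools::For hands each
// thread a contiguous [begin, end) chunk of the index range.
void scaleInParallel(std::vector<double>& values, double scale)
{
  vtkSMPTools::For(0, static_cast<vtkIdType>(values.size()),
    [&](vtkIdType begin, vtkIdType end) {
      for (vtkIdType i = begin; i < end; ++i)
      {
        values[i] *= scale;
      }
    });
}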

I’m not actively working on VTK anymore, so I’m a bit rusty and there may be some errors in this, but here’s an example that’s similar to what I’m seeing in that filter.

Both foo and bar take a vtkPolyData* and a num. They iterate through the first num points and call doSomething() for each point with x < 0.

foo uses the convenience API, while bar uses more efficient APIs for float and double data and falls back to slower generic APIs for other valuetypes. Using fallbacks like this ensures that the algorithm will always run regardless of input, and also allows the number of generated “fast-paths” to be limited. This can reduce binary size and compilation time for large algorithms.

////////////////////////////////////////////////////////////////////////////////
// Original code using vtkPolyData::GetPoint:

#include <vtkPolyData.h>

void doSomething(); // placeholder, assumed to be defined elsewhere

// Look at the points [0,num), and for each point with x < 0, call doSomething()
void foo(vtkPolyData *pd, vtkIdType num)
{
  double pt[3];
  for (vtkIdType i = 0; i < num; ++i)
  {
    pd->GetPoint(i, pt);

    if (pt[0] < 0) // x < 0
    {
      doSomething();
    }
  }
}

////////////////////////////////////////////////////////////////////////////////
// Using range dispatch

#include <vtkArrayDispatch.h>
#include <vtkDataArrayRange.h>

struct bar_impl
{
  template <typename ArrayType>
  void operator()(ArrayType* pointArray, vtkIdType num)
  {
    // Create a range that accesses 3-tuples from pointArray:
    auto points = vtk::DataArrayTupleRange<3>(pointArray);

    for (vtkIdType i = 0; i < num; ++i)
    {
      // Check the x coordinate of the i'th point:
      if (points[i][0] < 0) // x < 0
      {
        doSomething();
      }
    }
  }
};

void bar(vtkPolyData *pd, vtkIdType num)
{
  // Extract the point array from the polydata:
  vtkDataArray *pointArray = pd->GetPoints()->GetData();

  // Create the functor
  bar_impl func;

  // Configure a Dispatcher. Since this is point data, it's most likely going to
  // be a Real value type (e.g. float or double). Create a fast-path for those
  // that will inline memory accesses:
  using Dispatcher = vtkArrayDispatch::DispatchByValueType<vtkArrayDispatch::Reals>;
  // Attempt to execute the fast paths
  if (!Dispatcher::Execute(pointArray, func, num))
  { // If Execute fails, the fast path is not suitable for pointArray.
    // To make sure we still handle these cases, call the functor with the
    // vtkDataArray. The TupleRange in bar_impl will use the slower,
    // generic API for these:
    func(pointArray, num);
  }
}

At first glance, bar appears to be doing a lot more work, but keep in mind that all of this work (and more!) is happening behind the scenes of the GetPoint call, except that here it only happens once vs. num times :slight_smile:

Thanks for your reply. In my case I’m iterating over the points of single polygons. How to handle this? At the top there is the foo function. Is this a fallback? You don’t use it in the code below.

At the top there is the foo function. Is this a fallback? You don’t use it in the code below.

foo and bar do the same thing: foo is using the convenience API, while bar is the optimized version of foo.

In my case I’m iterating over the points of single polygons. How to handle this?

VTK has some examples of doing random access work using these APIs. It sounds like you’ll want something similar to what the vtkCellDataToPointData filter does.

vtkCellDataToPointData

This filter uses a large functor to implement most of the algorithm: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Filters/Core/vtkCellDataToPointData.cxx#L48-135

It is dispatched once, passing in two arrays, the Spread functor, and several algorithm parameters. Both arrays will have their valuetypes and memory layouts resolved by the dispatch: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Filters/Core/vtkCellDataToPointData.cxx#L488-495

Inside the functor, the destination array is initialized to 0 and tuple ranges for the input and output arrays are created:

    // Both arrays will have the same value type:
    using T = vtk::GetAPIType<SrcArrayT>;

    // zero initialization
    std::fill_n(vtk::DataArrayValueRange(dstarray).begin(), npoints * ncomps, T(0));

    const auto srcTuples = vtk::DataArrayTupleRange(srcarray);
    auto dstTuples = vtk::DataArrayTupleRange(dstarray);

Next, the cells are walked through, scattering each cell’s data into the point array.

      for (vtkIdType cid = 0; cid < ncells; ++cid)
      {
        vtkCell* cell = src->GetCell(cid);
        if (cell->GetCellDimension() >= highestCellDimension)
        {
          const auto srcTuple = srcTuples[cid];
          vtkIdList* pids = cell->GetPointIds();
          for (vtkIdType i = 0, I = pids->GetNumberOfIds(); i < I; ++i)
          {
            const vtkIdType ptId = pids->GetId(i);
            auto dstTuple = dstTuples[ptId];
            // accumulate cell data to point data <==> point_data += cell_data
            std::transform(srcTuple.cbegin(), srcTuple.cend(), dstTuple.cbegin(), dstTuple.begin(),
              std::plus<T>());
          }
        }
      }

Note that this is not the most efficient way to traverse cells these days – see the vtkCellArray docs for more info.

After that, the point data is cleaned up, but that’s not relevant to just doing random access into an array using cell information.

You’ll want to do something similar with your point array, encapsulating all of the random lookup operations into a single functor execution. It may take some refactoring of the algorithm to accomplish this. But once in the functor, create a range object and do something like pointTupleRange[cellPtIdx] to do random access using the cell indices.
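For instance, here is a rough, untested sketch of that pattern applied to a single cell’s point ids. The names processPoint and doSomeForCell are hypothetical placeholders, and ideally the dispatch would wrap a larger piece of the algorithm rather than one cell:

#include <vtkArrayDispatch.h>
#include <vtkDataArrayRange.h>
#include <vtkIdList.h>
#include <vtkNew.h>
#include <vtkPoints.h>
#include <vtkPolyData.h>

void processPoint(double x, double y, double z); // hypothetical per-point work

struct CellPointsWorker
{
  template <typename ArrayType>
  void operator()(ArrayType* pointArray, vtkIdList* cellPtIds)
  {
    // Random access into the point array via the cell's point ids:
    const auto points = vtk::DataArrayTupleRange<3>(pointArray);
    for (vtkIdType i = 0; i < cellPtIds->GetNumberOfIds(); ++i)
    {
      const auto pt = points[cellPtIds->GetId(i)];
      processPoint(pt[0], pt[1], pt[2]);
    }
  }
};

void doSomeForCell(vtkPolyData* pd, vtkIdType cellId)
{
  vtkNew<vtkIdList> cellPtIds;
  pd->GetCellPoints(cellId, cellPtIds.GetPointer());

  vtkDataArray* pointArray = pd->GetPoints()->GetData();
  CellPointsWorker worker;

  using Dispatcher = vtkArrayDispatch::DispatchByValueType<vtkArrayDispatch::Reals>;
  if (!Dispatcher::Execute(pointArray, worker, cellPtIds.GetPointer()))
  { // Fallback: use the slower generic range for other value types.
    worker(pointArray, cellPtIds.GetPointer());
  }
}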

InsertTuples

For a simpler example of basic random access, vtkDataArray::InsertTuples does a mixed gather and scatter operation to shuffle tuples between two arrays:

Functor: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Common/Core/vtkDataArray.cxx#L303-329
Invocation: https://gitlab.kitware.com/vtk/vtk/-/blob/master/Common/Core/vtkDataArray.cxx#L619-623

1 Like

Thanks a lot for looking into this! I’ve created a new topic with a link to test data and to discuss this further: vtkWindowedSincPolyDataFilter performance improvement

Yes, this is a very complicated filter indeed. We utilize most of its features: it can operate on a wide range of data types (integer and floating-point, one or more components) and images of practically any size, apply (non-linear) transforms, and work faster when the reslicing axes are aligned with the image axes.

However, there is one critical use case: quick extraction of an image slice. We use this filter for displaying 2D slices of an image volume, so if this could be made any faster, it would visibly improve slice browsing speed. Total display time includes sending the image data to a texture so that it can be rendered as an image plane. This upload to the GPU probably takes considerable time, too, and I’m not sure how well that part is optimized for this use case (very frequent texture updates, without changing the texture’s size or data type).

Reslicing on the GPU could be fast, but supporting arbitrary data types, dynamic application of all kinds of non-linear transforms, large images, etc. would be a lot of work to implement on the GPU.

@AllisonVacanti on a related note, I was hoping you could provide some guidance.

I am dispatching on a VTK data array, and at the same time I have a second array with exactly the same type, number of tuples, and number of components (see code below). There is no need for double dispatch because the types are always the same.

I am wondering about the best way to cast the second array (from newPts), traverse it, and assign values to it. Basically I am running down these two arrays and assigning values from the first to the second. I’ve fooled around a little bit, but I suspect that you can provide a more efficient and elegant solution.


// Support dispatch of different types
struct InitializePointsWorker
{
  template <typename DataT>
  void operator()(DataT* pts, vtkIdType numPts, vtkPoints *newPts,
                  int normalize, double length, double center[3])
  {
    vtkSMPTools::For(0,numPts, [&](vtkIdType ptId, vtkIdType endPtId) {

      const auto in = vtk::DataArrayTupleRange<3>(pts, ptId, endPtId);
      double x[3];

      for (const auto tuple : in)
      {
        x[0] = static_cast<double>(tuple[0]);
        x[1] = static_cast<double>(tuple[1]);
        x[2] = static_cast<double>(tuple[2]);

        if ( normalize )
        {
          x[0] = (x[0] - center[0]) / length;
          x[1] = (x[1] - center[1]) / length;
          x[2] = (x[2] - center[2]) / length;
        }

        //Now set the value of the new points


      }//for all points

    }); //end lambda

  }
};

// Initialize points including possibly normalizing them.
vtkPoints *InitializePoints(int normalize, vtkPolyData *input,
                            double &length, double center[3])
{
  vtkPoints *inPts = input->GetPoints();
  vtkIdType numPts = inPts->GetNumberOfPoints();
  vtkPoints *newPts = vtkPoints::New();
  newPts->SetDataType(inPts->GetDataType());
  newPts->SetNumberOfPoints(numPts);

  // May need to grab normalization info
  if ( normalize )
  {
    length = input->GetLength();
    input->GetCenter(center);
  }

  using vtkArrayDispatch::Reals;
  using InitializePointsDispatch = vtkArrayDispatch::DispatchByValueType<Reals>;
  InitializePointsWorker initPtsWorker;
  if ( !InitializePointsDispatch::Execute(inPts->GetData(),initPtsWorker,
                                          numPts,newPts,normalize,length,center) )
  { // Fallback to slowpath for other point types
    initPtsWorker(inPts->GetData(),numPts,newPts,normalize,length,center);
  }

  return newPts;
}

1 Like

Hi Will,

In these cases, it’s important to be careful with the distinction between arrays of the same type and arrays of the same value type.

In this snippet, newPts will always construct a default layout AOS array with the same valuetype as inPts – but if inPts contains a user-provided SOA array, then these arrays will not have the same type (only the same valuetype) and a double dispatch is needed.

To handle this case (same valuetype, possibly different memory layout), the vtkArrayDispatch::Dispatch2BySameValueType<vtkArrayDispatch::Reals> dispatcher should be used. When VTK is built with SOA support, it will generate fast-paths for:

Path#  inPts       newPts
1      AOS,float   AOS,float
2      AOS,float   SOA,float
3      SOA,float   AOS,float
4      SOA,float   SOA,float
5      AOS,double  AOS,double
6      AOS,double  SOA,double
7      SOA,double  AOS,double
8      SOA,double  SOA,double

When SOA support is off, only paths 1 and 5 will be generated (which turns into a single dispatch plus a downcast).

For an example, this pattern is currently used in vtkPointCloudFilter to accomplish a similar goal.
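Adapted to your snippet, the double-dispatch version could look roughly like this (an untested sketch; the vtkSMPTools threading from your snippet is omitted for brevity):

#include <vtkArrayDispatch.h>
#include <vtkDataArrayRange.h>
#include <vtkPoints.h>

// Untested sketch: the worker now receives both arrays with full type
// information, so the output can be written through a tuple range as well.
struct InitializePointsWorker2
{
  template <typename InArrayT, typename OutArrayT>
  void operator()(InArrayT* inArray, OutArrayT* outArray, vtkIdType numPts,
                  int normalize, double length, double center[3])
  {
    using OutT = vtk::GetAPIType<OutArrayT>;

    const auto in = vtk::DataArrayTupleRange<3>(inArray);
    auto out = vtk::DataArrayTupleRange<3>(outArray);

    for (vtkIdType ptId = 0; ptId < numPts; ++ptId)
    {
      for (int c = 0; c < 3; ++c)
      {
        double x = static_cast<double>(in[ptId][c]);
        if (normalize)
        {
          x = (x - center[c]) / length;
        }
        out[ptId][c] = static_cast<OutT>(x);
      }
    }
  }
};

// Invocation -- both arrays are handed to the dispatcher:
//
//   using Dispatcher =
//     vtkArrayDispatch::Dispatch2BySameValueType<vtkArrayDispatch::Reals>;
//   InitializePointsWorker2 worker;
//   if (!Dispatcher::Execute(inPts->GetData(), newPts->GetData(), worker,
//                            numPts, normalize, length, center))
//   { // Fallback for non-Real value types:
//     worker(inPts->GetData(), newPts->GetData(),
//            numPts, normalize, length, center);
//   }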

OTOH, if both arrays are known to have exactly the same valuetype and memory layout, you can just dispatch one of them, pass the other as a vtkDataArray*, and then cast it inside the functor once the full type information is available:

struct Worker
{
  template <typename ArrayType>
  void operator()(ArrayType *array1, vtkDataArray *dataArray2)
  {
    ArrayType *array2 = vtkArrayDownCast<ArrayType>(dataArray2);
    // do work with array1 / array2
  }
};


void doDispatch(vtkDataArray *dataArray1, vtkDataArray *dataArray2)
{
  using Dispatcher = vtkArrayDispatch::DispatchByValueType<vtkArrayDispatch::Reals>;

  // This is safe if and only if both arrays are known to have 
  // exactly the same type.
  if (!Dispatcher::Execute(dataArray1, Worker{}, dataArray2))
  { // Fallback in case valuetype is not `Real`.
    Worker{}(dataArray1, dataArray2);
  }
}

2 Likes

Most excellent, thanks for the insight and clarification. I have found the dispatch mechanism to be an amazing piece of software engineering, plus it really cleans up the code.

1 Like

Thanks :slight_smile: I’m glad you’ve found them useful!

@AllisonVacanti How would you change this code?

void DoSome (vtkPolyData *pd, vtkIdType cellId) {
    vtkIdType num, *poly, i;
    pd->GetCellPoints(cellId, num, poly);
    double p[3];
    for (i = 0; i < num; i++) {
        pd->GetPoint(poly[i], p);
        // do something with it ...
    }
}

I will not iterate over the entire polydata.

This is a common task and should be documented.

Since when is vtkArrayDispatch available?

@Ron84 As I said, you may have to refactor the algorithm while doing perf optimizations. Is DoSome called from a common place, perhaps inside another loop?

Before putting much work into this, I’d run a profiler on the code to see if this particular call to GetPoint is really a significant contributor to the runtime. If it is, then restructure the algorithm to allow those calls to be grouped within a single dispatch. This may take quite a bit of effort, so profile first. If it’s not a significant bottleneck, I wouldn’t worry about it – there’s not much that can be done to improve a solitary GetPoint call.

This is a common task and should be documented.

See the blog post I linked earlier, as well as the blog post linked from that, the examples for dispatches and the examples for ranges, the headers for vtkArrayDispatch and the ranges, and there’s always the actual uses in VTK to learn from. There’s a fair bit of documentation around.

Since when is vtkArrayDispatch available?

I would like to use this post to ask if there is a quicker way of looping through the cells of a vtkUnstructuredGrid object. The GetCell() method might allocate memory, and its extra copy operations might slow down cell access.

Thank you!

vtkCellArray::Visit offers the most efficient access to cell array internals in current VTK. This is essentially a streamlined dispatch that provides access to a VisitState object, which implements an optimized API for interacting with the raw offset and connectivity arrays. Using the Visit method will not allocate memory. This technique is used heavily throughout the implementation of vtkCellArray itself, so feel free to poke around in the source to see some additional examples.

Also, the vtkCellArray class documentation points to some other (more convenient, less efficient) methods, like vtkCellArrayIterator, etc. All of these methods may allocate memory and will have some degree of abstraction overhead.

You can get a pointer to the unstructured grid’s vtkCellArray by calling vtkUnstructuredGrid::GetCells().
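To give a flavor of what Visit looks like, here is an untested sketch; the VisitState method names used (GetNumberOfCells, GetCellSize) are from memory, so check vtkCellArray.h for the exact API:

#include <algorithm>

#include <vtkCellArray.h>
#include <vtkUnstructuredGrid.h>

// A Visit functor that finds the largest cell size without copying any
// connectivity data. The offset/connectivity arrays are already resolved
// to their concrete types when operator() is instantiated.
struct MaxCellSize
{
  template <typename CellStateT>
  vtkIdType operator()(CellStateT& state) const
  {
    vtkIdType maxSize = 0;
    const vtkIdType numCells = state.GetNumberOfCells();
    for (vtkIdType cellId = 0; cellId < numCells; ++cellId)
    {
      maxSize = std::max(maxSize, state.GetCellSize(cellId));
    }
    return maxSize;
  }
};

// Usage:
//   vtkCellArray* cells = ugrid->GetCells();
//   vtkIdType maxSize = cells->Visit(MaxCellSize{});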

1 Like