vtkWindowedSincPolyDataFilter performance improvement

lassoan · May 9, 2020, 8:06pm

This topic is a follow-up of modernization/performance improvement of filters - focusing on vtkWindowedSincPolyDataFilter optimization.

Here are sample VTK polydata data sets that we typically work with. They are created from medical images by thresholding or more sophisticated segmentation methods.

Download link: https://1drv.ms/u/s!Arm_AFxB9yqHu5QrlI-CW9Wi0B1_Rw?e=oUrPxm

Large files: Original segmentations (Bones.vtk, Skin.vtk), and from smooth labelmap (BonesSmoothed.vtk, SkinSmoothed.vtk). Flying edges filter can generate these polydata from images in 1-2 seconds. It would be great if smoothing would not add too much time to this.
Brainatlas: Contains small meshes in a variety of sizes. May be used to test that there is no significant overhead when segmenting many small polydata.

will.schroeder · May 10, 2020, 12:19pm

Great data, thank you. I will slowly start working through this filter in the time I can muster. I will likely ask you to try out a branch periodically. My intuition suggests that one of the biggest issues is the construction of the connectivity which ~ consists of one vtkIdList per point. The time to allocate and deallocate is significant - this needs to be rewritten. Next there are lots of loops that can be threaded, and of course the data access via GetPoint() is slow. Finally, I noted that the output points type is always float, is there any reason not to support the standard VTK ability to support user specification of float or double output, or a default behavior that whatever point type comes in, the same type is created for output?

will.schroeder · May 10, 2020, 3:07pm

Andras please provide me with a simple script, or otherwise provide me with the values you typically use for NumberOfIterations, PassBand value, NormalizeCoordinates flag, FeatureEdgeSmoothing flag, FeatureAngle, EdgeAngle, and the other data members etc. I want to make sure to replicate your typical workflow.

will.schroeder · May 10, 2020, 3:23pm

For example on my 40-thread desktop with the python script below I am seeing ~4.5secs for the Bones.vtk data (which I agree seems slow to me). Are these values consistent with your workflow?

smooth = vtk.vtkWindowedSincPolyDataFilter()
smooth.SetInputConnection(reader.GetOutputPort())
smooth.SetNumberOfIterations(20)
smooth.BoundarySmoothingOn()
smooth.SetFeatureAngle(120)
smooth.SetEdgeAngle(90)
smooth.SetPassBand(0.1)

pieper · May 10, 2020, 3:47pm

Hi Will!

Great to have your digging into this

We use this filter to update the 3D view of Segmentations in Slicer. Users really benefit from real-time updates when they make edits, so speeding this up will have many practical benefits.

This code is probably the hotspot for us:

github.com

Slicer/Slicer/blob/47deb76d7556e40de4e25e585c4b24a63a153da5/Libs/vtkSegmentationCore/vtkBinaryLabelmapToClosedSurfaceConversionRule.cxx#L338-L356


if (smoothingFactor > 0)
  {
  vtkSmartPointer<vtkWindowedSincPolyDataFilter> smoother = vtkSmartPointer<vtkWindowedSincPolyDataFilter>::New();
  smoother->SetInputData(processingResult);
  smoother->SetNumberOfIterations(20); // based on VTK documentation ("Ten or twenty iterations is all the is usually necessary")
  // This formula maps:
  // 0.0  -> 1.0   (almost no smoothing)
  // 0.25 -> 0.1   (average smoothing)
  // 0.5  -> 0.01  (more smoothing)
  // 1.0  -> 0.001 (very strong smoothing)
  double passBand = pow(10.0, -4.0 * smoothingFactor);
  smoother->SetPassBand(passBand);
  smoother->BoundarySmoothingOff();
  smoother->FeatureEdgeSmoothingOff();
  smoother->NonManifoldSmoothingOn();
  smoother->NormalizeCoordinatesOn();
  smoother->Update();
  processingResult = smoother->GetOutput();
  }

will.schroeder · May 13, 2020, 6:23pm

Steve I am curious, I notice that you are using NonManifoldSmoothingOn() (while the boundary and feature edge smoothing is off). Looking into the code, I see that we can realize a significant performance gain if nonmanifold, boundary, and feature edge smoothing are all off. If the input to this filter is coming from flying edges / marching cubes (which use manifold-producing case tables), then there should be no non-manifold meshes output. I am going to add the optimization in and later we can decide if non-manifold smoothing is actually needed. Basically, it means that the BuildLinks() invocation can be skipped, which despite being threaded, still takes a good chunk of time.

BTW as initially written, the vtkWindowedSincPolyDataFilter is inherently serial in the way it modifies data in a co-dependent manner (e.g., when performing topological analysis and building the smoothing edges network). I’m having to completely rewrite these steps for the purposes of threading etc. There are also some suspect decisions made when smoothing polylines - this should be rethought (although since few people use the filter for line smoothing it may not matter).

pieper · May 13, 2020, 6:29pm

Hi Will - very good point - in this use case we should not have non-manifold meshes. Not sure why the flag is on. We definitely won’t have polylines in this pipeline.

will.schroeder · May 15, 2020, 1:08pm

Just a quick status update. I got the full filter framework running: threading, templates, array dispatch, new data structure / algorithm design (to eliminate memory fragmentation and enable threading). Also optimization paths for different types of smoothing (simple, boundary, feature, nonmanifold).

(I’m probably going to be sorry that I am saying the following but here goes…) I am seeing ~20x speedup on my machine at this point (Bones.vtk, my desktop 40 thread machine, highest level of algorithm optimization - no boundary, feature, nonmanifold edges). I expect as we clean this up the speeds will likely drop, but I am now seeing from about ~4.5secs to 0.225secs. Of course, all it takes is one bug to change everything so don’t count on this (yet) but I am hopeful

This is an interesting algorithm to work on because threading, algorithm/data structure redesign, templating / dispatch - each make a big difference. I wish I could say what’s most important but it’s hard to separate out the effects one from another. For example, while threading is important, because I had to use an array of std::atomic, it’s likely that thread-scalability suffers. And there was a tremendous use of GetPoint/SetPoint during iteration so dispatch really cut out a lot of data access overhead. The algorithm itself was decidedly serial and used excessive amounts of fragmented memory…

Anyway I need to come up for air for a little while as next week is very demanding. I’ll keep poking at the algorithm when I can and push a MR in a week or two for you to start testing. I also want to do a little benchmarking etc. moving forwards.

will.schroeder · May 15, 2020, 1:16pm

Before I leave this topic I hope it’s clear that there are so many performance opportunities in VTK - if you are into this stuff it’s like a kid in a candy store :-). When we originally wrote VTK in our spare time as part of a book it was really an exercise in prototyping ideas and code. Since that time there have been steady improvements in implementation, data model, algorithm design, performance, software process and so on. I encourage folks to learn the new stuff (like dispatch) and clean up some old code - it’s rewarding and fun!

lassoan · May 16, 2020, 3:53pm

This is huge progress, thank you very much.

Wow, this is amazing!

It would incredibly useful if we could make people do this. However, it is much harder to use these complex programming concepts than to do simple serial programming, so it would be important to have good training materials. Blogs, reference manuals, etc. are all somewhat useful but it would be important to add 1-2 new VTK textbook chapters that describe these.

lassoan · March 2, 2021, 2:04pm

Update: 3D Slicer has switched to VTK9 now and we had a chance to test the optimized filter extensively. The result is quite shocking: computation is indeed about 20-25 faster on various geometries, tested on different computers. Surface smoothing was the main reason for slow 3D display updates of segmentations, and now the computation time is negligible.

@will.schroeder If you think such kind of performance increase is achievable for other classes as well then more developers should be trained in the techniques you applied.

will.schroeder · March 2, 2021, 2:16pm

Thanks for the update, it’s good to hear that it is working well for Slicer.

It would be fun to write something like a “practical guide to threaded programming in VTK” etc. Especially if we do it with humor, and using a pragmatic take on various programming approaches.

rbumm · June 30, 2021, 3:49pm

Dear @will.schroeder,

It was mentioned during the 35th NA-MIC Project Week that you are the creator of the “Grow From Seeds” functionality within 3D Slicer.

I just wanted to say thank you.

This is an amazing segmentation tool that I use almost every day now for the segmentation of lung cancer and COVID-19 datasets.

Great work and thanks again !

Kind regards

Prof. Rudolf Bumm, MD
Department of Surgery
Kantonsspital Graubünden
Chur
Switzerland

will.schroeder · June 30, 2021, 3:53pm

I’m glad it’s working out for you. Thanks for the feedback,

lassoan · June 30, 2021, 3:58pm

Just to give a bit of more context: vtkWindowedSincPolyDataFilter is used in region growing and other interactive segmentation tools in Slicer for generating smoothed surface meshes for 3D visualization. Quick updates make a huge difference here, because the user needs to wait for the update after each paint stroke or other adjustment.

will.schroeder · June 30, 2021, 4:48pm

BTW if there are any filters (or other operations) that are too slow, it’d be good to hear about them. No promises, but we keep a running list and try to address the more impactful filters.