vtkPolyDataToImageStencil parallelization

chrisadamsonmcri · October 17, 2022, 7:19am

vtkPolyDataToImageStencil processes each slice of an image in serial and is slow when executed many times. There is a TODO: in the RequestData indicating that it would be easy to parallelize the method by processing each slice in parallel, for example.

int vtkPolyDataToImageStencil::RequestData(
vtkInformation* request, vtkInformationVector** inputVector, vtkInformationVector* outputVector)
{
this->Superclass::RequestData(request, inputVector, outputVector);

vtkInformation* outInfo = outputVector->GetInformationObject(0);

vtkImageStencilData* data =
vtkImageStencilData::SafeDownCast(outInfo->Get(vtkDataObject::DATA_OBJECT()));

int extent[6];
data->GetExtent(extent);
// ThreadedExecute is only called from a single thread for
// now, but it could as easily be called from ThreadedRequestData
this->ThreadedExecute(data, extent, 0);

return 1;
}

I have tried to look for examples as a guide how to do this but can’t find anything. Maybe I’m not looking in the right places. Would it be possible to get some pointers to an example of how to split an image by slices and execute a method in parallel. Then the task is to combine the stencils from each slice as a postprocessing task. It would be a huge deal for this method to be mutlithreaded for performance for me.

Anyway, thanks in advance,

Chris.

dgobbi · October 17, 2022, 11:14am

Hi Chris,

It’s not necessary to split the input or to recombine the output. Each worklet can read directly from the input vtkPolyData and write directly to the output vtkImageStencilData. Each slice of the vtkImageStencilData has independent memory allocation, so there will be no write conflicts between the threads.

The only “splitting” that should be needed is the splitting of the extent, which can be done in the worklet itself. For example, let’s say you have the following extent:

[ 0, 255, 0, 255, 0, 49 ]

We want to parallelize over the slices [0,49], so we do a parallel loop:

vtkSMPTools::For(extent[4], extent[5] + 1, worklet);

The worklet class must have an operator () that calls ThreadedExecute:

void operator()(int begin, int end)
{
  int subExtent[6] = { extent[0], extent[1],
                       extent[2], extent[3],
                       begin, end - 1 };
  algorithm->ThreadedExecute(data, subExtent, threadId);
}

chrisadamsonmcri · October 18, 2022, 6:34am

Thank you so much for your suggestion.

I have implemented something that I’m pretty sure works on both test data (the sphere example) and real data. I used VTK 9.2.2 released as the base. I have placed the diffs below. I hacked in a ParallelWorker class and had to make that a friend of the vtkPolyDataToImageStencil class which probably is a bit of a hack but it works. I would like to upload them so you can review and see if it will be useful. Could I get upload permission or is it ok to use something like pastebin?

Thanks again.

dgobbi · October 18, 2022, 7:41am

Smallish diffs can be posted inline between triple backticks:

    ```
    // code or diff here
    ```

The more formal code review process is sharing a topic branch on gitlab.

chrisadamsonmcri · October 18, 2022, 7:51am

Oh ok. Here are the diffs

--- /home/addo/dev/VTK/VTK/Imaging/Stencil/vtkPolyDataToImageStencil.h	2022-10-18 10:30:17.586273841 +1100
+++ vtkPolyDataToImageStencil.h	2022-10-18 16:14:09.815632570 +1100
@@ -65,12 +65,18 @@
 #include "vtkImageStencilSource.h"
 #include "vtkImagingStencilModule.h" // For export macro
 
+#include <iostream>
+
 class vtkMergePoints;
 class vtkDataSet;
 class vtkPolyData;
 
+
+
+
 class VTKIMAGINGSTENCIL_EXPORT vtkPolyDataToImageStencil : public vtkImageStencilSource
 {
+  friend class ParallelWorker;
 public:
   static vtkPolyDataToImageStencil* New();
   vtkTypeMacro(vtkPolyDataToImageStencil, vtkImageStencilSource);
@@ -120,4 +126,35 @@
   void operator=(const vtkPolyDataToImageStencil&) = delete;
 };
 
+class ParallelWorker
+{
+public:
+    ParallelWorker( int extent[6], vtkPolyDataToImageStencil* algorithm, vtkImageStencilData* data )
+    {
+      m_extent[0] = extent[0];
+      m_extent[1] = extent[1];
+      m_extent[2] = extent[2];
+      m_extent[3] = extent[3];
+      m_extent[4] = extent[4];
+      m_extent[5] = extent[5];
+      m_algorithm = algorithm;
+      m_data = data;
+    }
+    ~ParallelWorker(){ }
+
+    void operator()(int begin, int end)
+    {
+        int subExtent[6] = { m_extent[0], m_extent[1],
+                       m_extent[2], m_extent[3],
+                       begin, end - 1 };
+        //std::cout << "begin " << begin << " " << end << std::endl;
+        m_algorithm->ThreadedExecute(m_data, subExtent, 0);
+    }
+
+protected:
+    int m_extent[6];
+    vtkPolyDataToImageStencil* m_algorithm;
+    vtkImageStencilData* m_data;
+};
 #endif
+

--- /home/addo/dev/VTK/VTK/Imaging/Stencil/vtkPolyDataToImageStencil.cxx	2022-10-18 10:30:17.586273841 +1100
+++ vtkPolyDataToImageStencil.cxx	2022-10-18 16:14:09.815632570 +1100
@@ -63,12 +63,15 @@
 #include "vtkSignedCharArray.h"
 #include "vtkStreamingDemandDrivenPipeline.h"
 
+#include "vtkSMPTools.h"
+
 #include <algorithm>
 #include <map>
 #include <utility>
 #include <vector>
 
 #include <cmath>
+#include <iostream>
 
 vtkStandardNewMacro(vtkPolyDataToImageStencil);
 
@@ -457,7 +460,8 @@
   // the spacing and origin of the generated stencil
   double* spacing = data->GetSpacing();
   double* origin = data->GetOrigin();
-
+  //std::cout << "[" << extent[0] << ", " << extent[1] << ", " << extent[2] << ", "
+  //                 << extent[3] << ", " << extent[4] << ", " << extent[5] << "], threadId " << threadId << std::endl;
   // if we have no data then return
   if (!this->GetInput()->GetNumberOfPoints())
   {
@@ -773,7 +777,13 @@
   data->GetExtent(extent);
   // ThreadedExecute is only called from a single thread for
   // now, but it could as easily be called from ThreadedRequestData
-  this->ThreadedExecute(data, extent, 0);
+  ParallelWorker worker( extent, this, data );
+  // this->ThreadedExecute(data, extent, 0);
+  
+  vtkSMPTools::Initialize();
+  vtkSMPTools::SetBackend("TBB");
+  //std::cout << vtkSMPTools::GetEstimatedNumberOfThreads() << std::endl;
+  vtkSMPTools::For(extent[4], extent[5] + 1, 1, worker );
 
   return 1;
 }

dgobbi · October 18, 2022, 12:26pm

Yup, that’s pretty much what the implementation should look like. To remove the need for friend, put the ParallelWorker class definition inside the protected section of vtkPolyDataToImageStencil.

The initialization of vtkSMPTools and the setting of the backend isn’t done within the class itself, that’s something that is done by the application (it actually isn’t necessary to do it at all, the initialization of vtkSMPTools occurs automatically).

To match VTK’s code style, remove the “m_” from member variables and capitalize them instead, e.g.:

-    int m_extent[6];
-    vtkPolyDataToImageStencil* m_algorithm;
-    vtkImageStencilData* m_data;
+    int Extent[6];
+    vtkPolyDataToImageStencil* Algorithm;
+    vtkImageStencilData* Data;

Instead of the m_ prefix, VTK prefixes member-variable access with this to show which variables are members:

-    m_algorithm->ThreadedExecute(m_data, subExtent, 0);
+    this->Algorithm->ThreadedExecute(this->Data, subExtent, 0);

chrisadamsonmcri · October 19, 2022, 2:24am

Thanks David. I’ll make those changes and then put it in for review.