VTKSMP stdthread performance

I’m attempting to generate a bunch of slices from a single polydata using vtkCutter. While the operation is parallelized through the vtkSMP API, it doesn’t run significantly faster than single-threaded execution (i.e. with the SMP backend set to Sequential). This appears to be due to repeatedly fetching the thread-local storage containers in part of the vtkPolyDataPlaneCutter algorithm.

The initial pipeline looks something like:

void createSliceAtX(double x … )
{
  // "surf" is an actor which has already been created, and whose data we are slicing
  vtkPolyData* surfaceData = vtkPolyData::SafeDownCast(surf->GetMapper()->GetInput());

  // Create a vtkPlane at the x location and cut the surface with it
  VTKNEW(vtkPlane, plane);
  plane->SetOrigin(x, 0.0, 0.0);
  plane->SetNormal(1.0, 0.0, 0.0);

  vtkSmartPointer<vtkCutter> cutter = vtkSmartPointer<vtkCutter>::New();
  cutter->SetInputData(surfaceData);
  cutter->SetCutFunction(plane);
  cutter->Update();
  // … (processing the outputs)
}

And this is where it’s spending most of its time:

I can verify that this is not a profiling artifact: if I set the SMP backend to Sequential and instead thread the “createSliceAtX” function myself, I see roughly a ~16x performance improvement (using 16 cores)… however, the result is unstable, presumably because vtkCutter on its own isn’t thread-safe?

Perhaps the thread ID could be cached? Or replaced with std::this_thread::get_id(), which is typically fast? I took a quick look at the current state on VTK’s GitLab and it doesn’t look like anything has changed recently, but if there’s already a PR or issue for this somewhere, please point me toward it!

Happy to provide more info if that would be helpful.

I’m using VTK 9.3.1 compiled with MSVC 17 (Visual Studio 2022) on Windows 11.

Thanks for your report, it’s always good to have profiling of these algorithms.

Glancing at the code for the thread worker’s operator() in vtkPolyDataPlaneCutter, I’m really not sure why the calls to .Local() are done inside the loop over ptId. It certainly looks like these calls could be moved outside of the loop, and that would avoid the per-point locking.

+    unsigned char abovePlane = 0;
+    unsigned char belowPlane = 0;
     for (; ptId < endPtId; ++ptId)
     {
       ...

       // Points above the plane are marked 1; 0 otherwise
       auto val = vtkPlane::Evaluate(n, o, p);
       if (val > 0.0)
       {
         this->PtMap[ptId] = 1;
-        this->AbovePlane.Local() = 1;
+        abovePlane = 1;
       }
       else
       {
         this->PtMap[ptId] = 0;
-        this->BelowPlane.Local() = 1;
+        belowPlane = 1;
       }
     }
+    this->AbovePlane.Local() |= abovePlane;
+    this->BelowPlane.Local() |= belowPlane;

Edit: fixed logic mistake (changed &= to |=).

Thanks for the suggestion! It does seem to make the expected positive difference:

so that the time spent in .Local() is ~2x that of GetSingleThread(), both of which use GetCallerThreadData()

Unfortunately, the change also seems to introduce a correctness problem where the computed intersection points don’t seem correct in some places. I have a suspicion this has something to do with my environment… I see you already created a merge request (thank you!!), so I’ll try again with a fresh build once that’s merged into master.

I think it might also be worth modifying the STDThread backend’s GetStorage() implementation to be lock-free. I tried the following, which should be portable since std::hash always returns size_t and ThreadIdType is also defined as size_t. I’m not sure if anything bad would happen if the hash and the thread ID used the same value, but that’s another potential simplification. (Not sure how to get the nice diff format you used :slight_smile: )

StoragePointerType& ThreadSpecific::GetStorage()
{
  auto stdId = std::this_thread::get_id();
  std::hash<std::thread::id> hasher;
  ThreadIdType threadId = hasher(stdId);
  //ThreadIdType threadId = GetThreadId();
  size_t hash = GetHash(threadId);
  ...

It would be more convenient if std::thread::id were directly convertible to an integer, but I’ll assume the standards folks had a reason for not allowing that. This definitely results in some performance improvements (without your suggested change to vtkPolyDataPlaneCutter):

It looks to be ~97% faster.

With your change included, the difference is smaller, presumably because there’s less contention on the mutex.

But since there are a number of places where vtkSMPTools is used, it may still be worthwhile?

Can you verify on your side whether these really are problems in the code, versus issues in your environment? Code that has a chance of being incorrect isn’t merged for the sake of testing; that’s putting the cart before the horse.

For the changes to GetStorage(), it would be great if it could be lock-free, but I’ll have to defer to others who are more familiar with the STDThread implementation. I know that it took a lot of time and effort to make the STDThread backend robust on all platforms, so changes can’t be made lightly.

Starting from a clean copy of VTK 9.3.1, I can confirm that your patch does not cause any correctness issues, at least in my use case.

It looks like the thread ID used to be calculated as std::hash<std::thread::id>{}(std::this_thread::get_id()); but was changed to the current implementation ~2 years ago when the STDThread implementation was re-worked to use a consistent thread pool… that’s probably an indication that the change was intentional, though I don’t see any discussion around that particular change.

Thanks for doing the check. I looked through the VTK codebase and there are around a dozen other classes where this GetStorage() issue is likely to be severely hurting STDThread performance.

I wonder how TBB does its thread-local storage. With the TBB backend, calling Local() doesn’t seem to slow things down at all (or at least, not significantly). TBB (i.e. oneTBB) is open source, but I’ve never looked through their code.

Thank you for both quickly identifying a solution and taking steps to integrate it across the codebase!

TBB is a bit complicated depending on which platform it’s on. You can see the full details here, but tracing back the relevant bits for VTK’s usage:

From Common/Core/SMP/TBB/vtkSMPThreadLocalImpl.h:
T& Local() override { return this->Internal.local(); }
where this->Internal is declared as tbb::enumerable_thread_specific<T> Internal;

Jumping to tbb/enumerable_thread_specific.h, their local() implementation:

    reference local(bool& exists)  {
        void* ptr = this->table_lookup(exists);
        return *(T*)ptr;
    }

Past this there are two options. The first, which VTK will use by default (at least with the current TBB codebase), delegates to the platform’s native thread-local functions, using either Win32 calls or pthreads:

    void* table_lookup( bool& exists ) {
        void* found = get_tls();
        if( found ) {
            exists=true;
        } else {
            found = super::table_lookup(exists);
            set_tls(found);
        }
        return found;
    }
#if _WIN32||_WIN64
#if __TBB_WIN8UI_SUPPORT
    using tls_key_t = DWORD;
    void set_tls(void * value) { FlsSetValue(my_key, (LPVOID)value); }
    void* get_tls() { return (void *)FlsGetValue(my_key); }
#else
    using tls_key_t = DWORD;
    void set_tls(void * value) { TlsSetValue(my_key, (LPVOID)value); }
    void* get_tls() { return (void *)TlsGetValue(my_key); }
#endif
#else
    using tls_key_t = pthread_key_t;
    void set_tls( void * value ) const { pthread_setspecific(my_key, value); }
    void* get_tls() const { return pthread_getspecific(my_key); }
#endif

It’s hard to tell exactly what the Win32 APIs are actually doing, but their docs on TlsGetValue() say “TlsGetValue was implemented with speed as the primary goal”, so it probably doesn’t lock anything. I’m not sure what VTK’s stance is on using direct OS calls (or Win32 specifically), but perhaps doing something similar is an option?

They do have a non-delegated backend which is similar to what VTK is doing, but it uses std::thread::id directly for table lookups.

table_lookup() is implemented as

void* ets_base<ETS_key_type>::table_lookup( bool& exists ) {
    const key_type k = ets_key_selector<ETS_key_type>::current_key();

    __TBB_ASSERT(k != key_type(), nullptr);
    void* found;
    std::size_t h = std::hash<key_type>{}(k);
    ...  

and the current_key() is derived from

template <std::size_t ThreadIDSize>
struct internal_ets_key_selector {
    using key_type = std::thread::id;
    static key_type current_key() {
        return std::this_thread::get_id();
    }
};