Why are the vtkm accelerated filters slower than the native vtk ones?

Hello everyone,

I am testing the acceleration effect of VTK-m and encountered a puzzling issue: the VTKm filter (vtkmContour) takes longer to execute than the native VTK filter (vtkContourFilter), except when I use VTK 9.6.0. Only with VTK 9.6.0 does vtkmContour (using the CUDA backend) run faster than vtkContourFilter. I have tested multiple VTK versions (9.2.6, 9.3.1, 9.4.2, 9.5.2, and 9.6.0) and observed the same behavior: only version 9.6.0 shows the expected speedup.

Below is a minimal example that reproduces the issue. The code generates a synthetic dataset (vtkRTAnalyticSource), extracts an isosurface using both filters, and prints the execution times.

#include <iostream>

#include <vtkActor.h>
#include <vtkContourFilter.h>
#include <vtkNamedColors.h>
#include <vtkPolyDataMapper.h>
#include <vtkProperty.h>
#include <vtkRTAnalyticSource.h>
#include <vtkRenderWindow.h>
#include <vtkRenderWindowInteractor.h>
#include <vtkRenderer.h>
#include <vtkSmartPointer.h>
#include <vtkTimerLog.h>
#include <vtkVersion.h>
#include <vtkmContour.h>

#include <vtkm/cont/Initialize.h>
#include <vtkm/cont/RuntimeDeviceInformation.h>
#include <vtkm/cont/DeviceAdapterTag.h>
#include <vtkm/cont/Logging.h>
#include <vtkm/cont/RuntimeDeviceTracker.h>

void compareContour() {
  auto src = vtkSmartPointer<vtkRTAnalyticSource>::New();
  src->SetWholeExtent(-128, 128, -128, 128, -128, 128);
  src->Update();

  // Native VTK contour
  auto t0 = vtkSmartPointer<vtkTimerLog>::New();
  t0->StartTimer();
  auto cf = vtkSmartPointer<vtkContourFilter>::New();
  cf->SetInputConnection(src->GetOutputPort());
  cf->SetValue(0, 200.0);
  cf->Update();
  t0->StopTimer();
  std::cout << "Contour Filter: " << t0->GetElapsedTime() << " s\n";

  // VTKm contour (accelerated)
  auto t1 = vtkSmartPointer<vtkTimerLog>::New();
  t1->StartTimer();
  auto mc = vtkSmartPointer<vtkmContour>::New();
  mc->SetInputConnection(src->GetOutputPort());
  mc->SetValue(0, 200.0);
  mc->SetComputeNormals(true);
  mc->Update();
  t1->StopTimer();
  std::cout << "vtkmContour: " << t1->GetElapsedTime() << " s\n";

  // Visualisation (optional)
  auto colors = vtkSmartPointer<vtkNamedColors>::New();

  auto m1 = vtkSmartPointer<vtkPolyDataMapper>::New();
  m1->SetInputConnection(cf->GetOutputPort());
  auto a1 = vtkSmartPointer<vtkActor>::New();
  a1->SetMapper(m1);
  a1->GetProperty()->SetColor(colors->GetColor3d("Tomato").GetData());
  a1->SetPosition(-150.0, 0.0, 0.0);

  auto m2 = vtkSmartPointer<vtkPolyDataMapper>::New();
  m2->SetInputConnection(mc->GetOutputPort());
  auto a2 = vtkSmartPointer<vtkActor>::New();
  a2->SetMapper(m2);
  a2->GetProperty()->SetColor(colors->GetColor3d("Banana").GetData());
  a2->SetPosition(150.0, 0.0, 0.0);

  auto ren = vtkSmartPointer<vtkRenderer>::New();
  ren->AddActor(a1);
  ren->AddActor(a2);
  ren->SetBackground(colors->GetColor3d("SlateGray").GetData());

  auto win = vtkSmartPointer<vtkRenderWindow>::New();
  win->AddRenderer(ren);
  win->SetSize(1200, 600);

  auto iren = vtkSmartPointer<vtkRenderWindowInteractor>::New();
  iren->SetRenderWindow(win);

  win->Render();
  iren->Start();
}

int main(int argc, char* argv[]) {
  std::cout << "VTK full version: " << vtkVersion::GetVTKVersion() << std::endl;

  auto initResult = vtkm::cont::Initialize(argc, argv, vtkm::cont::InitializeOptions::RequireDevice);
  auto device = initResult.Device;
  vtkm::cont::SetStderrLogLevel(vtkm::cont::LogLevel::Info);

  vtkm::cont::RuntimeDeviceInformation runtimeDevInfo;
  bool hasCuda = runtimeDevInfo.Exists(vtkm::cont::DeviceAdapterTagCuda());

  if (hasCuda) {
    std::cout << "CUDA device detected. The program will attempt to use GPU for rendering." << std::endl;
  } else {
    std::cout << "CUDA device not detected. The program may fall back to OpenMP or Serial backend (CPU)." << std::endl;
  }

  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  try {
    tracker.ForceDevice(vtkm::cont::DeviceAdapterTagCuda());
    std::cout << "Forced VTK-m to use CUDA device." << std::endl;
  } catch (const vtkm::cont::Error& e) {
    std::cout << "Failed to force CUDA device: " << e.what() << ", will use default device." << std::endl;
  }

  compareContour();

  return 0;
}

My CMakeLists.txt is as follows:

cmake_minimum_required(VERSION 4.1)
project(VTKmDemo)

set(CMAKE_CXX_STANDARD 17)

message(STATUS "PATH: $ENV{PATH}")
message(STATUS "LD_LIBRARY_PATH: $ENV{LD_LIBRARY_PATH}")
message(STATUS "CUDACXX: $ENV{CUDACXX}")

set(VTK_DIR "/home/xxx/Downloads/VTK_9_4_2/build")  # I change this path for each VTK version

find_package(VTK REQUIRED)

add_executable(VTKmDemo main.cpp)

target_link_libraries(${PROJECT_NAME} PRIVATE ${VTK_LIBRARIES})

The program arguments: --vtkm-device=cuda or --viskores-device=cuda

Environment:

  • I manually built VTK versions 9.2.6, 9.3.1, 9.4.2, 9.5.2, and 9.6.0 from source with the same configuration (CMake options: VTK_ENABLE_CUDA=ON, VTK_USE_CUDA=ON, VTK_USE_MPI=OFF, etc.).
  • I have a CUDA-capable GPU (NVIDIA) and the driver + toolkit are properly installed.
  • The code forces the CUDA backend, and the detection message confirms CUDA is available.
  • The timing results (only for 9.6.0) show vtkmContour being faster; for all older versions, the native filter is faster.

Question:
What could cause vtkmContour to be slower than vtkContourFilter in VTK versions prior to 9.6.0? Is it a known issue, a change in the API, a bug, or a configuration problem? Does VTK 9.6.0 include critical fixes or improvements for VTK-m integration? Any insights or suggestions would be greatly appreciated.

There are several unknowns here.

For example, are you running with vtkSMPTools enabled, i.e., is VTK_BACKEND_IN_USE something other than Sequential, and how many threads does your test system have? A huge amount of effort has gone into threading VTK with vtkSMPTools over the last few years, and as a result it is not uncommon for the threaded CPU to outperform the GPU, depending on the particulars of the hardware. I believe (and I am not an expert) that part of the reason is the cost of moving data to and from the GPU.
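To get a feel for how much data has to move, here is a minimal sketch that computes the size of the input scalar array from a whole extent. The function names are made up for illustration, and the 4 bytes per value is an assumption (a single float scalar per point); it does not measure any actual transfer.

```cpp
#include <cassert>
#include <cstdint>

// Number of points along one axis of an inclusive extent [lo, hi].
std::int64_t axisPoints(int lo, int hi) {
  assert(hi >= lo);
  return static_cast<std::int64_t>(hi) - lo + 1;
}

// Total point count of a structured grid with the given whole extent.
std::int64_t pointCount(int x0, int x1, int y0, int y1, int z0, int z1) {
  return axisPoints(x0, x1) * axisPoints(y0, y1) * axisPoints(z0, z1);
}

// Bytes occupied by one scalar array. bytesPerValue is an assumption
// about the scalar type (4 for float, 8 for double).
std::int64_t scalarBytes(std::int64_t points, std::int64_t bytesPerValue) {
  assert(points >= 0 && bytesPerValue > 0);
  return points * bytesPerValue;
}
```

For the extent in the original post, pointCount(-128, 128, -128, 128, -128, 128) gives 257^3 = 16,974,593 points, so roughly 68 MB of input at 4 bytes per value, before counting the output geometry that has to come back.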

There is also the provenance of the code being run: both vtkm and VTK have been undergoing continual development, and it’s hard to keep track of what changes may have occurred over the versions you mentioned (without a lot of digging). For example, at some point I believe (the threaded) vtkFlyingEdges3D replaced the old VTK contouring filter (under the hood vtkContourFilter may delegate to an internal filter). This is probably the fastest contouring algorithm that we know of and may help explain part of what you see.

I am also wondering whether the way you interleave the two tests (first the VTK CPU filter, then vtkm) may cause some timing quirks due to cache and similar effects. I would recommend that you first execute one of the tests once (to prime the pump, so to speak), and then average ~6 executions of that same test to get a final timing. Then do the same for the second test. Since I don’t see any timing numbers, it’s also hard to know the variation across tests; it’s not uncommon to see significant variation between parallel executions.
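A minimal sketch of that procedure, written against a generic callable and std::chrono rather than any VTK class: the warm-up runs are executed but not timed, and the standard deviation is reported next to the mean so the run-to-run variation is visible. The names TimingStats, summarize, and benchmark are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>
#include <vector>

struct TimingStats {
  double mean;    // seconds
  double stddev;  // seconds, population standard deviation
};

// Mean and spread of a set of timings.
TimingStats summarize(const std::vector<double>& seconds) {
  assert(!seconds.empty());
  double sum = 0.0;
  for (double s : seconds) sum += s;
  const double mean = sum / seconds.size();
  double sq = 0.0;
  for (double s : seconds) sq += (s - mean) * (s - mean);
  return {mean, std::sqrt(sq / seconds.size())};
}

// Run `work` `warmup` times untimed, then time `runs` executions.
TimingStats benchmark(const std::function<void()>& work, int warmup = 1, int runs = 6) {
  assert(warmup >= 0 && runs > 0);
  for (int i = 0; i < warmup; ++i) work();  // warm-up, excluded from the average
  std::vector<double> seconds;
  seconds.reserve(runs);
  for (int i = 0; i < runs; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
  }
  return summarize(seconds);
}
```

When the callable wraps a VTK filter, it would need to call Modified() on the filter before each Update(); otherwise the pipeline sees an up-to-date filter and the repeated Update() calls return without re-executing, which makes the averaged time look much smaller than a real execution.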


My compilation configuration is as follows: VTK_SMP_IMPLEMENTATION_TYPE:STRING=Sequential

Using your separate timing method:

double benchmarkFilter(vtkAlgorithm* filter, int warmup = 1, int runs = 6) {
    auto timer = vtkSmartPointer<vtkTimerLog>::New();
    std::vector<double> times;
    times.reserve(runs + warmup);
    for (int i = 0; i < warmup; ++i) {
        timer->StartTimer();
        filter->Update();
        timer->StopTimer();
        times.push_back(timer->GetElapsedTime());
    }
    for (int i = 0; i < runs; ++i) {
        timer->StartTimer();
        filter->Update();
        timer->StopTimer();
        times.push_back(timer->GetElapsedTime());
    }
    double sum = std::accumulate(times.begin(), times.end(), 0.0);
    return sum / (runs + warmup);
}

When the data extent of vtkRTAnalyticSource is (-128, 128, -128, 128, -128, 128) :

// src->SetWholeExtent(-128, 128, -128, 128, -128, 128);
  • Contour Filter (avg): 0.037518 s
  • 9.2.6 version: vtkmContour (avg): 0.0388267 s (vtkmContour (0x5555556bd370): VTK-m failed with message: Input dataset/parameters not supported by vtkmContour.)
  • 9.3.1 version: vtkmContour (avg): 0.20358 s
  • 9.4.2 version: vtkmContour (avg): 0.203983 s
  • 9.5.2 version: vtkmContour (avg): 0.20087 s
  • 9.6.0 version: vtkmContour (avg): 0.0228374 s

When the data extent of vtkRTAnalyticSource is (-428, 428, -428, 428, -428, 428)

  • Contour Filter (avg): 1.56816 s
  • 9.2.6 version: vtkmContour (avg): 1.19375 s (vtkmContour (0x5555556bd370): VTK-m failed with message: Input dataset/parameters not supported by vtkmContour.)
  • 9.3.1 version: vtkmContour (avg): 2.1793 s
  • 9.4.2 version: vtkmContour (avg): 2.2501 s
  • 9.5.2 version: vtkmContour (avg): 2.23708 s
  • 9.6.0 version: vtkmContour (avg): 0.0729707 s
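To put the timings above on one scale, a small sketch that expresses each pair as a ratio of native time to vtkmContour time. The helper name speedup is hypothetical, and the ratios use the single "Contour Filter (avg)" figure reported for each extent.

```cpp
#include <cassert>

// Ratio of the native time to the accelerated time.
// Above 1 the accelerated filter was faster; below 1 it was slower.
double speedup(double nativeSeconds, double acceleratedSeconds) {
  assert(nativeSeconds > 0.0 && acceleratedSeconds > 0.0);
  return nativeSeconds / acceleratedSeconds;
}
```

For the larger extent, speedup(1.56816, 0.0729707) is about 21.5 for 9.6.0, while speedup(1.56816, 2.23708) is about 0.70 for 9.5.2. For the smaller extent the same ratios are about 1.64 (9.6.0) and 0.19 (9.5.2).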

I don’t understand what the question is.

Are you asking why vtkmContour got faster, or why vtkContourFilter is not as fast?

First of all, if you want to compare apples to apples, you need to set FastMode to true on vtkContourFilter, which uses flying edges under the hood.

The reason this is off by default can be found in the documentation of the parameter.
vtkmContour uses a flying edges approach by default, so this is how you can compare apples to apples.

Now experiment with both VTK_SMP_IMPLEMENTATION_TYPE=Sequential and VTK_SMP_IMPLEMENTATION_TYPE=TBB.

vtkmContour got faster in 9.6 because it no longer does an expensive copy from Viskores (VTK-m) structures back to VTK, but keeps everything in Viskores structures.

Thanks Spiros, this is what I needed.

Okay, I asked a developer who has more experience than I do in this area. Here’s a portion of his answer:

So the conclusion is that the previous version needed to copy data from vtkm structures back to vtk, which led to longer running times. I got it, thank you very much. For the previous version, do I still need to use vtkmFilter? In my current work, I need to speed up the running time of certain filters, such as vtkContourFilter, vtkThreshold, vtkCutter, etc. (Currently, I am not using SMP, so VTK_SMP_IMPLEMENTATION_TYPE is still the default Sequential).

I would suggest using vtkm*Filter only if you have a GPU around; otherwise the native VTK filters are just as fast or faster on the CPU.

For vtkContourFilter, as I said, use FastModeOn().

As for VTK_SMP_IMPLEMENTATION_TYPE, don’t use Sequential; it’s slow. Use at least VTK_SMP_IMPLEMENTATION_TYPE=STDThread, and if you have OpenMP or TBB, prefer those because they have good load balancing.


OK, thanks.