Why are the vtkm accelerated filters slower than the native vtk ones?

Hello everyone,

I am testing the acceleration effect of VTK-m and encountered a puzzling issue: the VTKm filter (vtkmContour) takes longer to execute than the native VTK filter (vtkContourFilter), except when I use VTK 9.6.0. Only with VTK 9.6.0 does vtkmContour (using the CUDA backend) run faster than vtkContourFilter. I have tested multiple VTK versions (9.2.6, 9.3.1, 9.4.2, 9.5.2, and 9.6.0) and observed the same behavior: only version 9.6.0 shows the expected speedup.

Below is a minimal example that reproduces the issue. The code generates a synthetic dataset (vtkRTAnalyticSource), extracts an isosurface using both filters, and prints the execution times.

#include <vtkActor.h>
#include <vtkContourFilter.h>
#include <vtkNamedColors.h>
#include <vtkPolyDataMapper.h>
#include <vtkProperty.h>
#include <vtkRTAnalyticSource.h>
#include <vtkRenderWindow.h>
#include <vtkRenderWindowInteractor.h>
#include <vtkRenderer.h>
#include <vtkSmartPointer.h>
#include <vtkTimerLog.h>
#include <vtkVersion.h>
#include <vtkmContour.h>

#include <vtkm/cont/Initialize.h>
#include <vtkm/cont/RuntimeDeviceInformation.h>
#include <vtkm/cont/DeviceAdapterTag.h>
#include <vtkm/cont/Logging.h>
#include <vtkm/cont/RuntimeDeviceTracker.h>

#include <iostream>

void compareContour() {
  auto src = vtkSmartPointer<vtkRTAnalyticSource>::New();
  src->SetWholeExtent(-128, 128, -128, 128, -128, 128);
  src->Update();

  // Native VTK contour
  auto t0 = vtkSmartPointer<vtkTimerLog>::New();
  t0->StartTimer();
  auto cf = vtkSmartPointer<vtkContourFilter>::New();
  cf->SetInputConnection(src->GetOutputPort());
  cf->SetValue(0, 200.0);
  cf->Update();
  t0->StopTimer();
  std::cout << "Contour Filter: " << t0->GetElapsedTime() << " s\n";

  // VTKm contour (accelerated)
  auto t1 = vtkSmartPointer<vtkTimerLog>::New();
  t1->StartTimer();
  auto mc = vtkSmartPointer<vtkmContour>::New();
  mc->SetInputConnection(src->GetOutputPort());
  mc->SetValue(0, 200.0);
  mc->SetComputeNormals(true);
  mc->Update();
  t1->StopTimer();
  std::cout << "vtkmContour: " << t1->GetElapsedTime() << " s\n";

  // Visualisation (optional)
  auto colors = vtkSmartPointer<vtkNamedColors>::New();

  auto m1 = vtkSmartPointer<vtkPolyDataMapper>::New();
  m1->SetInputConnection(cf->GetOutputPort());
  auto a1 = vtkSmartPointer<vtkActor>::New();
  a1->SetMapper(m1);
  a1->GetProperty()->SetColor(colors->GetColor3d("Tomato").GetData());
  a1->SetPosition(-150.0, 0.0, 0.0);

  auto m2 = vtkSmartPointer<vtkPolyDataMapper>::New();
  m2->SetInputConnection(mc->GetOutputPort());
  auto a2 = vtkSmartPointer<vtkActor>::New();
  a2->SetMapper(m2);
  a2->GetProperty()->SetColor(colors->GetColor3d("Banana").GetData());
  a2->SetPosition(150.0, 0.0, 0.0);

  auto ren = vtkSmartPointer<vtkRenderer>::New();
  ren->AddActor(a1);
  ren->AddActor(a2);
  ren->SetBackground(colors->GetColor3d("SlateGray").GetData());

  auto win = vtkSmartPointer<vtkRenderWindow>::New();
  win->AddRenderer(ren);
  win->SetSize(1200, 600);

  auto iren = vtkSmartPointer<vtkRenderWindowInteractor>::New();
  iren->SetRenderWindow(win);

  win->Render();
  iren->Start();
}

int main(int argc, char* argv[]) {
  std::cout << "VTK full version: " << vtkVersion::GetVTKVersion() << std::endl;

  auto initResult = vtkm::cont::Initialize(argc, argv, vtkm::cont::InitializeOptions::RequireDevice);
  auto device = initResult.Device;
  vtkm::cont::SetStderrLogLevel(vtkm::cont::LogLevel::Info);

  vtkm::cont::RuntimeDeviceInformation runtimeDevInfo;
  bool hasCuda = runtimeDevInfo.Exists(vtkm::cont::DeviceAdapterTagCuda());

  if (hasCuda) {
    std::cout << "CUDA device detected. The program will attempt to use GPU for rendering." << std::endl;
  } else {
    std::cout << "CUDA device not detected. The program may fall back to OpenMP or Serial backend (CPU)." << std::endl;
  }

  auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
  try {
    tracker.ForceDevice(vtkm::cont::DeviceAdapterTagCuda());
    std::cout << "Forced VTK-m to use CUDA device." << std::endl;
  } catch (const vtkm::cont::Error& e) {
    std::cout << "Failed to force CUDA device: " << e.what() << ", will use default device." << std::endl;
  }

  compareContour();

  return 0;
}

My CMakeLists.txt is as follows:

cmake_minimum_required(VERSION 4.1)
project(VTKmDemo)

set(CMAKE_CXX_STANDARD 17)

message(STATUS "PATH: $ENV{PATH}")
message(STATUS "LD_LIBRARY_PATH: $ENV{LD_LIBRARY_PATH}")
message(STATUS "CUDACXX: $ENV{CUDACXX}")

set(VTK_DIR "/home/xxx/Downloads/VTK_9_4_2/build")  # I change this path for each VTK version

find_package(VTK REQUIRED)

add_executable(VTKmDemo main.cpp)

target_link_libraries(${PROJECT_NAME} PRIVATE ${VTK_LIBRARIES})

The program arguments: --vtkm-device=cuda or --viskores-device=cuda

Environment:

  • I manually built VTK versions 9.2.6, 9.3.1, 9.4.2, 9.5.2, and 9.6.0 from source with the same configuration (CMake options: VTK_ENABLE_CUDA=ON, VTK_USE_CUDA=ON, VTK_USE_MPI=OFF, etc.).
  • I have a CUDA-capable GPU (NVIDIA) and the driver + toolkit are properly installed.
  • The code forces the CUDA backend, and the detection message confirms CUDA is available.
  • The timing results (only for 9.6.0) show vtkmContour being faster; for all older versions, the native filter is faster.

Question:
What could cause vtkmContour to be slower than vtkContourFilter in VTK versions prior to 9.6.0? Is it a known issue, a change in the API, a bug, or a configuration problem? Does VTK 9.6.0 include critical fixes or improvements for VTK-m integration? Any insights or suggestions would be greatly appreciated.

There are several unknowns here.

For example, are you running with vtkSMPTools enabled, i.e., is VTK_BACKEND_IN_USE something other than Sequential, and how many threads does your test system have? There has been a huge amount of effort in the last few years to thread VTK with vtkSMPTools, and as a result it is not uncommon for threaded CPU code to outperform the GPU, depending on the particulars of the hardware. I believe (and I am not an expert) that part of the reason is the movement of data to/from the GPU.
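To put a rough number on the data movement, here is a small back-of-the-envelope sketch. It assumes one 4-byte float scalar per point (which is what I would expect vtkRTAnalyticSource to produce, but please check), and the bandwidth you pass in is purely a placeholder for whatever your bus actually sustains. The helper names are made up for illustration and are not part of VTK or VTK-m.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of sample points along one axis of an inclusive extent [lo, hi].
inline std::uint64_t pointsAlongAxis(int lo, int hi) {
  assert(hi >= lo);
  return static_cast<std::uint64_t>(hi - lo) + 1;
}

// Bytes needed to store one scalar per point over a 3D inclusive extent
// given as {xmin, xmax, ymin, ymax, zmin, zmax}.
inline std::uint64_t scalarBytes(const int extent[6], std::size_t bytesPerScalar) {
  return pointsAlongAxis(extent[0], extent[1]) *
         pointsAlongAxis(extent[2], extent[3]) *
         pointsAlongAxis(extent[4], extent[5]) *
         static_cast<std::uint64_t>(bytesPerScalar);
}

// One-way copy time in milliseconds for an ASSUMED sustained bandwidth
// (bytes per second). This ignores latency and any driver start-up cost.
inline double transferMilliseconds(std::uint64_t bytes, double bytesPerSecond) {
  assert(bytesPerSecond > 0.0);
  return 1000.0 * static_cast<double>(bytes) / bytesPerSecond;
}
```

For the extent in the question (-128 to 128 on each axis) this gives 257^3 = 16,974,593 points, or about 67.9 MB of float scalars going to the device. With an assumed 12 GB/s that is under 6 ms one way, and the output triangles and normals still have to come back. Comparing that estimate against your measured gap should tell you whether the transfer is a plausible explanation or whether something else dominates.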

There is also the provenance of the code being run: both vtkm and VTK have been undergoing continual development, and it’s hard to keep track of what changes may have occurred over the versions you mentioned (without a lot of digging). For example, at some point I believe (the threaded) vtkFlyingEdges3D replaced the old VTK contouring filter (under the hood vtkContourFilter may delegate to an internal filter). This is probably the fastest contouring algorithm that we know of and may help explain part of what you see.

I am also wondering whether the way you interleave the two tests (first the VTK CPU filter, then the vtkm one) may cause some timing quirks (due to cache effects, etc.). I would recommend that you first execute one of the tests once (to prime the pump, so to speak), and then average ~6 executions of that same test to get a final timing. Then do the same for the second test. Since I don’t see any timing numbers, it’s also hard to know the variation across tests; it’s not uncommon to see significant variations between parallel executions.
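The warm-up-then-average procedure could be sketched roughly as follows. This is only an illustration using std::chrono; TimingStats and timeRepeated are names I made up, and the workload is whatever callable you hand in.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Summary of the timed runs, all in seconds.
struct TimingStats {
  double mean;
  double min;
  double max;
  double stddev;  // population standard deviation of the timed runs
};

// Runs `work` warmupRuns times without timing it, then times it timedRuns
// times and summarises the samples.
template <typename Work>
TimingStats timeRepeated(Work&& work, int warmupRuns, int timedRuns) {
  assert(warmupRuns >= 0 && timedRuns > 0);
  for (int i = 0; i < warmupRuns; ++i) {
    work();  // untimed: primes caches, lazy initialisation, etc.
  }

  std::vector<double> samples;
  samples.reserve(static_cast<std::size_t>(timedRuns));
  for (int i = 0; i < timedRuns; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    samples.push_back(std::chrono::duration<double>(stop - start).count());
  }

  double sum = 0.0;
  for (double s : samples) sum += s;
  const double mean = sum / static_cast<double>(samples.size());

  double squaredDiffs = 0.0;
  for (double s : samples) squaredDiffs += (s - mean) * (s - mean);

  TimingStats stats;
  stats.mean = mean;
  stats.min = *std::min_element(samples.begin(), samples.end());
  stats.max = *std::max_element(samples.begin(), samples.end());
  stats.stddev = std::sqrt(squaredDiffs / static_cast<double>(samples.size()));
  return stats;
}
```

In your program the workload for the native filter would be something like [&] { cf->Modified(); cf->Update(); }, and likewise for mc. The Modified() call matters: a VTK filter that is already up to date does not re-execute when Update() is called again, so without it the repeated runs would measure almost nothing. Reporting min, max and standard deviation alongside the mean also shows how much the runs vary.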