Best practice for rapidly changing vtkImageData?

harold.absalom · July 10, 2020, 6:43am

I’m writing an application which takes a 3D vtkImageData* from an external source, and then displays a number of slices (at least ten) from that data. The external data is updating at a fairly rapid rate (around 30 times a second), and I would like to try and keep my slices up to date with that, but I’m having difficulty rendering that fast.

What are the best practices for handling this kind of short-lifetime, rapidly changing data, in VTK?

dgobbi · July 10, 2020, 2:35pm

The description of your display is too vague. Are you displaying the N slices side-by-side, like a lightbox view? Are you viewing the N slices as a slab, e.g. with volume rendering, or as a MIP? Are you reslicing the data to do a multi-planar view?

The approach that I would recommend is creating your own vtkImageData object and getting a pointer to its internal buffer as shown in this reply. Then you can use e.g. memcpy() or a similarly efficient mechanism to copy slices from your external source into the vtkImageData.
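
As a rough illustration of the copy step, here is a minimal sketch in plain C++ with no VTK dependency. `FrameVolume`, `CopySliceIn`, and `CopyFrameIn` are hypothetical names standing in for a preallocated image, and the 16-bit voxel type is an assumption. With a real vtkImageData, the destination pointer would come from `GetScalarPointer()`, and you would call `Modified()` on the image after each copy so that downstream filters and mappers notice the change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a preallocated image: dimensions plus one
// contiguous buffer, stored x-fastest, then y, then z (the layout VTK uses).
struct FrameVolume
{
  int nx, ny, nz;
  std::vector<unsigned short> voxels;

  FrameVolume(int x, int y, int z)
    : nx(x), ny(y), nz(z), voxels(static_cast<std::size_t>(x) * y * z, 0)
  {
  }

  // Number of voxels in one XY slice.
  std::size_t SliceSize() const { return static_cast<std::size_t>(nx) * ny; }
};

// Copy one XY slice from the external source into slice k of the volume.
void CopySliceIn(FrameVolume& vol, int k, const unsigned short* src)
{
  assert(k >= 0 && k < vol.nz);
  std::memcpy(vol.voxels.data() + k * vol.SliceSize(), src,
              vol.SliceSize() * sizeof(unsigned short));
}

// Copy a whole frame in one call when the source buffer is contiguous.
void CopyFrameIn(FrameVolume& vol, const unsigned short* src)
{
  std::memcpy(vol.voxels.data(), src,
              vol.voxels.size() * sizeof(unsigned short));
}
```

The buffer is allocated once and reused for every frame, so the per-frame cost is one memcpy per slice (or one per volume) rather than a new allocation.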

Alternatively, if you have direct access to the buffer where your external source stores its data, and if that buffer is contiguous, then it is possible to create a vtkImageData object that directly “wraps” that buffer and makes it available to the VTK pipeline.

harold.absalom · July 13, 2020, 12:40am

Thank you for replying.
The slices are taken as the XY planes at five points equally spaced along the Z axis, and similarly for five equidistant points along the X axis (for YZ planes), as well as a MIP along the Y axis. I’m not sure if it’s relevant, but I’m also displaying the whole volume (as a vtkVolume), which seems to render quite quickly.
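
To make the per-frame work concrete, here is a minimal sketch in plain C++ (no VTK) of the two computations described: picking equally spaced slice indices and taking a MIP along Y. `EquallySpacedSlices` and `MipAlongY` are hypothetical helpers. The placement of the slices (interior positions that divide the axis into equal parts) and the 16-bit voxel type are assumptions, since neither is stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of `count` slices equally spaced along an axis of length n.
// Assumption: the slices sit at interior positions that divide the axis
// into count + 1 equal parts.
std::vector<int> EquallySpacedSlices(int n, int count)
{
  assert(n > 0 && count > 0);
  std::vector<int> indices;
  for (int i = 1; i <= count; ++i)
  {
    indices.push_back(
      static_cast<int>(static_cast<long long>(i) * n / (count + 1)));
  }
  return indices;
}

// Maximum intensity projection along Y of an x-fastest volume.
// The result is an nx-by-nz image, also stored x-fastest.
std::vector<unsigned short> MipAlongY(
  const std::vector<unsigned short>& vol, int nx, int ny, int nz)
{
  assert(vol.size() == static_cast<std::size_t>(nx) * ny * nz);
  std::vector<unsigned short> mip(static_cast<std::size_t>(nx) * nz, 0);
  for (int z = 0; z < nz; ++z)
  {
    for (int y = 0; y < ny; ++y)
    {
      for (int x = 0; x < nx; ++x)
      {
        const std::size_t src =
          (static_cast<std::size_t>(z) * ny + y) * nx + x;
        const std::size_t dst = static_cast<std::size_t>(z) * nx + x;
        mip[dst] = std::max(mip[dst], vol[src]);
      }
    }
  }
  return mip;
}
```

Note that an XY slice is one contiguous block in this layout, while a YZ slice and the Y projection both touch memory across the whole volume, so the three kinds of view do not cost the same on the CPU.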

Before I connect this up to that external data source, I’m trying to get the timing down on the rendering part. Right now, I’m just prototyping this by loading a 4D file and slicing 3D volumes out of it to pass to the rendering section of the program. I can get this full pipeline displaying volumes at the required rate, but once I start adding in slices of those volumes, my framerate plummets. So my question is more about whether there’s a way to make multiple reslice filters work together more efficiently. My guess is that some unnecessary copying/caching of data is going on, since the pipeline is only being executed once for each instance of the vtkImageData.

I was doing all of my slicing with vtkImageReslice, but have also tried vtkImageSliceMapper, which (to my understanding) should run on the GPU. I have a decent GPU on my machine, which I can see is still being underutilised by my pipeline. While I would still like to take advantage of the extra features offered by vtkImageReslice (thick slices, for instance), my first priority is getting this running as fast as possible.

Hopefully this information should give you a little more insight into my problem.

dgobbi · July 13, 2020, 1:29am

The vtkImageSliceMapper only loads one slice onto the GPU at a time, i.e. it loads the slice that it is displaying. The way that VTK data streaming works is that the mapper will request that the pipeline updates the slice (or slab) that is needed for display. Problems can occur if multiple mappers (or, similarly, multiple instances of vtkImageReslice) request updates of different slabs (or slices) from the upstream pipeline. This can cause unnecessary re-execution of the upstream filters. A way to avoid this is to perform an Update() on the upstream pipeline yourself, before the individual mappers do their updates.
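
The re-execution problem can be illustrated with a toy model in plain C++. This is not VTK's actual executive: `ToySource` is a hypothetical class that remembers only the z-range it last produced and re-executes whenever a request falls outside that range, which is a simplification of the behaviour described above.

```cpp
#include <cassert>

// Toy model of a demand-driven source (not VTK's actual executive).
// It remembers the z-range it last produced and re-executes only when
// a request falls outside that range.
class ToySource
{
public:
  explicit ToySource(int nz) : Nz(nz) {}

  // A consumer asks for slices [zMin, zMax]. Returns true if the source
  // had to re-execute to satisfy the request.
  bool Request(int zMin, int zMax)
  {
    assert(0 <= zMin && zMin <= zMax && zMax < Nz);
    if (zMin >= CachedMin && zMax <= CachedMax)
    {
      return false; // already covered by the cached output
    }
    CachedMin = zMin;
    CachedMax = zMax;
    ++Executions;
    return true;
  }

  // Analogue of updating the whole extent up front.
  void UpdateWholeExtent() { Request(0, Nz - 1); }

  // New input data arrived, so the cached output is stale.
  void Invalidate()
  {
    CachedMin = 0;
    CachedMax = -1;
  }

  int GetExecutions() const { return Executions; }

private:
  int Nz;
  int CachedMin = 0;
  int CachedMax = -1; // empty range: nothing cached yet
  int Executions = 0;
};
```

In this model, three consumers that each request a different single slice trigger three executions per frame, whereas calling `UpdateWholeExtent()` first lets all three be served from the one cached result.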

Have you run your code through a profiler to see if any VTK methods are being called more often than you would expect? Or, if you are familiar with VTK observers, you can observe the StartEvent on filters or mappers to see if they are executing unexpectedly.

My main concern is “loading a 4D file, and slicing 3D volumes out of it”. If you’re doing that part with VTK readers and filters, and if something is causing any of them to re-execute when they shouldn’t, then efficiency will fall through the floor.

harold.absalom · July 13, 2020, 2:43am

Does this mean that vtkImageSliceMapper still does the actual slicing on the CPU?

I’ve added observers to the StartEvent of the vtkImageSliceMapper, and also tried it on my old code with vtkImageReslice. Both appear to be executing once per frame per filter (that is, for the ten slices, I see ten StartEvents in total per frame).

Profiling has been a little tricky, because this all ends up getting rendered onto a WinForms Control. I know it’s well outside the scope of this forum, but if you have any hints on the best way to profile mixed managed/unmanaged C++, I would be happy to hear them. The best I’ve been able to do is to isolate parts of the program, and then take the average time to run that part in a loop.

I do not believe the 4D -> 3D translation is happening more often than expected. In an attempt to emulate the final pipeline (where that part won’t exist), I was outputting a vtkImageData* from there, and then using SetInputData() on my various vtkImageSliceMappers/vtkImageReslices. I was concerned that that might lead to the different filters not sharing the data properly, so I have changed that to a vtkPassThroughFilter. The 4D -> 3D part of the pipeline is calling SetInputData() on the vtkPassThroughFilter, which is then being connected to the slicing filters. I’ve observed the StartEvent of this filter, too (both before and after adding a call to Update(), before any of the slices access it), and it also appears to only be running once per frame.

lassoan · July 13, 2020, 1:35pm

In 3D Slicer, we do a lot of 2D+t and 3D+t image sequence visualization in slice views and volume rendering, using the approach that @dgobbi described above (load the entire sequence in host memory, set up the visualization pipeline, and just update the image voxels using memcpy).

If you choose this technique, make sure you use TBB, because the overhead of creating threads to extract a single slice with the image reslice filter is enormous. The improvement is particularly dramatic if you use a CPU with many cores and a discrete NVIDIA GPU (probably because creating dozens of processing threads per second confuses NVIDIA’s threaded optimization heuristics).

On a desktop PC, for a 256^3 volume, we can reslice and display dozens of slices at 30fps (our view refresh rate). If we add rendering of the volume in one view, the rate drops to 26fps; if we also render it in a second view, we can update all views at about 23fps.

We chose to use this approach due to its simplicity and flexibility. However, we did some feasibility tests and confirmed that we can get 100+ fps volume rendering by uploading the entire 4D volume to the GPU and using one of these methods:

- “filmstrip” technique: use a single actor, concatenate all 3D volumes into one large 3D volume along a chosen axis, set up clipping planes to show only a single volume, and switch between time points by changing the origin of the actor
- multi-actor technique: add an actor for each 3D volume and switch between time points by changing visibility of actors (always show only one actor at a time)
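
The bookkeeping behind the filmstrip technique can be sketched in plain C++ as follows. `Filmstrip`, `FrameVoxelOffset`, and `ActorShiftZ` are hypothetical names. Stacking along Z, keeping the clipping planes fixed in world coordinates, and the sign of the shift are all assumptions made for illustration, not a description of the 3D Slicer tests.

```cpp
#include <cassert>
#include <cstddef>

// Geometry of a hypothetical "filmstrip": numFrames volumes of
// nx * ny * nz voxels stacked along Z into one tall volume.
struct Filmstrip
{
  int nx, ny, nz;  // dimensions of a single 3D volume
  int numFrames;   // number of time points
  double spacingZ; // voxel spacing along the stacking axis
};

// Voxel offset at which time point t starts in the concatenated buffer.
// With an x-fastest layout, stacking along Z keeps each volume contiguous.
std::size_t FrameVoxelOffset(const Filmstrip& f, int t)
{
  assert(t >= 0 && t < f.numFrames);
  return static_cast<std::size_t>(t) * f.nx * f.ny * f.nz;
}

// Z shift to apply to the actor so that time point t lands in the fixed
// window [0, nz * spacingZ] left visible by the clipping planes.
double ActorShiftZ(const Filmstrip& f, int t)
{
  assert(t >= 0 && t < f.numFrames);
  return -static_cast<double>(t) * f.nz * f.spacingZ;
}
```

Switching time points then changes only one number per frame, and no voxel data moves after the initial upload, which is consistent with the high frame rates reported.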

When we showed these techniques to clinicians, rendering was so fast that they asked us to please slow it down. This was the first time I ever heard clinicians complaining about volume rendering being too fast.

Yes, extraction of a slice is done on the CPU and only the necessary slice is sent over to the GPU. This may be faster than transferring the entire volume to the GPU at each time point. However, if you need to do volume rendering, then you need to transfer the volume anyway, so it would be faster to reslice on the GPU. We created a set of VTK classes that allow running part of the display pipeline on the GPU, but there was not much interest from the VTK community, so we did not invest much more in this idea.

Profiling of this is hard, even in a plain C++ environment. We see a lot of time spent in various threads of the graphics driver and in system calls. There is no obvious bottleneck in VTK.

dgobbi · July 13, 2020, 4:01pm

The vtkImageSliceMapper has not been updated for TBB; it still relies on the old vtkMultiThreader for colormapping and window/level. So if the image is small, the cost of thread creation might outweigh the benefit of SMP. The vtkImageSliceMapper::SetNumberOfThreads(n) method can be used to restrict the number of threads that are created, and n=1 will disable the multithreading.

lassoan · July 13, 2020, 4:18pm

Does the vtkImageReslice filter take advantage of the new SMP backends? (We use it for reslicing, and display the resulting texture using a standard polydata mapper.)

dgobbi · July 13, 2020, 4:28pm

Yes, vtkImageReslice will take advantage of TBB, as will all imaging filters based on vtkThreadedImageAlgorithm. If you are going the vtkTexture route, consider using vtkImageResliceToColors to generate color scalars for vtkTexture.

Michael · July 15, 2020, 3:16pm

You can try the new vtkVolumeMapper::SetBlendModeToSlice to generate slices directly on the GPU and modify the plane at 100+ fps. It works with any plane (not only axis-aligned planes).
You just have to modify the plane with vtkVolumeProperty::SetSliceFunction.
The drawback is that the entire volume is stored in video memory, so depending on the volume size, this might be an issue.
Also, the slices are not generated in main CPU memory, so you cannot apply further filters to them; this can only be used for display purposes.

You can see the feature here: https://gitlab.kitware.com/paraview/paraview/-/merge_requests/3652
An example can be found in the VTK test: Rendering/VolumeOpenGL2/Testing/Cxx/TestGPURayCastSlicePlane.cxx

harold.absalom · July 16, 2020, 6:02am

Wow, there are a lot of great ideas here! Since the volume data I’m working with is so small, @dgobbi’s suggestion to disable multithreading did wonders, and was very easy to implement. If I need to do this with larger images in the future, though, I’m glad to know there are a lot of other ways to make this fast.
Thank you, everyone, for your help.