Would 3x3 matrix multiplication be faster on vtk using Laderman's algorithm?

mau_igna_06 · November 19, 2022, 9:20pm

Here is a related Stack Overflow question:

Please look at the second answer.

Here is the current implementation in VTK:

https://gitlab.kitware.com/vtk/vtk/-/blob/master/Common/Core/vtkMath.cxx#L1610

Thank you for the answer

lassoan · November 21, 2022, 2:33pm

In general, source code needs to be optimized for readability first. If you can pinpoint a performance bottleneck, then it may be reasonable to optimize for performance, which usually makes the code less readable; if the performance improvement is perceptible, it may be worth it. Therefore, to answer this question we would need to see:

1) source code diff of the old and the proposed new implementation, to assess the impact on readability
2) performance profiling results, to see if the improvement may justify the regression in readability
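
As a starting point for the profiling part, here is a minimal timing sketch. It is not VTK code: `multiplyNaive3x3` and `nanosecondsPerCall` are hypothetical names, and the naive routine only mirrors the row-major `double[3][3]` layout of a textbook 3x3 product. A candidate implementation (for example a Laderman-style one) with the same signature could be passed to the same harness. Micro-benchmarks like this are sensitive to compiler flags and hardware, so treat the numbers as a rough hint and confirm with a profile of a real workload.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for a textbook 3x3 product (27 multiplications).
// C must not alias A or B, because C is written while A and B are still read.
void multiplyNaive3x3(const double A[3][3], const double B[3][3], double C[3][3])
{
  for (int i = 0; i < 3; ++i)
  {
    for (int j = 0; j < 3; ++j)
    {
      C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j] + A[i][2] * B[2][j];
    }
  }
}

// Returns the average wall-clock time per call, in nanoseconds.
// Any callable taking (A, B, C) like the function above can be timed.
template <typename MultiplyFn>
double nanosecondsPerCall(MultiplyFn multiply, std::size_t iterations)
{
  assert(iterations > 0);

  double A[3][3] = { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 10 } };
  double B[3][3] = { { 1, 0, 0 }, { 0, 1, 0 }, { 0, 0, 1 } };
  double C[3][3];

  // Writing to a volatile keeps the compiler from discarding the results.
  volatile double sink = 0.0;

  const auto start = std::chrono::steady_clock::now();
  for (std::size_t n = 0; n < iterations; ++n)
  {
    // Vary one input per iteration so the product is not loop-invariant.
    A[0][0] = static_cast<double>(n);
    multiply(A, B, C);
    sink = sink + C[0][0];
  }
  const auto stop = std::chrono::steady_clock::now();

  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `nanosecondsPerCall(multiplyNaive3x3, 10000000)` and the same for the candidate, on each compiler and platform of interest, gives the kind of comparison asked for above.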

Very often “clever” code runs faster in some environments and slower in others, as seems to be the case with this one, too (see some of the answers in the stackoverflow discussion above). So, evaluating performance impact can be a lot of effort. In the end you may need to add a switch that allows selecting the best implementation for a specific hardware/software environment (see for example the Optimization flag of vtkImageReslice), which further complicates everything.

For all these reasons, performance optimization is rarely if ever driven by availability of clever algorithms, but by performance profiling of important real-world use cases. Performance profiling will also help in determining if improving matrix multiplication speed is really your best option. Most likely you will find that you can achieve much better performance for that particular use case by avoiding the need for those matrix multiplications by caching, reorganizing the code, etc.
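
To illustrate the point about avoiding the multiplications altogether, here is a small sketch under assumed conditions: many points are transformed by the product of two fixed matrices. The types `Mat3` and `Vec3` and all function names are hypothetical and not part of VTK. The first version recomputes the product for every point; the second computes it once outside the loop, which removes almost all matrix-matrix multiplications regardless of how fast each one is.

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>; // row-major: M[row][column]

// Textbook 3x3 matrix product.
Mat3 multiply(const Mat3& A, const Mat3& B)
{
  Mat3 C{};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j] + A[i][2] * B[2][j];
  return C;
}

// Matrix-vector product M * p.
Vec3 apply(const Mat3& M, const Vec3& p)
{
  Vec3 q{};
  for (int i = 0; i < 3; ++i)
    q[i] = M[i][0] * p[0] + M[i][1] * p[1] + M[i][2] * p[2];
  return q;
}

// Wasteful: one matrix-matrix product per point, although A and B never change.
std::vector<Vec3> transformRecomputing(
  const Mat3& A, const Mat3& B, const std::vector<Vec3>& points)
{
  std::vector<Vec3> out;
  out.reserve(points.size());
  for (const Vec3& p : points)
    out.push_back(apply(multiply(A, B), p));
  return out;
}

// Cached: the product is computed once and reused for every point.
std::vector<Vec3> transformCached(
  const Mat3& A, const Mat3& B, const std::vector<Vec3>& points)
{
  const Mat3 AB = multiply(A, B);
  std::vector<Vec3> out;
  out.reserve(points.size());
  for (const Vec3& p : points)
    out.push_back(apply(AB, p));
  return out;
}
```

Both functions return the same results; only the number of matrix-matrix products differs (one per point versus one in total).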

will.schroeder · November 21, 2022, 3:15pm

+1 to Andras’s comments.

My experience suggests that 1) redesigning algorithms, 2) avoiding excessive new/delete, 3) designing efficient APIs, and/or 4) threading routinely produce 5-100x performance gains. Low-level optimization typically produces very modest gains (which may be worth it for important workflows, as Andras suggests), at the expense of code complexity. And when we are talking about millions of LOC, code complexity is a many-headed medusa that I’d rather not face.