CloudFlare zlib?

dzenanz · December 23, 2019, 2:23pm

The ITK community has been looking for ways to improve image compression speeds. One low-hanging fruit is to use the CloudFlare fork of zlib, at least as a default option. But since ITK and VTK share the CMake-ification of zlib, I wanted to hear your opinions before we invest time in it. Any comments?

ben.boeckel · December 23, 2019, 6:02pm

I’d like to see a survey of all the zlib forks out there; I know there is also zlib-ng. madler’s zlib is quite unmaintained by now, but I don’t want to set out for one fork without knowing what the rest of the community is doing.

ben.boeckel · December 23, 2019, 6:05pm

One benefit I see of zlib-ng is that they have an open issue tracker. Not having one doesn’t inspire a lot of confidence in the CloudFlare fork.

dzenanz · December 23, 2019, 7:09pm

Yes, zlib-ng has been mentioned in our discussion. And it is even better than CloudFlare’s fork since it uses CMake as its normal build system. But are you in agreement that we should update ITK’s and VTK’s zlib?

ben.boeckel · December 23, 2019, 7:35pm

I agree that something needs to be done. I’m concerned that zlib-ng hasn’t made any releases or even tagged the repo yet. Is there an official release from them yet?

That is very useful to know; I strongly prefer zlib-ng given their goal. Now if only they’d bless a release… (see “Downloading zlib-ng using releases”, Issue #488 in zlib-ng/zlib-ng on GitHub).

lassoan · December 23, 2019, 8:10pm

It would be nice to use a zlib variant that allows random access (e.g., one that can store some extra indexing information at the end of the stream).

If we are also considering adding more compression formats: allowing RLE-based compression would be great, as it could be orders of magnitude faster than zlib, and for labelmap volumes it could reach a similar compression ratio.
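To make the RLE idea concrete, here is a minimal sketch of run-length encoding applied to a labelmap-like row. The sample data and the function names are made up for illustration, and the sketch only shows how few runs such data produces; it does not measure speed.

```python
import zlib
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical labelmap scanline: background, one label, background.
labels = [0] * 500 + [3] * 200 + [0] * 300
pairs = rle_encode(labels)
assert rle_decode(pairs) == labels

print(len(labels), "voxels ->", len(pairs), "runs")
print("zlib size in bytes:", len(zlib.compress(bytes(labels))))
```

Real labelmaps have many short runs along object boundaries, so the actual ratio depends heavily on the data.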

pieper · December 23, 2019, 8:22pm

There’s also pigz, which sounds reasonable (see Chris Rorden’s comments).

ben.boeckel · December 23, 2019, 9:42pm

I’d rather add extensions of existing formats as different formats. I don’t want someone filing an issue saying that VTK wrote a zlib-formatted file that can’t be read by gunzip (or VTK refusing to read a gzip-compressed file).

Discussion of new compression formats should go to another thread; this one is about finding a modern upstream source for libz.

As stated in the issue, pigz is a program, not a library we can use (as it is today). Since it is also under the madler user, I suspect it is in a similar state of limbo as libz itself.

Chris_Rorden · December 23, 2019, 10:04pm

From my perspective the main difference between zlib-ng and CloudFlare zlib is the license. The CloudFlare solution uses a few lines of GPL code (crc32-pclmul_asm.S), and is therefore not compatible with many projects. These lines of code also make CloudFlare faster than zlib-ng. It should be pointed out that Intel has public domain code to use PCLMULQDQ for CRC, so if someone wants the best of both worlds (CloudFlare speed, zlib-ng license) it would not be hard.

Using pigz in piped mode allows you the performance benefit of parallel processing compounded with CloudFlare enhancements with minimal code changes and keeping the GPL code in a separate stand-alone executable.

@lassoan Gzip/zlib is nice because it is ubiquitous and already defined in many file formats. It has aged remarkably well. However, you are correct that typical implementations do not allow fast random access, which would be really useful for many 4D neuroimaging datasets. This page seems relevant. While the engineering investment would be high, at some stage it would be nice to consider a compression format that offers modern speed/compression (e.g. Zstd), offers random access, and exploits MSB effects for scientific datatypes (e.g. 16, 32, 64 bit). Blosc sounds like a contender.

Chris_Rorden · December 24, 2019, 8:45pm

@ben.boeckel and @dzenanz I have now updated my demonstration project to compile on Windows, both using CloudFlare zlib and using piped pigz. As I demonstrate, a simple compiler directive can detect a Microsoft compiler, and one only needs to change popen to _popen and pclose to _pclose.

You both seem to see the fact that pigz is an application rather than a library as a liability. I would argue it is a tremendous asset. You do not need to complicate your makefiles, deal with API changes, or deal with a license that may not be compatible with your own code. Using a piped application, you get the speed of a dedicated library while using previously tested code for uncompressed file saving.

Regardless of preferences, since the existing formats use gz for file compression, I think that pigz is the only available parallel compressor for this format. It is well tested and has a nice license. Given the performance gains, I am an advocate.

dzenanz · December 25, 2019, 5:00pm

Not being a library has many drawbacks. It cannot be bundled into the final executable the way static libraries can, which would leave the user with only one file to deal with. A full path to pigz is needed, or it needs to be in the path or the current working directory. A mechanism for piping is much more complicated to set up in a GUI app than in a console app on Windows. From the documentation:

If used in a Windows program, the _popen function returns an invalid file pointer that causes the program to stop responding indefinitely. _popen works properly in a console application. To create a Windows application that redirects input and output, see Creating a Child Process with Redirected Input and Output in the Windows SDK.

ben.boeckel · December 26, 2019, 2:27pm

Licensing applies when code is derived from other code. Whether we’re using the pipe interface or a library call, I’m not convinced that there are zero issues just because you’re not in the same process space. Being a different process is a reasonable guideline, but it doesn’t actually hold any final say in whether code is derived from other code or not. I see it more as if you’re dependent on some functionality for some feature, you have at least some licensing responsibility towards it.

As for the other things, having it be a separate executable is fraught with all the problems listed in the previous post, and, as also pointed out there, Windows process communication is not “just” a popen spelling away.

Chris_Rorden · December 26, 2019, 11:51pm

To be clear, on Windows _popen works properly in a console application like my demo project. As you note, graphical applications can create a child process. You can always have an application fall-back to zlib if the executable is not found (as my example demonstrates).
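A minimal sketch of that fall-back pattern, written in Python rather than taken from the demo project: look for pigz on the path, pipe the data through it, and otherwise compress in-process. The function name `gzip_bytes` and its `tool` parameter are hypothetical; the only pigz behaviour assumed is that it reads stdin when given no file arguments and writes to stdout with `-c`.

```python
import gzip
import shutil
import subprocess


def gzip_bytes(data, tool="pigz"):
    """Return gzip-format bytes, using an external tool when one is on PATH."""
    exe = shutil.which(tool)
    if exe is not None:
        try:
            # With no file arguments pigz reads stdin; -c writes to stdout.
            done = subprocess.run([exe, "-c"], input=data,
                                  stdout=subprocess.PIPE, check=True)
            return done.stdout
        except (OSError, subprocess.CalledProcessError):
            pass  # fall through to the in-process path
    # Single-threaded fallback from the standard library.
    return gzip.compress(data)


payload = b"voxel data " * 10000
assert gzip.decompress(gzip_bytes(payload)) == payload
```

Either path yields a gzip stream that ordinary readers accept, so callers do not need to know which one ran.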

Since @ben.boeckel suggested that we restrict ourselves to libz-compatible compression, I think pigz is the only game in town for parallel compression. I think my demo shows that it is pretty easy to integrate across platforms. This is your project, and you need to determine whether the benefit to the user outweighs the development and maintenance time required, as well as your own aesthetic for distributing software. For my own projects I have decided to leverage pigz, but I respect your wishes for your own project.

lassoan · December 27, 2019, 12:07am

Process creation is an expensive, complex, highly-OS-dependent mechanism that can fail in many different ways. It would bring in additional dependencies at the file I/O library level, which would be very hard to justify.

Bringing in any GPL code into a project (regardless of what executables it is linked to) leads to many complications, too. You need to keep explaining to various people why there should be no concerns, write justifications, etc. It is just not worth all the trouble.

Pigz seems to use a BSD license, which would allow linking it into any application, so licensing may not be a concern.

We could add a zlib library interface for pigz. Maybe the author would even accept it into his repository.

pieper · December 27, 2019, 1:15am

@lassoan, as you know of course (because you wrote it), we have an analogous situation with the ScreenCapture module in Slicer, where we expose an interface to ffmpeg (which is GPL in general). I like this solution at the application level, because we don’t distribute ffmpeg, just help the user download and install it if they want to use it.

For a library, VTK has a well established way of providing abstract interfaces that allow applications to provide concrete interfaces that suit their needs. Such a pattern could easily work here, where a concrete implementation could try using pigz if it’s in the path or otherwise configured and fall back to a single threaded library if pigz is not available.

But in general I agree with @Chris_Rorden - my suggestions are provided for informational purposes only. The tradeoffs for VTK are best evaluated by the VTK core developers.

lassoan · December 27, 2019, 2:49am

I agree, launching external processes is OK at application level. This can be a high-level feature in a GUI application; piping multiple processes to implement an end-user workflow; etc.

For many reasons, launching an external process is not a viable option at library/logic/algorithm level.

ben.boeckel · December 27, 2019, 1:20pm

ffmpeg is LGPL, but upgrades to GPL if you enable certain modules (x264 support being the canonical example). The way we handle this in ParaView is to build it shared and with no GPL modules; anyone can then provide their own ffmpeg libraries to the ParaView binaries if they wish (same with Qt). That is sufficient there. If someone wants the GPL module, they’ll have to provide the sources for their resulting distributable to comply (we can’t for the official ParaView binaries, as some of the third party projects in the build do not have public sources).

Chris_Rorden · February 28, 2020, 9:47pm

I replaced the GPL code in the CloudFlare zlib with the BSD-like code from the Chromium browser. The pull request was just accepted, so you can now use CloudFlare zlib in place of the default zlib and it is faster on x86-64 and ARM CPUs. Those examples show the benefits for parallel workloads, but they are also present when used as a serial library.

My tests also show that zlib-ng falls between CloudFlare and default zlib in terms of compression speed, but it is faster for decompression. My validation did reveal a bug in the zlib-ng level 1 support, but it looks like patches are being reviewed.

Finally, @ningfei and I have submitted a pull request to pigz that provides CMake support and supports non-Latin characters on Windows. From previous comments I get the feeling that the VTK developers do not want to use pigz. Therefore, the more relevant part of that pull request is the simple mechanism for selecting between zlib variants in the CMake script. Our project shows how the project can be built to target a preferred version of zlib (e.g. the system zlib, CloudFlare zlib, or zlib-ng). So that might provide a nice template where you can support the faster CloudFlare zlib now, and then seamlessly move to zlib-ng as it matures.

Chris_Rorden · March 14, 2020, 1:13pm

@lassoan wanted a zlib variant that allowed random access. mgzip is a clever Python implementation, and the method could be adapted to other languages. This takes advantage of the fact that gz allows an extra field and allows multiple compressed streams to be concatenated into a single file. The benefits of this approach are:

mgzip allows parallel compression.
mgzip is able to decompress files it created in parallel.
files created by mgzip can be decompressed by gz compatible tools (though in serial).
one can quickly skip compressed chunks of a file created by mgzip, allowing random access.

There is a slight tradeoff regarding how many chunks to break a file into. More chunks means slightly poorer compression but finer grained random access.
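The concatenated-stream idea can be sketched with the Python standard library alone. This is not mgzip's actual layout: here the chunk index is kept as a separate list (an assumption made for simplicity) instead of being stored in the gzip extra field, and `compress_chunked` and `read_byte` are hypothetical names.

```python
import gzip


def compress_chunked(data, chunk_size):
    """Compress each chunk as its own gzip member; return (blob, index).

    index[i] is the (offset, length) of member i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        member = gzip.compress(data[start:start + chunk_size])
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index


def read_byte(blob, index, chunk_size, n):
    """Return byte n of the original data, decompressing one member only."""
    offset, length = index[n // chunk_size]
    chunk = gzip.decompress(blob[offset:offset + length])
    return chunk[n % chunk_size]


data = bytes(range(256)) * 4000
chunk_size = 64 * 1024
blob, index = compress_chunked(data, chunk_size)

# An ordinary gzip reader sees one stream and decodes it serially.
assert gzip.decompress(blob) == data
# Random access touches a single chunk.
assert read_byte(blob, index, chunk_size, 700_000) == data[700_000]
```

Because every uncompressed chunk except the last has the same size, the member holding byte N is simply N divided by the chunk size; a smaller chunk size lengthens the index and costs some compression ratio.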

I created the Python script e_test_mgzip.py for my compression benchmark.

ben.boeckel · March 16, 2020, 12:55pm

I think the first step would be to support such things in VTK readers if the file has that support built in. Questions I have (because I don’t know the answers to them):

Is there an index table in the compressed data stream somewhere? How do I know which hunk has byte N in it to even do random access? How do I know to ask for byte N in the first place? This is essentially a new file format though since the index table would also appear in a standard stream (AFAIK).
What file formats that are compressed that VTK (natively) supports even make sense with random access? (I say natively since things like HDF5 would need to somehow do this through libhdf5 APIs and that’s too much work I expect.)
How do we ensure that we’re not expecting such files to always have this random access? This fallback path seems like it’s going to get ignored in tests and bitrot.

I probably have other questions, but it’s a bit early right now.