Adding support for opening *any* archived single file

Design:

Some time ago, I shared a potential design for reading archived files on the ParaView Discourse.

Well, this idea never went very far, as it required a complex implementation in a third-party library that only partially existed at the time.

But since this is a topic that comes up often, I wanted to find a potential solution. Something has actually changed in the last few years: streaming support has been added to VTK and implemented in many readers.

While streaming support is in itself a good thing and lets VTK be used in many contexts other than reading data from disk, it also greatly improves what we can do in terms of interoperability with other software and within VTK itself.

Indeed, we could leverage stream support and provide other kinds of streams from different sources, such as compressed data streams. We could then imagine doing this:

Single GZip file:

vtkNew<vtkGZipCompressedResourceStream> stream;
stream->Open("/path/to/file.obj.gz");

vtkNew<vtkOBJReader> reader;
reader->SetStream(stream);
reader->Read();

Tarball:

vtkNew<vtkTGZCompressedResourceStream> stream;
stream->Open("/path/to/archive.tgz", "file.obj");

vtkNew<vtkOBJReader> reader;
reader->SetStream(stream);
reader->Read();

ZIP archive:

vtkNew<vtkZIPCompressedResourceStream> stream;
stream->Open("/path/to/archive.zip", "file.obj");

vtkNew<vtkOBJReader> reader;
reader->SetStream(stream);
reader->Read();

For ZIP and TGZ, the second argument lets us select a file from the archive.

Details:

As always, the devil is in the details.

  1. Reading the whole file in memory

Some readers' stream implementations require the whole data to be available in memory (as in void*, size_t). Unless such an implementation is rewritten to support reading from a proper stream, the whole file will necessarily be copied uncompressed into memory before actually being read.

It will work, but it will definitely be memory intensive and show limitations for large files.
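
To make the memory cost concrete, here is a minimal sketch of draining a forward-only stream into one contiguous buffer. `ForwardStream` and `ReadAll` are hypothetical names used only for illustration; they mimic the Read/EndOfStream shape discussed here and are not VTK classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a forward-only (non-seekable) stream.
// The std::string plays the role of the decompressor output.
class ForwardStream
{
public:
  explicit ForwardStream(std::string data) : Data(std::move(data)) {}

  // Copy up to `bytes` bytes into `buffer`; return the number actually read.
  std::size_t Read(void* buffer, std::size_t bytes)
  {
    const std::size_t n = std::min(bytes, this->Data.size() - this->Pos);
    std::memcpy(buffer, this->Data.data() + this->Pos, n);
    this->Pos += n;
    return n;
  }

  bool EndOfStream() const { return this->Pos == this->Data.size(); }

private:
  std::string Data;
  std::size_t Pos = 0;
};

// Drain the stream chunk by chunk into a single buffer that a
// (void*, size_t) style reader could consume afterwards.
// Peak memory is at least the full uncompressed size.
std::vector<char> ReadAll(ForwardStream& stream, std::size_t chunkSize = 4096)
{
  assert(chunkSize > 0);
  std::vector<char> out;
  std::vector<char> chunk(chunkSize);
  while (!stream.EndOfStream())
  {
    const std::size_t n = stream.Read(chunk.data(), chunk.size());
    if (n == 0)
    {
      break; // defensive: no progress, stop instead of spinning
    }
    out.insert(out.end(), chunk.begin(), chunk.begin() + static_cast<std::ptrdiff_t>(n));
  }
  return out;
}
```

The resulting vector grows to the full uncompressed size, which is exactly the limitation for large files.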

  2. Seek support

Most readers implementing stream support use the Seek method to move around the stream, which can be very useful, and at the moment this is supported by all types of streams. However, it is possible that certain compression algorithms won’t let us properly implement Seek. That would require ensuring that all readers support a Seek-less code path, which may not be trivial to implement.
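
As an illustration of what a Seek-less code path can still do, forward moves can be emulated by reading and discarding bytes, while backward moves cannot. This is only a sketch: `SequentialSource` and `SkipTo` are hypothetical names, not part of VTK.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical forward-only source: it offers Read and Tell, but no Seek.
class SequentialSource
{
public:
  explicit SequentialSource(std::string data) : Data(std::move(data)) {}

  // Copy up to `bytes` bytes into `buffer`; return the number actually read.
  std::size_t Read(void* buffer, std::size_t bytes)
  {
    const std::size_t n = std::min(bytes, this->Data.size() - this->Pos);
    std::memcpy(buffer, this->Data.data() + this->Pos, n);
    this->Pos += n;
    return n;
  }

  std::size_t Tell() const { return this->Pos; }

private:
  std::string Data;
  std::size_t Pos = 0;
};

// Emulate an absolute forward seek by reading and discarding bytes.
// Returns false when the target lies behind the current position
// (a backward seek is impossible here) or beyond the end of the data.
bool SkipTo(SequentialSource& source, std::size_t target)
{
  if (target < source.Tell())
  {
    return false;
  }
  char scratch[256];
  while (source.Tell() < target)
  {
    const std::size_t want = std::min(sizeof(scratch), target - source.Tell());
    if (source.Read(scratch, want) == 0)
    {
      return false; // reached the end of the data before the target
    }
  }
  return true;
}
```

A reader that only ever moves forward could work this way; one that jumps back (for example to a header or an offset table) would need buffering or a redesign.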

  3. Slice reading

Most readers, as we would expect, read the file part by part. However, some compression formats may not support that and may require decompressing the whole file before any data can be read.

Implementation:

Using a third-party library to read the compressed file is obviously a must. zlib would be the classic choice, but libzip seems to be well placed because it supports many decompression algorithms and allows reading a slice and seeking.

Of course, other, more custom compression formats may require other implementations and third-party libraries.

What are your thoughts on this ?

For archives, I think it would be good to support them in the vtkURILoader, as this would allow resolving and loading files indirectly. That could be useful for importers (I don’t know if any importer supports the vtkURILoader at this time) and for readers like the vtkGLTFReader:

vtkNew<vtkURILoader> loader;
loader->SetBaseFileName("gltf.zip");

auto stream = loader->Load("dataset.json"); // load main file manually from zip

vtkNew<vtkGLTFReader> reader;
reader->SetURILoader(loader); // other relative files will be resolved and loaded from the zip
reader->SetStream(stream);

To support them in the URI loader, there are different approaches: either the base file name must be a zip (last component), or we do something higher level by resolving zips within full paths, but that would be tricky.


About Seek support, note that the libzip documentation indicates that compressed archives do not support seeking, and uncompressed ZIP archives are uncommon.

Indeed, that would be interesting in the context of glTF and we should add URI loader support to other readers/importers as well.

the libzip indicates that compressed archives do not support seeking

Unfortunate, where did you find that info ?

In the link you gave:

The zip_fseek() function seeks to the specified offset relative to whence, just like fseek(3).
zip_fseek only works on uncompressed (stored), unencrypted data. When called on compressed or encrypted data it will return an error.

Next time I will try reading :slight_smile:

Anyway, it indeed shows that the Seek issue I highlighted may be critical, unless we find another third-party library that can actually Seek.

Why not just read the compressed stream into an uncompressed (buffered) one and then Seek() on it?

This is indeed a possible fallback, but not ideal when dealing with large files.

I suppose that depends on whether you need to seek to the end of the file or just in chunks no larger than the decompressed buffer.
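
A rough sketch of that middle ground: keep only the most recent decompressed bytes in a bounded window, so that short backward seeks work while older data is discarded. `WindowedReader` is a hypothetical class written for this illustration, and the `std::string` member merely stands in for a decompressor's output.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical reader that remembers the last `capacity` bytes it produced.
// Seeking backward works only inside that window; seeking forward reads
// and buffers new data.
class WindowedReader
{
public:
  WindowedReader(std::string data, std::size_t capacity)
    : Source(std::move(data)), Capacity(capacity)
  {
    assert(capacity > 0);
  }

  // Copy up to `bytes` bytes into `out`; return the number actually read.
  std::size_t Read(char* out, std::size_t bytes)
  {
    std::size_t done = 0;
    while (done < bytes)
    {
      if (this->Pos == this->Fetched && !this->FetchOne())
      {
        break; // end of data
      }
      out[done++] = this->Window[this->Pos - this->WindowStart()];
      ++this->Pos;
    }
    return done;
  }

  // Absolute seek. Fails if the target was already dropped from the window
  // or lies past the end of the data (the position is then left at the end).
  bool Seek(std::size_t target)
  {
    if (target < this->WindowStart())
    {
      return false;
    }
    while (this->Fetched < target)
    {
      if (!this->FetchOne())
      {
        this->Pos = this->Fetched;
        return false;
      }
    }
    this->Pos = target;
    return true;
  }

private:
  // Offset of the oldest byte still held in the window.
  std::size_t WindowStart() const { return this->Fetched - this->Window.size(); }

  // Pull one byte from the source, evicting the oldest byte when full.
  bool FetchOne()
  {
    if (this->Fetched == this->Source.size())
    {
      return false;
    }
    this->Window.push_back(this->Source[this->Fetched++]);
    if (this->Window.size() > this->Capacity)
    {
      this->Window.pop_front();
    }
    return true;
  }

  std::string Source; // stands in for the decompressed byte stream
  std::deque<char> Window;
  std::size_t Capacity;
  std::size_t Pos = 0;
  std::size_t Fetched = 0;
};
```

Memory use stays bounded by the window capacity, at the price of failing any seek that reaches further back than the window; the byte-at-a-time fetch keeps the sketch short and would be chunked in practice.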

With no further feedback, I’ll design a proper solution.

Here is a quick analysis of potential 3rd parties:

| | zlib | minizip (zlib) | libzip | libarchive |
|---|---|---|---|---|
| .gz | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| .xz | :no_entry: | :no_entry: | :white_check_mark: | :white_check_mark: |
| .tar.gz | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| .tar.xz | :no_entry: | :no_entry: | :white_check_mark: | :white_check_mark: |
| .zip | :no_entry: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| .rar | :no_entry: | :no_entry: | :no_entry: | :white_check_mark: |
| .7z | :no_entry: | :no_entry: | :no_entry: | :white_check_mark: |
| popular | :white_check_mark: | :white_check_mark: | :warning: | :white_check_mark: |
| maintained | :white_check_mark: | :white_check_mark: | :warning: | :white_check_mark: |
| support stream | :white_check_mark: | :white_check_mark: | :no_entry: | :white_check_mark: |
| support seek | :no_entry: | :no_entry: | :no_entry: | :no_entry: |
| in VTK/PV/PVSB | :white_check_mark: | :warning: | :no_entry: | :no_entry: |

libarchive is the clear winner here! Seek support is just not possible with most compression algorithms, so let's consider it unavailable. A pure zlib backend would be nice, but it doesn't seem necessary after all.

I’ll design assuming libarchive usage.

Here is a tentative API:

vtkCompressedResourceStream : public vtkResourceStream
{
  SupportSeek() { return false; }
  Read(void* buffer, std::size_t bytes); // impl using libarchive
  EndOfStream(); // impl using libarchive
  Tell(); // impl using libarchive
}

vtkCompressedFileResourceStream : public vtkCompressedResourceStream
{
  Open(const std::string& archive, const std::string& file); // impl using libarchive
}

vtkCompressedMemoryResourceStream : public vtkCompressedResourceStream
{
  SetBuffer(const void* buffer, std::size_t size); // impl using libarchive
}

Contrary to vtkMemoryResourceStream, support for copying the memory into the stream's own buffer does not seem needed for this specialized stream, but it could be added in the future.
The URI loader support suggested by @alexy.pellegrini above is considered outside of the scope at the moment.