New/Customizable way to load data

Hi everyone,

I’m currently working on a new and more generic way to load data for VTK readers.

Current status of data loading in VTK

In VTK, all readers provide at least SetFileName. Some of them, such as the legacy readers, the XML readers and the PLY reader, also have a function such as SetInputString or similar, while others, e.g. vtkOBJReader, do not. As far as I know, there is no other way of loading data.

New approach

The new approach is to expose a “SetInputStream” function that lets the user specify a std::istream as input for readers.
Most readers use kwsys::fstream internally, and the ones that accept a string as input use std::istringstream, so they do not need much refactoring.
Exposing this would let users plug in custom streams, which in turn would let them extend what VTK currently supports.
I have already implemented this feature locally for vtkXMLReader.

Resource

Since dealing with the standard iostream library can be hard, especially for C++ neophytes, I made a small PoC of a kwsys module that makes resource creation easier while staying compatible with C++ iostreams.

The module defines a higher-level interface base class, kwsys::Resource, which only has read and seek virtual functions and nothing else. The user can then give this Resource to a kwsys::ResourceStream, which holds a kwsys::ResourceStreambuf. This kwsys::ResourceStream is a std::istream, so it can be passed to the new SetInputStream function.

Here is the synopsis of the header:

namespace @KWSYS_NAMESPACE@ {
class Resource {
public:
  virtual ~Resource() = default;
  virtual std::streamsize read(void* buffer, std::streamsize bytes);
  virtual std::streampos seek(std::streampos pos, std::ios_base::seekdir which);
};
template<typename CharT, typename Traits = std::char_traits<CharT>, std::size_t BufferSizeValue = 1024>
class BasicResourceStreambuf : public std::basic_streambuf<CharT, Traits> {
public:
  explicit BasicResourceStreambuf(Resource& resource);
  void SetResource(Resource& resource);
  Resource& GetResource() const;
};
template<typename CharT, typename Traits = std::char_traits<CharT>, std::size_t BufferSizeValue = 1024>
class BasicResourceStream : public std::basic_istream<CharT, Traits> {
public:
  BasicResourceStream();
  explicit BasicResourceStream(Resource& resource);
  void SetResource(Resource& resource);
};
using ResourceStreambuf = BasicResourceStreambuf<char>;
using ResourceStream = BasicResourceStream<char>;
}
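
To make the Resource / streambuf / istream layering more concrete, here is a minimal sketch of how a streambuf could refill itself from a resource. It is not the PoC code: the non-template `Resource`, `MemoryResource` and `ResourceStreambuf` classes and the `ReadAll` helper below are simplified, hypothetical stand-ins (read only, no seek), written only to illustrate the mechanism.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for the proposed interface (seek omitted for brevity).
class Resource {
public:
  virtual ~Resource() = default;
  // Copies up to `bytes` bytes into `buffer`, returns the count (0 at end).
  virtual std::streamsize read(void* buffer, std::streamsize bytes) = 0;
};

// Hypothetical resource reading from a non-owning memory view.
class MemoryResource : public Resource {
public:
  MemoryResource(const char* data, std::size_t size) : Data(data), Size(size) {}
  std::streamsize read(void* buffer, std::streamsize bytes) override {
    if (bytes <= 0) return 0;
    const std::size_t count = std::min(static_cast<std::size_t>(bytes), Size - Pos);
    std::memcpy(buffer, Data + Pos, count);
    Pos += count;
    return static_cast<std::streamsize>(count);
  }
private:
  const char* Data;
  std::size_t Size;
  std::size_t Pos = 0;
};

// Streambuf that refills its internal buffer from a Resource on underflow.
class ResourceStreambuf : public std::streambuf {
public:
  explicit ResourceStreambuf(Resource& resource) : Res(&resource) {}
protected:
  int_type underflow() override {
    if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
    const std::streamsize got =
      Res->read(Buffer.data(), static_cast<std::streamsize>(Buffer.size()));
    if (got <= 0) return traits_type::eof();
    setg(Buffer.data(), Buffer.data(), Buffer.data() + got);
    return traits_type::to_int_type(*gptr());
  }
private:
  Resource* Res; // referenced only, the resource must outlive the streambuf
  std::array<char, 1024> Buffer;
};

// Reads the whole resource through a plain std::istream.
std::string ReadAll(Resource& resource) {
  ResourceStreambuf buf{resource};
  std::istream stream{&buf};
  std::string out;
  char chunk[256];
  while (stream.read(chunk, sizeof(chunk)) || stream.gcount() > 0) {
    out.append(chunk, static_cast<std::size_t>(stream.gcount()));
  }
  return out;
}
```

Any code that takes a std::istream& can consume such a stream without knowing where the bytes come from, which is the point of the SetInputStream idea.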

Example

Here is an example that loads data from memory using a MemoryResource, which is a kwsys::Resource.

// This is a resource that replaces the SetInputString functions.
// Unlike std::stringstream, it is a view: it does not copy the buffer.
// buffer is a standard container
vtksys::MemoryResource resource{buffer.data(), buffer.size()};
// The resource is only referenced, so it must stay alive until the user's last Update()
vtksys::ResourceStream stream{resource}; // Assign the resource, the stream will pass it down to the streambuf
// reader is a vtkXMLReader subclass
reader->SetReadFromInputStream(true); // Must be set to use user-provided stream
reader->SetInputStream(stream); // This is the new function, for all vtkXMLReader
reader->Update();
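
For readers wondering what such a memory resource might look like, here is a rough sketch. It is an assumption, not the PoC implementation: the `Resource` base and `MemoryResource` class are hypothetical, and `seek` takes a `std::streamoff` offset here, whereas the synopsis above uses `std::streampos`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <ios>
#include <string>

// Hypothetical interface mirroring the read/seek pair of the synopsis.
class Resource {
public:
  virtual ~Resource() = default;
  virtual std::streamsize read(void* buffer, std::streamsize bytes) = 0;
  virtual std::streampos seek(std::streamoff offset, std::ios_base::seekdir which) = 0;
};

// Non-owning view on a memory block: nothing is copied at construction.
class MemoryResource : public Resource {
public:
  MemoryResource(const void* data, std::size_t size)
    : Data(static_cast<const char*>(data)), Size(size) {}

  std::streamsize read(void* buffer, std::streamsize bytes) override {
    if (bytes <= 0) return 0;
    const std::size_t count = std::min(static_cast<std::size_t>(bytes), Size - Pos);
    std::memcpy(buffer, Data + Pos, count);
    Pos += count;
    return static_cast<std::streamsize>(count);
  }

  std::streampos seek(std::streamoff offset, std::ios_base::seekdir which) override {
    std::streamoff base = 0; // std::ios_base::beg
    if (which == std::ios_base::cur) base = static_cast<std::streamoff>(Pos);
    else if (which == std::ios_base::end) base = static_cast<std::streamoff>(Size);
    // Clamp the target into [0, Size] so Pos always stays valid.
    const std::streamoff size = static_cast<std::streamoff>(Size);
    const std::streamoff target =
      std::max<std::streamoff>(0, std::min<std::streamoff>(base + offset, size));
    Pos = static_cast<std::size_t>(target);
    return std::streampos(target);
  }

private:
  const char* Data;
  std::size_t Size;
  std::size_t Pos = 0;
};
```

Because the class only stores a pointer and a size, the caller stays responsible for keeping the underlying buffer alive, as noted in the example above.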

Wrapping

Having a custom class for resources may enable wrappers to offer an interface for custom resources in Python, C#, Java, … using a mechanism similar to vtkPythonAlgorithm, which can be subclassed from Python code.

Note

The example is also interesting because it reveals a problem with vtkXMLReader::SetInputString: the memory footprint. When using this function, the reader holds 2 copies of the data when decoding, because it stores the input string and then creates a std::istringstream, which also copies the data.

Here is the memory footprint for the following code, with buffer containing a 700 MiB VTI:

std::string buffer{...}; // (1)
reader->SetReadFromInputString(true);
reader->SetInputString(buffer); // (2)
reader->Update(); // (3) is during update, after this->OpenStream()

[Screenshot 2022-10-10 105722: memory footprint with SetInputString]

And the memory footprint using the new approach:

const std::string buffer{...}; // (1)
vtksys::MemoryResource resource{buffer.data(), buffer.size()};
vtksys::ResourceStream stream{resource};
reader->SetReadFromInputStream(true); // Must be set to use user-provided stream
reader->SetInputStream(stream); // (2)
reader->Update(); // (3) is during update, after this->OpenStream()

[Screenshot 2022-10-10 104348: memory footprint with SetInputStream]

That said, this problem could be removed by storing only the std::istringstream and freeing the initial buffer after SetInputString. This change is not very hard, and I have it locally, but it causes a change in vtkXMLUnstructuredGridReader.

I haven’t tested if other readers share this behavior.
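
The extra copy mentioned in this note comes from std::istringstream itself, which copies the string it is constructed from. The small standard-library-only sketch below shows this; `FirstCharAfterMutation` is a hypothetical helper name used for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the first character the stream yields after the source string
// has been modified.
char FirstCharAfterMutation() {
  std::string source = "abc";
  std::istringstream stream{source}; // copies `source` into the stream's own buffer
  source[0] = 'X';                   // only changes the original string
  return static_cast<char>(stream.get()); // still 'a': the stream reads its copy
}
```

A view-based resource avoids this second copy, at the price of requiring the caller to keep the buffer alive.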

Conclusion

That is everything I wanted to show for this proposal!

Here are some questions I would like to ask:

  • What are your thoughts on this proposal?
  • Do you think this feature should belong to kwsys or to VTK directly?
  • Should the resources/streams be manipulated through shared pointers?

Thank you, Alexy.

Ping: @Francois_Mazen @finetjul

/cc @spyridon97

In the future we plan to use fmt for writing files and scnlib for reading files. Both libraries make the case that streams should be avoided for performance reasons. Feel free to check the benchmarks in their GitHub repositories: eliaskosunen/scnlib (scanf for modern C++) and fmtlib/fmt (a modern formatting library).

fmt and scn are made for formatted output and input of strings, which is not the same as what I’m proposing here.
They are better than iostream and the libc equivalents at parsing and extracting single values such as integers and floats. There are other libraries like this, such as fast_float, and C++17 added similar functions, std::to_chars and std::from_chars, but none of them can represent a data source.
We can use both: the stream, which only serves to get the input data (using istream::read), and scn, or any other library such as expat for XML, to parse it. This way we won’t have significant performance issues caused by iostreams, giving us the best of both worlds!
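
As a rough illustration of that split, the sketch below gets raw bytes with istream::read and parses them with std::from_chars (C++17) instead of operator>>. `ParseInts` is a hypothetical helper; it assumes the whole input fits in one 512-byte chunk and contains only integers separated by spaces or newlines, which a real reader could not assume.

```cpp
#include <array>
#include <cassert>
#include <charconv>
#include <istream>
#include <sstream>
#include <system_error>
#include <vector>

// Reads one raw chunk with istream::read, then parses integers from it
// with std::from_chars. No formatted stream extraction is involved.
std::vector<int> ParseInts(std::istream& input) {
  std::array<char, 512> buffer;
  input.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
  const char* cur = buffer.data();
  const char* end = cur + input.gcount(); // only the bytes actually read
  std::vector<int> values;
  while (cur < end) {
    if (*cur == ' ' || *cur == '\n') { // skip separators
      ++cur;
      continue;
    }
    int value = 0;
    const auto result = std::from_chars(cur, end, value);
    if (result.ec != std::errc{}) break; // stop at the first non-integer token
    values.push_back(value);
    cur = result.ptr;
  }
  return values;
}
```

The stream is only the byte source here, so it could be swapped for any other std::istream, including a resource-backed one, without touching the parsing code.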

fmt and scnlib were made for fast, formatted, and type-safe IO compatible with the Python formatting style. They are much faster than streams, and even faster than the C functions by a significant percentage. fast_float is used by scnlib, which falls back to std::from_chars only if necessary because it is slower. fmt is faster than std::to_chars. All this information comes from their repositories.

As for your argument about using them together: I understand your point that they don’t represent a data source, but if I remember correctly you can use a std::FILE* as a data source. You might be able to combine fmt/scnlib with streams, but you probably won’t get all the performance benefits they can provide; that is something that needs to be evaluated. Finally, both libraries basically aim to replace both the C functions and streams, and they will soon become part of the C++ standard (std::format, based on fmt, is already part of C++20).

Hi @alexy.pellegrini , thanks for your work.

What are your thoughts on this proposal ?

This is awesome and very much needed !

Do you think this feature should belong to kwsys or to VTK directly?

I think this fully belongs in kwsys, at least the lowest layer.

Should the resources/streams be manipulated through shared pointers?

If this is what is needed to reduce the memory footprint, yes.

I don’t know much about fmt.

FYI @toddy @ben.boeckel

I’m very much in favour of wrapping std::istream/std::ostream into a new custom VTK stream type which is itself not a C++ standard stream. There is currently a problem on Windows where std::streampos/std::streamsize is only a 4-byte integer, so files larger than 4 GB cannot be handled.

This problem only exists with MinGW-w32:

  • Windows 64-bit: check out this compiler explorer.
  • Windows 32-bit: check out this compiler explorer.
  • With MinGW-w64 it is also 64-bit; since compiler explorer does not have MinGW, I checked locally.
    • sizeof(streamsize) = 8, sizeof(streamoff) = 8, sizeof(streampos) = 16
  • With MinGW-w32, streamsize is indeed 32-bit, but I doubt any serious user will use MinGW-w32 AND load files larger than 4 GiB…
    • sizeof(streamsize) = 4, sizeof(streamoff) = 8, sizeof(streampos) = 16
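
The sizes listed above can be checked on any toolchain with a few lines like the following. The function names are hypothetical, and the printed values depend on the compiler and standard library in use.

```cpp
#include <cassert>
#include <ios>
#include <iostream>

// True when streamsize is at least 64 bits wide, i.e. wide enough to
// express byte counts above 4 GiB.
constexpr bool StreamSizeIs64Bit() {
  return sizeof(std::streamsize) >= 8;
}

// Prints the sizes of the standard stream size/offset/position types.
void PrintStreamTypeSizes() {
  std::cout << "sizeof(streamsize) = " << sizeof(std::streamsize)
            << ", sizeof(streamoff) = " << sizeof(std::streamoff)
            << ", sizeof(streampos) = " << sizeof(std::streampos) << '\n';
}
```

A `static_assert(StreamSizeIs64Bit(), "...")` in a reader's translation unit would be one way to reject unsupported toolchains at compile time, if that is considered desirable.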

std::FILE* is not customizable, excluding platform-specific tricks.

std::ifstream::read and std::fread have no significant performance difference, because the former just calls the latter on common implementations. What is bad with iostream is the formatted input/output, i.e. >> and <<.
The following code:

std::FILE* file{...};
std::array<char, 512> buffer;
std::fread(buffer.data(), 1, buffer.size(), file); // element size 1, up to 512 elements
scn::scan(buffer, "...", ...);

Won’t be significantly* faster than:

std::ifstream ifs{...};
std::array<char, 512> buffer;
ifs.read(buffer.data(), buffer.size());
scn::scan(buffer, "...", ...);

* There will be one virtual call and a few more “classic” calls, which is nothing compared to IO and parsing.

When scnlib’s author says “This library attempts to move us ever so closer to replacing iostreams and C stdio altogether.”, the “closer” is very important: it replaces formatted memory IO (scanf and iostream’s operator<< and operator>>), but it can’t replace raw file, memory or even network IO, which is what I aim for with this proposal.


I figured out one case where the proposed method has a performance impact: when a buffer is already in memory and only needed for read-only operations.
In that specific case, streaming that buffer will make useless copies of data. If we don’t switch to a custom stream type, unrelated to the standard streams, we could enable a “zero-copy optimization”, using a flag or whatever that says, use my buffer, not yours.
This would be possible by giving the stream the task of providing the buffer: basically, instead of stream.read(buffer, size), do buffer = stream.read(size).
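
The "buffer = stream.read(size)" idea could look roughly like the sketch below. `BufferView` and `MemoryStream` are hypothetical names invented for illustration, not part of the proposal, and only the in-memory case is covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical non-owning view handed out by the stream.
struct BufferView {
  const char* Data;
  std::size_t Size;
};

// Hypothetical in-memory stream whose Read() returns a view on the
// caller's buffer instead of copying into a destination buffer.
class MemoryStream {
public:
  MemoryStream(const char* data, std::size_t size) : Data(data), Size(size) {}

  // Returns a view on at most `size` bytes and advances the position.
  BufferView Read(std::size_t size) {
    const std::size_t count = std::min(size, Size - Pos);
    const BufferView view{Data + Pos, count};
    Pos += count;
    return view;
  }

private:
  const char* Data;
  std::size_t Size;
  std::size_t Pos = 0;
};
```

A file-backed stream would still have to own an internal buffer and return a view on it, so the zero-copy benefit applies only when the data is already in memory.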

Thanks @alexy.pellegrini for this technical proposal. I hope it will enable new VTK usages with modern storage such as object storage or web resources.

I don’t like that the user needs to instantiate two objects. Could we just have an abstract ResourceStream object?

Agreed, the high-level interface should not depend on std::io stuff (nor on templated code), to avoid the compilation burden.

It would be possible to get rid of the iostream base (and make an adaptor if needed) and to not split the resource and the stream. I split them because I followed the stream/streambuf approach, but it is not necessary.

AFAIU fmt and scn are efficient for formatting / building buffers at runtime. Since here we just use streams to access data already loaded in memory, I don’t think there is any additional overhead with Alexy’s approach, but I may be missing something obvious.

Benchmarking this would never be a bad thing though.

As discussed, I got rid of the standard streams for the base implementation, and made an adaptor to std::istream for a smoother transition.
An MR has been opened on KWSys: https://gitlab.kitware.com/utils/kwsys/-/merge_requests/264

Hi everyone, since this feature will not be included in kwsys, the MR has been moved to VTK directly. You can find it here: https://gitlab.kitware.com/vtk/vtk/-/merge_requests/9663

Classes have been modified to match the VTK style of memory management.