vtkStdString is introduced because in pre C++11 world, the symbol names are too long given by the standard STL string. Since now C++11 has improved a lot on std::string, this class has finished its mission. Should we replace all vtkStdString usage with std::string in vtk code base? It can save the cost of dynamic_cast between vtkStdString and std::string.
If we touch this then I would much rather see vtkStdString (and plain char*) replaced by an encoding-aware string class, such as vtkUnicodeString.
The standard provides the codecvt
library for encoding conversions since C++11. Could you provide more info on how an encoding aware string class can help in VTK(Humm, Non-ASCII letters and foreign language)?
For reference, I introduced vtkStdString
in commit af72bcd0e in 2003. Back then C++ ABI mangling did a poor job with std::string
. Today they are much better even for C++98. The improvement is independent of C++11.
We shouldn’t need the wrapper anymore.
I would much rather see vtkStdString (and plain char*) replaced by an encoding-aware string class
Improving internationalization support is a good goal, but IMO that’s orthogonal to this proposal. All existing uses of const char*
and std::string
would need to be evaluated in such an effort regardless of vtkStdString
.
FWIW, some applications handle internationalization by unifying on UTF-8 encoding for const char*
and std::string
values, so converting away from those is not necessarily a requirement. Either way, that’s outside the scope of the vtkStdString => std::string
proposal.
It should not matter much if conversion methods are in the string class or outside, but it would be much safer to store encoding explicitly in the string class. It would be hard to enforce a convention such as “std::string in VTK is always UTF-8” throughout VTK and all VTK-based libraries and applications.
The main motivation is to be able to pass strings between VTK, Python, and Qt without loss of information.
So basically A VTK version of QString is requested here?
Edited:
I will try to open a MR to clean up the vtkStdString usage first. The discussion of adding encoding support to VTK should be discussed in another thread.
Yes. Note that such class already exists (vtkUnicodeString). Maybe it could be renamed to just vtkString to make it shorter.
We cannot enforce all the libraries an application uses to follow the same convention. If various parts of the code use different encodings in std::string then it is very hard to ensure the code is correct. If there is no automatic conversion between vtkString and std::string then no such problem exists.
If you can implement this so that none of the VTK-based applications need any change then it is fine. If we make any backward-incompatible change then it is better to bundle it with useful features (just making VTK code simpler or nicer is not good enough reason for breaking API compatibility).
What is vtkUnicodeString anyway? Unicode is not an encoding format. I believe this should be called vtkUCS4String.
How did VTK handle char* and std::string before unicode? As an ansi string without extended characters? Since UTF-8 was specifically designed to encode ansi (128 chars) to unicode without pain/bloat, it makes sense that it should be the encoding format supported by char* and std::string now. If it is well documented it should present no problem to third party libraries.
It stores a string that you can get/set in utf8 and utf16 encoding. See details here. It is already used extensively in text rendering, arrays, tables, certain file export/import.
Instead of using some specific encodings that the string class currently supports, it could be more generic, such as vtkString (similarly to QString).
The problem is that if you just assume that char* and std::string is always UTF-8 encoded in VTK then it is hard to make sure that decoding/recoding happens correctly at all the interfaces (VTK / operating system, VTK / VTK-based third-party-libraries, VTK / application).
So vtkUnicodeString is a UCS4 string with conversion methods to/from UTF8/UTF16. The class name is ambiguous.
This blog is worth reading. Using UTF-8 as the internal representation for strings in C and C++ with Visual Studio | Nubaria Blog
I agree that it makes sense to store all strings with the same encoding in an application and utf8 in std::string or char* could work quite well.
However, adding a wrapper class over std::string would have many advantages. Since there would be no automatic conversion, we could ensure encoding/decoding is not missed accidentally. It may be also more convenient and efficient to have a string class with built-in encoding/decoding methods. The wrapper class can also store a string with any encoding, so you may be able to store and retrieve a string without forcing two re-encodings, or use encoding that is most suitable in a specific environment.
If you’re suggesting replacing all occurrences of char* with vtkFancyString in the public API, I agree that would be a robust solution to string handling. However it would add extra overhead to method calls and it seems to be the opposite of what @Haocheng_Liu is suggesting here. I also wonder how many downstream users would balk at such a breaking change.
It is a significant problem that VTK cannot robustly read/write files that have special characters in their names or may generate invalid files if you change the locale. Probably majority of application developers would accept breaking changes to solve these issues. To make transition easier, we could make vtkString an alias of std::string or enable automatic const char* conversion (temporarily, controlled by a CMake flag).
Thanks for all the feedbacks everyone. This discussion helped in formulating a potential design and plan that I summarized here. We may get back to this later, when ongoing refactoring efforts (such as oriented image data support) are completed and a driving project with appropriate funding is identified.