What is state of the art: Unicode file names on Windows

ben.boeckel · October 13, 2019, 1:55am

You’ll need to fork the main kwsys repository (linked in the MR) and make an MR for it. Brad will review it there. I suspect the changes aren’t as straightforward as they seem; that library supports dozens of platforms and I don’t know if std::codecvt stuff can be assumed to work on all of them. Once it is merged into master there, running the update.sh will bring the changes into your branch.

Agreed. I’d like to see vtkStdString gone as well (though deprecation is OK, I don’t think we should keep it around; its use has been discouraged for years).

toddy · October 13, 2019, 3:38am

Hmm. In that case I’m going to introduce new methods WideToUTF8(), UTF8ToWide() and FileOpenUTF8()

efahl · October 13, 2019, 12:33pm

These are the specific classes that started me on this quest:

import vtk

# file_name is assumed to be a path containing non-ASCII characters
for reader_class in vtk.vtkBMPReader, vtk.vtkJPEGReader, vtk.vtkPNGReader, vtk.vtkTIFFReader, vtk.vtkPNMReader:
    reader = reader_class()
    reader.SetFileName(file_name)
    reader.Update()

It appears that David Gobbi’s DICOM reader already handles Unicode properly, as including ‘vtkDICOMReader’ in the above list simply complains appropriately about invalid file contents and lists out the name as expected in the error message.

toddy · October 14, 2019, 9:23am

Since images are all binary file types, I’m not worried about handling the contents, just the file name.

lassoan · October 14, 2019, 12:47pm

Many file formats that VTK supports are text-based. Ideally, for each file format you would check if there is any guidance on how to handle special characters.

If there is guidance (e.g., for DICOM and XML based formats), then you need to implement that.

If there is no guidance, then you can define a sensible behavior. You can write as UTF-8, or if that breaks commonly used parsers, force ASCII. For reading, you may use some heuristics, such as checking for a BOM and/or checking whether the content can be successfully interpreted as UTF-8. Some files are mixed text/binary/base64, so you may need to extract the text components before trying to guess their encoding.

These should all be doable, but they require some effort. You cannot ignore the file content.
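As a rough illustration of the reading heuristic described above (check for a BOM, then test whether the bytes decode as UTF-8), here is a minimal Python sketch. The function name `guess_text_encoding` and the Latin-1 fallback are assumptions made for this example, not anything VTK provides.

```python
import codecs


def guess_text_encoding(data: bytes) -> str:
    """Guess an encoding for raw file bytes (hypothetical helper, not a VTK API)."""
    # A UTF-8 byte order mark is an unambiguous signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Otherwise test whether the whole buffer is well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte to a character,
        # so decoding never fails (but may be the wrong guess).
        return "latin-1"
    # Pure ASCII is also valid UTF-8; report it separately for clarity.
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"
```

For mixed text/binary files, this would only be applied to the extracted text parts, never to the whole file.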

efahl · October 14, 2019, 3:52pm

Exactly, I’m assuming that the only bug to be resolved here is wrt the file paths themselves. Any text-based-image readers are already doing whatever is appropriate for the content, or they have bugs orthogonal to this issue.

lassoan · October 14, 2019, 4:32pm

VTK was not implemented with UTF-8 encoding in mind, so without doing any code review or testing, it is more reasonable to assume that things don’t work correctly with UTF-8 content.

Just a few examples of why things may break: File path manipulation methods (get filename from paths, etc.) may not work correctly for some UTF-8-encoded paths (as you may have a ‘/’ character as part of a Unicode character). XML files written by VTK don’t contain an encoding declaration, so readers can only guess what encoding is used. Filenames are often saved into file headers, so by using UTF-8 file names, you introduce UTF-8 content into files that did not have it before.

ben.boeckel · October 14, 2019, 4:54pm

No? UTF-8 was designed specifically to not have that problem. Specifically here, any ‘/’ byte in a UTF-8 encoded sequence is guaranteed to be the ‘/’ codepoint. I suspect VTK is actually very UTF-8 safe. The main issues come up with assuming byte length == display length and the like, splitting strings, and more. Those are the operations that need to be looked at for UTF-8 safety.

dgobbi · October 14, 2019, 5:58pm

Yes, ASCII will pass unchanged through UTF-8, and any non-ASCII octets in UTF-8 have the high bit set, so they are never mistaken for ASCII. This is in stark contrast to ISO-2022-JP, Shift-JIS, GB18030, etc etc which cause tons of headaches.
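A quick way to convince yourself of this is a small Python check. It is only a sketch over a handful of sample characters, not a proof, and the helper name `ascii_range_bytes` is made up for the example.

```python
def ascii_range_bytes(text: str, encoding: str) -> bytes:
    """Return the bytes below 0x80 found in the encoded form of text."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


# Non-ASCII characters never produce ASCII-range bytes in UTF-8.
non_ascii_samples = ["\u00e9", "\u30bd", "\u8868", "\U0001f600"]  # é, ソ, 表, 😀
utf8_is_clean = all(
    ascii_range_bytes(ch, "utf-8") == b"" for ch in non_ascii_samples
)

# In Shift-JIS the second byte of ソ is 0x5C, the ASCII backslash,
# which is exactly what breaks naive path handling on Windows.
sjis_has_backslash = b"\\" in "\u30bd".encode("shift_jis")

# Splitting UTF-8 bytes on b"/" therefore only ever splits at a real slash.
encoded_path = "\u30c7\u30fc\u30bf/\u8868\u30bd.png".encode("utf-8")  # データ/表ソ.png
directory, file_name = encoded_path.rsplit(b"/", 1)
```

Both halves of the split decode cleanly as UTF-8, which would not be guaranteed for the same operation on Shift-JIS bytes.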

I’m hugely in favor of using UTF-8 everywhere.

lassoan · October 14, 2019, 6:19pm

OK, that’s good that UTF-8 does not have issues specifically with path splitting (I remembered something like that but I could not find a reference quickly).

This may or may not be the case. We should add tests to confirm.

It is awesome that this activity has started, and even if not everything works correctly at first, the issues can be fixed as they are discovered.

toddy · October 14, 2019, 9:05pm

The only way that mixed content can be handled reliably is if the binary data is base64 encoded. Then the whole document is utf-8 encoded.
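To make that concrete, here is a small Python sketch of arbitrary binary data embedded as base64 so that the whole document is valid UTF-8. The `<blob>` element and its attribute are invented for the example and do not correspond to any VTK file format.

```python
import base64

# Every possible byte value, including many that are not valid UTF-8.
binary_payload = bytes(range(256))

# base64 output is pure ASCII, so it can sit inside UTF-8 text safely.
encoded_payload = base64.b64encode(binary_payload).decode("ascii")
document = f'<blob name="donn\u00e9es">{encoded_payload}</blob>'
document_bytes = document.encode("utf-8")

# Reading back: decode the text first, then extract and decode the payload.
text = document_bytes.decode("utf-8")
start = text.index(">") + 1
end = text.rindex("<")
recovered = base64.b64decode(text[start:end])
```

The base64 alphabet does not include `<` or `>`, so the naive extraction here is enough for the sketch; a real reader would use a proper parser.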

In general yes, but I was referring to the file readers listed by @efahl in my comment.

toddy · October 14, 2019, 9:17pm

@lassoan “/” has ASCII code 47, which is unchanged by utf8 encoding. That’s the beauty of utf8 with respect to altering/finding file paths, extensions, etc. Anything within the 7-bit ASCII range (0 to 127) is encoded as is. This was actually a request of American experts, so that pre-existing documents, written in plain English, would contain the exact same number of bytes and require no change in format. There is also no need for a byte order mark (BOM) with utf8.

If VTK were not already mostly utf8 safe, I think you would have encountered many issues from Linux/Mac users long ago.

@ben.boeckel I looked through some of the string splitting and so far it appears to be a non-issue.

Text display issues (lengths etc) only need to be handled/converted close to where they are displayed/sorted. Otherwise they can just flow through VTK untouched.
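For anyone unsure what the length issue looks like in practice, here is a short Python sketch. The file name and the helper `truncate_utf8` are invented for the example, and the helper assumes its input is already valid UTF-8.

```python
# na\u00efve_\u65e5\u672c.png is "naïve_日本.png"
name = "na\u00efve_\u65e5\u672c.png"
utf8_bytes = name.encode("utf-8")

code_point_count = len(name)    # what a user would roughly call "characters"
byte_count = len(utf8_bytes)    # what strlen() or std::string::size() reports


def truncate_utf8(data: bytes, max_bytes: int) -> str:
    """Cut valid UTF-8 to at most max_bytes without leaving a partial character.

    errors="ignore" drops the incomplete trailing sequence; it would also
    silently drop invalid bytes elsewhere, hence the valid-input assumption.
    """
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

Passing the bytes through untouched is always safe; it is only operations like the byte-based truncation above that need care.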

toddy · October 14, 2019, 11:53pm

I suspect the Python and Java wrappers cannot handle a std::string in the public API, so either vtkStdString needs to be retained or replaced with const char* parameters.

ben.boeckel · October 15, 2019, 12:10am

Seems to work fine for this call in Filters/Modeling/Testing/Python/HyperScalarBar.py:

scalarBar.SetLabelFormat("%-#6.3f")

given this:

Charts/Core/vtkAxis.h:  virtual void SetLabelFormat(const std::string &fmt);

ben.boeckel · October 15, 2019, 12:11am

Hmm. That might not be the right LabelFormat. In any case, the wrappers should support it if they don’t already. @dgobbi?

toddy · October 15, 2019, 12:48am

That’s actually a const char* parameter, from vtkScalarBarActor.h:

  //@{
  /**
   * Set/Get the format with which to print the labels on the scalar
   * bar.
   */
  vtkSetStringMacro(LabelFormat);
  vtkGetStringMacro(LabelFormat);
  //@}

#define vtkSetStringMacro(name) \
virtual void Set##name (const char* _arg) \

dgobbi · October 15, 2019, 1:29am

The Python and Java wrappers accept std::string, and they accept utf8 encoded strings. Can’t remember what they do with invalid strings, I’ll check and reply in this thread.

dgobbi · October 15, 2019, 1:53pm

Python wrapper behavior with respect to utf8:

Python str() objects are converted to utf8 when passed to VTK
Python bytes() objects are passed as-is to VTK
Each VTK std::string (and const char *) is checked by the Python wrappers and,
– If it is valid utf8, it is converted to Python str()
– If it is not valid utf8, it is converted to Python bytes()
– If a “const char *” is nullptr, it is converted to Python “None”

In other words, when VTK uses utf-8 std::string, there is direct correspondence between std::string and str(). This assumes Python 3.x, see Wrapping/Python/README.md for more info.
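The rules above can be modelled in a few lines of pure Python. This is only a sketch of the described behavior, not the wrappers' actual C implementation, and both function names are hypothetical.

```python
from typing import Optional, Union


def python_to_vtk_string(value: Union[str, bytes]) -> bytes:
    """Model of the Python -> VTK direction (hypothetical name)."""
    if isinstance(value, str):
        # str objects are converted to utf8.
        return value.encode("utf-8")
    # bytes objects are passed through as-is.
    return bytes(value)


def vtk_string_to_python(raw: Optional[bytes]) -> Union[str, bytes, None]:
    """Model of the VTK -> Python direction (hypothetical name)."""
    if raw is None:
        # A nullptr "const char *" becomes None.
        return None
    try:
        # Valid utf8 becomes str.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything that is not valid utf8 is returned as bytes.
        return raw
```

In this model, validity simply means that a strict UTF-8 decode succeeds, so a str round-trips through VTK unchanged.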

The Java wrappers convert Java strings to utf8 before passing them to VTK, and attempt to decode VTK strings as utf8 (with undefined behavior if they are not utf8).

toddy · October 15, 2019, 11:11pm

How is utf8 validity determined? Unless binary data (bytes) are converted to base64, I foresee a lot of problems handling that string data in VTK.

Why are Python byte arrays passed through string parameters at all? Shouldn’t they all be converted to utf8? I would have thought binary data parameters would be unsigned char.

ben.boeckel · October 15, 2019, 11:20pm

Binary data gets passed around as vtkDataArray or the like usually. Or void*.

Binary data should never be passed around as strings (one of the many problems with C’s char* representation for strings).