What is state of the art: Unicode file names on Windows

Sitting here in the US on a Windows 10 box with LANG set to en_US.UTF-8, I tried to load an image from a Swiss patient with an ß in the file name. VTK couldn’t open the file - “file not found”, so after playing around with some candidate solutions, I searched around and found this old thread (@dgobbi, sorry to suck you back into the vortex :slight_smile:):

http://vtk.1045678.n5.nabble.com/VTK-and-Unicode-td5727348.html

tldr: Windows sucks at Unicode, so if you don’t have the proper codepage set (generally the same one used when creating the file name), then things will not work well (write a Chinese character into a file name, try to read it with Spanish codepage, and file not found, etc). Not an issue on macOS or Linux.
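
To make the codepage mismatch concrete, here is a minimal Python sketch. It only manipulates byte strings in memory and never touches the file system; the two codepages (cp1252 and cp1251) and the example name are assumptions chosen for illustration.

```python
# A narrow (char*) file name is just bytes in whatever codepage produced it.
name = "Stra\u00dfe.png"  # "Straße.png"
raw = name.encode("cp1252")  # Western European codepage

# A process running under a different codepage (here Cyrillic, cp1251)
# reads the same bytes as a different name, so the lookup fails.
misread = raw.decode("cp1251")
print(misread != name)

# Characters outside the active codepage cannot be represented at all.
try:
    "三维图片.png".encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```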

I had already tried the Win ShortName call, which “fixed” it, but was hoping there was something a little more elegant that had been implemented in the intervening five years. Does anyone know of improvements in Windows or VTK handling of this issue?
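
For reference, the short-name workaround can be sketched with ctypes. This is only a sketch under assumptions: short_path is a hypothetical helper name, the file must already exist, and 8.3 short-name generation must be enabled on the volume (otherwise Windows may simply return the long name). GetShortPathNameW is the Win32 call being wrapped; on other platforms the helper returns the path unchanged.

```python
import ctypes
import sys


def short_path(path: str) -> str:
    """Return the 8.3 short form of an existing path on Windows.

    On other platforms the path is returned unchanged.
    """
    if sys.platform != "win32":
        return path
    get_short = ctypes.windll.kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_short.restype = ctypes.c_uint32
    # First call: ask for the required buffer size (including the NUL).
    needed = get_short(path, None, 0)
    if needed == 0:
        raise ctypes.WinError()
    buf = ctypes.create_unicode_buffer(needed)
    # Second call: fill the buffer with the short path.
    if get_short(path, buf, needed) == 0:
        raise ctypes.WinError()
    return buf.value
```

The short name is normally plain ASCII, which is why passing it to a char*-based API sidesteps the codepage problem.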

We would be interested in this, too.

Unfortunately, last time it was discussed there was no consensus even in how to store strings in VTK (see “Proposal: Should we replace vtkStdString with std::string”). Reviewing/updating file access APIs would be the next step after reaching an agreement.

C++17 filesystem seems like the most logical choice. When VTK decides to require C++17, that is.

String handling in general should be updated to use wide character strings.

I updated our applications to go from char to wchar_t, std::string to std::wstring, etc. and it wasn’t that difficult even though we have several million lines of code. For something with limited string handling like VTK, it shouldn’t be a huge project.

The tricky part was text file I/O and maintaining backwards compatibility with non-BOM files. I found some nice APIs that allow the file I/O functions to automatically convert from wide character to UTF-8 encoded strings (and vice-versa). There were only a few exceptions. If anyone wants to know how, I’d be happy to share.
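
One common way to stay compatible with non-BOM files is to use the BOM when it is present and fall back otherwise. The sketch below shows that general idea in Python; it is not the APIs referred to above. The helper name decode_text, the cp1252 fallback, and the "try UTF-8 first" heuristic are all assumptions, and UTF-32 BOMs are deliberately ignored.

```python
import codecs

# Assumption: files without a BOM were written in a legacy 8-bit codepage.
LEGACY_ENCODING = "cp1252"


def decode_text(data: bytes, legacy: str = LEGACY_ENCODING) -> str:
    """Decode file contents, using a BOM when present and a fallback otherwise."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        # No BOM: legacy 8-bit text is rarely valid UTF-8 by accident.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(legacy)


print(decode_text(codecs.BOM_UTF8 + "Straße".encode("utf-8")))
print(decode_text("Straße".encode("cp1252")))
```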

Then by compiling the code in Visual Studio with Unicode, the wide character versions of the Win32 APIs will be used and filenames won’t be a problem (except for the path length limitation, which is a separate issue). It worked for us!

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

We should be using UTF-8 everywhere internally. VTK is likely missing lots of the conversions needed to do that at its boundaries (command lines, file formats, etc.). Wide characters are certainly the wrong way to go here because Windows is also working towards using UTF-8.

@ben.boeckel That’s also my view, so I will paste this link again
http://www.nubaria.com/en/blog/?p=289

Using UTF-8 is clearly the goal. The only question is how to reach it.

We could declare that all strings on VTK public interfaces are UTF-8 encoded (unless documented otherwise) and treat all non-compliances as bugs. However, it may take a long time until all the non-compliances are found and fixed, unless there is dedicated funding to get done with this quickly. Or maybe it could be a distributed community effort (developers could sign up for reviewing, fixing, adding tests to a couple of classes each)?

If the transition cannot be done quickly then it may be safer to use vtkUnicodeString (or introduce a new type or macro) to clearly distinguish variables that store text with UTF-8 encoding from those that use unknown encoding. If there is no automatic conversion between the UTF-8 and unknown text types then the compiler can ensure that there are no errors.
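
The idea of separate text types can be sketched as follows. Python has no compile-time check, so this sketch enforces the separation at runtime (a static checker such as mypy could flag the same mistake earlier). Utf8Text, LegacyText and set_file_name are hypothetical names, not VTK classes or methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8Text:
    """Bytes that are guaranteed to be valid UTF-8 (hypothetical type)."""
    data: bytes

    def __post_init__(self) -> None:
        self.data.decode("utf-8")  # raises UnicodeDecodeError if invalid


@dataclass(frozen=True)
class LegacyText:
    """Bytes whose encoding is not known from the type alone (hypothetical type)."""
    data: bytes

    def to_utf8(self, encoding: str) -> Utf8Text:
        """Explicit conversion: the caller must state the source encoding."""
        return Utf8Text(self.data.decode(encoding).encode("utf-8"))


def set_file_name(name: Utf8Text) -> bytes:
    """Hypothetical API that only accepts UTF-8 text."""
    if not isinstance(name, Utf8Text):
        raise TypeError("expected Utf8Text, got " + type(name).__name__)
    return name.data


legacy = LegacyText(b"Stra\xdfe.png")  # cp1252 bytes, not valid UTF-8
print(set_file_name(legacy.to_utf8("cp1252")))
```

Because there is no implicit conversion, passing legacy directly to set_file_name fails instead of silently producing a wrong file name.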

That is how I would do it. It would be a non-breaking change for anyone using only ASCII (128 chars) anyway.

Is there really such a big workload for this? If all char* and std::string parameters are assumed to be UTF-8 encoded byte arrays then they are simply passed through the code unchanged until
a) they are sorted/displayed
b) they need to be specifically converted for some function calls (e.g. UTF-16 for the Windows API)
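
A minimal Python sketch of this pass-through model, assuming the name travels as UTF-8 bytes and is only re-encoded where a wide Windows API would be called. The helper name to_windows_wide is hypothetical, and no real Win32 call is made here.

```python
def to_windows_wide(utf8_name: bytes) -> bytes:
    """Re-encode a UTF-8 byte string as UTF-16LE, the form wide Win32 APIs expect."""
    return utf8_name.decode("utf-8").encode("utf-16-le")


utf8_name = "三维图片.png".encode("utf-8")

# Intermediate layers treat the name as opaque bytes: copying, storing and
# comparing for equality need no knowledge of the encoding.
stored = bytes(utf8_name)
print(stored == utf8_name)

# Only at the OS boundary is the name converted.
wide = to_windows_wide(stored)
print(wide.decode("utf-16-le"))
```

Note that sorting the raw UTF-8 bytes gives code point order, not a locale-aware collation, which is why case a) needs explicit handling.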

That would suit my needs just fine, since I’m using Python and everything is Unicode already, so ensuring that it could be encoded into UTF-8 is trivial (and in fact we’ve already done that, which borked some of our users who have funky file system encodings, but that’s another issue).

I suspect the Python bindings have a very localized bit of code that does the str/Unicode marshalling right now and it would be pretty trivial to change it to force UTF-8 for char*/std::string…
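
As an illustration of what forcing UTF-8 at the binding layer implies, here is a small Python sketch. marshal_to_utf8 is a hypothetical helper, not the actual VTK wrapper code. It also shows one way a file name from a non-UTF-8 file system can fail: on POSIX, Python represents undecodable file name bytes as lone surrogates (the "surrogateescape" handler), and those cannot be encoded as strict UTF-8.

```python
def marshal_to_utf8(text: str) -> bytes:
    """Hypothetical helper: encode a Python str for a char*/std::string parameter."""
    return text.encode("utf-8")  # strict: raises on unencodable code points


print(marshal_to_utf8("三维图片.png"))

# A name whose bytes were not valid UTF-8 when Python read them from disk.
odd_name = b"Stra\xdfe.png".decode("utf-8", "surrogateescape")
try:
    marshal_to_utf8(odd_name)
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```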

This is a good reference
http://utf8everywhere.org/

…and some simple C++11 conversion routines for Windows API calls
https://ryanclouser.com/2016/08/11/C-11-Convert-to-from-UTF-8-wchar-t/

I would say 6-12 months for a full-time developer, or several years for part-time developers. See details below.

There are about 2700 classes in VTK. All classes that interface with the operating system or any of the 30+ third-party libraries, or that do file operations, XML manipulation, console IO, text rendering, plots, widgets with labels, label mapping, message display, GUI interfacing, etc. are potentially impacted. There are 387 VTK file IO classes, so probably altogether about 500 files would need to be reviewed.

Developing the basic infrastructure (getting or developing converter classes for all platforms, taking care of vtkUnicodeString and the Python interface, implementing file IO on a few pilot classes and text rendering, and writing tests for all of these) would take about 1-2 months for a full-time developer. In addition to this, you would need to spend about 10 minutes reviewing and fixing each of the potentially impacted classes (500 classes in total), which would take about 3-4 months for a full-time developer. In total, it would be about 6 months for a full-time developer. I’m usually too optimistic in workload estimations, so probably 12 months for a single full-time developer would be more realistic. If it is done by many part-time volunteer developers then it could take several years to get to all classes (and most probably classes that people rarely use or that are too complicated would never be reviewed/updated).

I just pushed an MR for opening files with UTF-8 encoded filenames. It didn’t take long. However, it seems I need to run Utilities/KWSys/update.sh to satisfy the Kitware bot. That is going to be a problem unless I switch my build to Linux.
https://gitlab.kitware.com/vtk/vtk/merge_requests/6065

I reviewed the MR. I’ll note that the update.sh script should work within git-bash, so running on Linux isn’t required. However, getting the kwsys changes into upstream is necessary first.

Thank you very much for working on this. There is a long way to go but this is a very good start. I’ll add some comments to the merge request.

Thanks Ben.

I see the KWSYS branch in the repository. Am I supposed to make the SystemTools changes in that branch and then run update.sh to merge them into VTK master?

I will build with testing now to make sure I get all the compiler errors.

No problem. It was only a few hours’ work. Also, I had a look through the vtkUnicodeString implementation and usage and I can’t see anywhere that a change would be required.

Perhaps @efahl could share his original code that highlighted the file loading problem. It would make a good test case.

Sure, it’s pretty trivial. You don’t even need a real input file.

#!/usr/bin/env python

import vtk

# Create an empty placeholder file; its contents do not matter for this test.
file_name = '三维图片.png'
with open(file_name, 'w'):
    pass

reader = vtk.vtkPNGReader()
reader.SetFileName(file_name)
reader.Update()

Error is:

ERROR: In IO\Image\vtkPNGReader.cxx, line 118
vtkPNGReader (000001B86DE6E3F0): Unable to open file 三维图片.png

Great. I mostly just wanted to know which VTK class you were using.
I’ll make up my own character sequence.

vtkUnicodeString should be deprecated and wherever it was used before, new APIs should be added that use plain strings instead.