Feature: String tokenization

Summary

VTK now includes a constexpr (compile-time) string hashing algorithm. This enables things such as more legible (and faster) code when comparing a test string to multiple potential matches:

#include "vtkStringToken.h"
using namespace vtk::literals; // for ""_hash used below.

vtkStringToken task;
switch (task.GetId())
{
  case "foo"_hash: foo(); break;
  case "bar"_hash: bar(); break;
}

The other use-case for vtkStringToken is to tokenize a vtkStringArray into a vtkTypeUInt32Array and use the resulting integer array for visualizations.

Details

In the example above, the ""_hash string-literal operator computes the hash of the strings at build time. Then, at run time, the task’s hash is computed and compared to the fixed integer hash values in the switch() statement.

You can also construct string tokens at run-time like so:

const char* ex1 = "baz";
std::string ex2a("foo"), ex2b("bar");
vtkStringToken ex1t(ex1, std::strlen(ex1));
vtkStringToken ex2t(ex2a + ex2b);

The following sections cover more implementation details and caveats when using string tokens.

Comparisons and Containers

Exact equality and inequality tests between string tokens is fast: a single integer comparison, while inequality tests attempt to compare the source strings alphanumerically so that lexical ordering is preserved.

ex1t == ex2t; // Fast integer math
ex1t != ex2t; // Fast integer math
ex1t < ex2t;  // Slow string lexical comparison

The string token provides a specialization of std::hash<vtkStringToken> which simply returns the hash-code. This allows you to use string tokens in unordered maps and sets.

Because the less-than (<) operator is slow, you should always prefer unordered sets and maps when using string tokens for indexing.

Recovering the string from a hash

Hashes computed at run-time by calling the constructor of vtkStringToken can be recovered by calling the token’s Data() method. This is possible because the vtkStringToken class owns a static instance of vtkStringManager, which holds a map from hash values to the original strings.

If you call token.Data() on a hash computed at build time (or even at run-time using the static vtkStringToken::HashString() method), the string will not be present in the vtkStringManager and an empty string will be returned; a warning message will also be emitted on the first failed lookup.

Hash algorithm

We use the Fowler-Noll-Vo FNV-1a hash algorithm to produce a 32-bit unsigned integer (type-aliased to vtkStringToken::Hash).

This hash is not intended for cryptographic use – not that a serious person would think to use a visualization library to store passwords… but just in case you are feeling a bit silly, don’t do it.

String literal operators

VTK provides two string-literal operators to perform hashing. Both are declared in the vtk::literals namespace in vtkStringToken.h.

  1. ""_token, which produces a vtkStringToken instance by hashing the string.
  2. ""_hash, which produces an integer of type vtkStringToken::Hash by hashing the string.

Both of these perform the hash computation during compilation. As mentioned above, this means the vtkStringToken::Data() method will return an empty string. If you create a class that returns tokens created with the string-literal operators (for example, to avoid hash computation in a tight loop), you can work around this by constructing run-time hashes of the strings at some point in your code that is not run frequently (such as a static initializer called from your class constructor or application setup).

2 Likes