https://invisible-island.net/xterm/bad-utf8/
Copyright © 2020,2024 by Thomas E. Dickey


XTerm – Handling Ill-formed UTF-8

Background

Markus Kuhn added rudimentary support for UTF-8 in the Linux console in 1996. Later, in April 1999, he began the support for UTF-8 in xterm (patch #97). Kuhn's initial implementation in the Linux console did rudimentary error checking, discarding unexpected input. Kuhn's changes for xterm were adapted from the Linux console source, with an added comment:

    /* Combine UTF-8 into Unicode */
    /* Incomplete characters silently ignored,
     * should probably be better represented by U+fffc
     * (replacement character).
     */

Actually, the proper Unicode replacement character is U+FFFD, but that was a start. Shortly after, Kuhn provided a patch to use U+FFFD (in xterm patch #99).

Later that year, Kuhn created a demonstration file for valid input, and a test file for invalid input. He made minor improvements to both over the next few years, but no substantial revisions to account for changes to the Unicode conformance. The Internet Archive has a succession of versions of UTF-8-test.txt from Kuhn's website, but it has changed only a half-dozen times. The Internet Archive lacks the first two versions; here is a complete set:

XTerm (barring the occasional bug report) works with that file. XTerm also works with the demonstration file UTF-8-demo.txt, but this page is about the former, the test-file.

Others created similar demonstrations, e.g., Frank da Cruz' UTF-8 Sampler at Columbia as part of the Kermit project. Originally written to promote Kermit 95, da Cruz refocused it in 2011, noting

This, however, is a Web page, which started out as a kind of stress test for UTF-8 support in Web browsers, which was spotty when this page was first created but which has become standard in all modern browsers.

But developing a test suite for Unicode has been neglected. While the Unicode documentation takes about a hundred pages to describe how one might develop tests for conformance, it does not provide a reference implementation. As a result, developers have been presented with the opportunity of interpreting the (always incomplete) documentation. These provide interesting reading:

Providing correct data (and mostly complete procedures) is good for demonstration purposes. Applications have to handle error conditions consistently.

Problems

Kuhn's test file is simple enough. One tests the terminal by sending the test-file to the terminal, making it display the result. One gauges the success of the test by checking that a vertical bar “|” is in column 79. Because it contains ill-formed UTF-8, some of the expected display will be the replacement character. The position of the vertical bar takes that into account.
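The column check described above can be sketched in a few lines. This is only an illustration, not part of Kuhn's file or of xterm: the helper names `cell_width` and `bar_column` are hypothetical, and the width rule (combining marks take no cell, East Asian wide and fullwidth characters take two, everything else takes one) is a simplification of what a real terminal does.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Approximate number of terminal cells used by one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks overlay the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def bar_column(rendered: str) -> int:
    """Return the 1-based column of the last '|' in a rendered line.

    Raises ValueError if the line has no vertical bar.
    """
    bar = rendered.rindex("|")
    return sum(cell_width(ch) for ch in rendered[:bar]) + 1
```

A line passes when `bar_column(line) == 79`, where `line` is the text as the terminal chose to render it, with any replacement characters already substituted.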

The test file has a few problems:

The test-file was intended for terminals, but after all, this is Unicode which is supposed to work everywhere—even with a web browser. Someone attempted to transform Kuhn's test file into a webpage, which can be seen on W3C's website. It looks different from xterm:

[Figure: Compare test-file in xterm and Firefox. Panels: xterm showing section 2; Firefox showing section 2.]

On the Firefox side, the vertical bars do not line up. There are two reasons for this:

There are other differences of course. Web browsers have no concept of control characters (aside from whitespace), so they are guaranteed to do the wrong thing when told to handle a padding character.

On xterm's side, some of the characters for which Firefox displays a replacement character are shown as empty boxes. For a while (from patch #233 to patch #334), xterm would have shown a replacement character, but the current scheme avoids doing that for characters which appear to be valid but missing.

Comparing the results in a web browser was an issue to explore because Dan Gohman suggested changes to xterm's error handling that would have it imitate Firefox (or equivalently, act like a terminal whose developers imitate Firefox). That was motivated by a comment in chapter 3 of the Unicode Standard, on conformance:

U+FFFD Substitution of Maximal Subparts

An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements. See:

http://www.w3.org/TR/encoding/

which overlooked the following (quoting from Unicode 13):

The Unicode Standard does not require this practice for conformance. The following text describes this practice and gives detailed examples.

The reference to W3C deals with the way Firefox displays multiple replacement characters. W3C's recommendation for Unicode is moot with regard to terminals, and actually not everyone agreed that it was suitable for browsers. For example, in this page

How Many REPLACEMENT CHARACTERs?

Henri Sivonen disputes that. At the top of the page, he notes that the Unicode and ICU organizations amended their wording to avoid the appearance that W3C's approach is recommended (or “best practice”).
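To make the disputed practice concrete, here is a minimal sketch of "U+FFFD substitution of maximal subparts". It is not xterm's code, nor Gohman's patch. The function names are hypothetical, and the byte ranges are the well-formed UTF-8 sequences listed in the Unicode Standard. Each ill-formed run that is still a valid prefix of some well-formed sequence becomes one U+FFFD, and any other stray byte becomes its own U+FFFD.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead: int):
    """Return (length, low, high) for a lead byte, where low..high is the
    allowed range of the second byte, or None if the byte cannot start a
    multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF  # excludes overlong forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xED:
        return 3, 0x80, 0x9F  # excludes surrogates
    if lead == 0xF0:
        return 4, 0x90, 0xBF  # excludes overlong forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F  # excludes values above U+10FFFF
    return None


def decode_maximal_subparts(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal ill-formed subpart."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:
            out.append(REPLACEMENT)  # stray continuation or invalid lead
            i += 1
            continue
        length, low, high = info
        j = i + 1
        while j < i + length and j < n:
            lo, hi = (low, high) if j == i + 1 else (0x80, 0xBF)
            if not lo <= data[j] <= hi:
                break
            j += 1
        if j == i + length:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed
        else:
            out.append(REPLACEMENT)  # truncated prefix counts once
        i = j
    return "".join(out)
```

For the byte string 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64, this yields "a", three replacement characters, "b", one replacement character, "c", two replacement characters, and "d". An implementation which instead emits one replacement per ill-formed byte, or one per ill-formed run, produces a different count, and the vertical bar lands in a different column.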

Ignoring that conclusion, the suggested changes might be relevant in terms of browser-like features imitated in some other terminals. In that regard, it could be useful as a resource setting for people who must deal with scripts which rely upon this feature.

Comparisons

I decided to explore this by comparing the results from Kuhn's test-file with and without Gohman's changes.

Because those changes are incompatible with longstanding practice (more than 20 years), I refactored Gohman's changes as a new resource setting utf8Weblike.

By making the choice a resource, it is possible to make a test-script which exercises the terminal with/without the feature and compare the results. My script (called bad-utf8) uses the terminal's cursor movement and reporting controls to determine where the vertical bar is, and produce a copy of Kuhn's test-file which would give the expected results (by adding or removing spaces before the vertical bar).
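The cursor-reporting step can be illustrated with the standard VT100 exchange: the application sends DSR 6 (`ESC [ 6 n`) and the terminal answers with a cursor position report, `ESC [ row ; column R`. The sketch below is not the bad-utf8 script. The function names are hypothetical, and `query_cursor` assumes a POSIX terminal on the given file descriptors.

```python
import os
import re
import termios
import tty

CPR_QUERY = b"\x1b[6n"  # DSR 6: ask the terminal where the cursor is
_CPR_REPLY = re.compile(rb"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply: bytes):
    """Extract (row, column), both 1-based, from a cursor position report."""
    match = _CPR_REPLY.search(reply)
    if match is None:
        raise ValueError("no cursor position report in %r" % (reply,))
    return int(match.group(1)), int(match.group(2))


def query_cursor(fd_in: int = 0, fd_out: int = 1):
    """Ask the terminal for the cursor position (assumes a POSIX tty)."""
    saved = termios.tcgetattr(fd_in)
    try:
        tty.setcbreak(fd_in)  # no echo, no line buffering
        os.write(fd_out, CPR_QUERY)
        reply = b""
        while not reply.endswith(b"R"):
            chunk = os.read(fd_in, 1)
            if not chunk:
                break
            reply += chunk
    finally:
        termios.tcsetattr(fd_in, termios.TCSADRAIN, saved)
    return parse_cpr(reply)
```

Printing a test line without its newline and then calling `query_cursor()` gives the column just past the vertical bar, which can be compared with the expected position.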

Because those controls are part of the common VT100 repertoire, that script can also run successfully on other terminals.

Terminals compared

I used the script to collect information on these terminals:

Twenty should be enough. There are other terminals, but they have not distributed packages (or they are duplicates).

For testing, I used these machines (aside from xterm, the package versions cited above were current on those):

Counts of matches

Kuhn's test-file numbers each of the test cases. My script reports “0” for each terminal if the test matched exactly. Each line in the test-file which does not match exactly adds one in the report. Most test cases have only one line of data; a few (e.g., 3.1.9, 5.3.3 and 5.3.4) have more than one.
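The counting rule can be sketched as follows. The data layout is an assumption made for illustration: a list of `(case_id, observed_column)` pairs, one per data line, with 79 as the expected column. The function name is hypothetical.

```python
EXPECTED_COLUMN = 79  # where Kuhn's test file expects the vertical bar


def count_mismatches(lines):
    """Map each test-case id to the number of its lines that did not match.

    `lines` is an iterable of (case_id, observed_column) pairs. A case whose
    lines all matched is reported as 0.
    """
    counts = {}
    for case_id, column in lines:
        counts.setdefault(case_id, 0)
        if column != EXPECTED_COLUMN:
            counts[case_id] += 1
    return counts
```

A multi-line case such as 3.1.9 can therefore contribute more than one to a terminal's total, while a single-line case contributes at most one.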

Linux console, xterm (i.e., patch #358) and PuTTY were the only terminals which matched Kuhn's test-file closely (5-6 differences out of 98 data lines).

The table is here.

Amount of mismatch

The bad-utf8 script constructs a file which could be sent to the corresponding terminal with the vertical bars aligned. It does this by adding spaces.

Counting the number of spaces needed may give more insight into how closely related the terminals are for error handling. Reviewing the results, you may see that the relative ranking does not change appreciably.
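A minimal sketch of the adjustment, under stated assumptions: the line is handled as raw bytes, the last `|` on the line is the marker, and padding is added or removed immediately before it. The function name and return convention are hypothetical, not taken from bad-utf8.

```python
def adjust_line(line: bytes, observed_col: int, expected_col: int = 79):
    """Return (adjusted_line, delta) so that the bar would land in expected_col.

    delta is the number of spaces added before the last '|' (negative when
    spaces were removed). No more spaces are removed than the line has.
    """
    bar = line.rfind(b"|")
    if bar < 0:
        return line, 0  # not a data line
    head, tail = line[:bar], line[bar:]
    delta = expected_col - observed_col
    if delta >= 0:
        head += b" " * delta
    else:
        removable = len(head) - len(head.rstrip(b" "))
        take = min(-delta, removable)
        head = head[:len(head) - take]
        delta = -take
    return head + tail, delta
```

Summing the absolute values of `delta` over all data lines gives one possible per-terminal measure of mismatch.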

The table is here.

Cross-terminal differences

I wrote another script, diff-utf8, to process the data collected with bad-utf8.

Beyond seeing how closely a given terminal matches the assumptions in Kuhn's test file, one might want to know if there are groups of terminals which give similar results. It turns out that there are two groups (more than two terminals having close matches):

A few other terminals are close to one of those groups. Most are not close.

The table is here.

Analysis

That table is large, and you may not pick out the pattern easily. In vile, I can see this using the editor's highlighting:

[Figure: xterm showing section 2]

To explore this, I added a report to diff-utf8. Initially, I had only tested vte-0.60, until noticing that adding an earlier version would help explain the relationships among these terminals:

** pairwise report
.. level 0
xterm-w3c vs vte-0.60
xterm.js vs vte-0.60
xterm.js vs xterm-w3c
.. level 1
putty vs linux (2.2.1)
vte-0.46 vs konsole (2.1.2)
vte-0.60 vs macos (2.1.2)
xterm-358 vs linux (2.1.2)
xterm-w3c vs macos (2.1.2)
xterm.js vs macos (2.1.2)
.. level 2
vte-0.46 vs macos (3.3.7, 3.4)
vte-0.60 vs konsole (3.3.7, 3.4)
xterm-358 vs putty (2.1.2, 2.2.1)
xterm-w3c vs konsole (3.3.7, 3.4)
xterm.js vs konsole (3.3.7, 3.4)
.. level 3
macos vs konsole (2.1.2, 3.3.7, 3.4)
vte-0.60 vs vte-0.46 (2.1.2, 3.3.7, 3.4)
xterm-w3c vs vte-0.46 (2.1.2, 3.3.7, 3.4)
xterm.js vs vte-0.46 (2.1.2, 3.3.7, 3.4)
.. level 4
.. level 5
.. level 6
.. level 7
.. level 8
.. level 9
macos vs iTerm2 (2.2.1, 3.3.2, 3.3.3, 3.3.4, 3.3.5, 3.3.8, 3.3.9, 3.3.10, 3.4)
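In the report above, each level appears to be the number of test cases on which two terminals disagree, with the differing cases listed in parentheses. That reading is inferred from the counts, not from the diff-utf8 source. Under that assumption, and with hypothetical names and data layout (terminal name mapped to a dictionary of per-case adjustments), such a report could be produced like this:

```python
from itertools import combinations


def _case_key(case_id: str):
    """Sort '3.3.10' after '3.3.9' by comparing numeric components."""
    return tuple(int(part) for part in case_id.split("."))


def pairwise_report(results):
    """Group terminal pairs by the number of test cases on which they differ.

    `results` maps terminal name -> {case_id: adjustment}. The return value
    maps level -> list of (terminal_a, terminal_b, differing_case_ids).
    """
    levels = {}
    for a, b in combinations(sorted(results), 2):
        cases = sorted(set(results[a]) | set(results[b]), key=_case_key)
        diff = [c for c in cases if results[a].get(c) != results[b].get(c)]
        levels.setdefault(len(diff), []).append((a, b, diff))
    return levels
```

Pairs at level 0 behave identically on every test case, so clusters of such pairs are the natural candidates for groups of terminals that share error-handling behavior.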

Recapitulating that in words:

In short,

The differences in xterm patch #344 versus patch #358 bear explanation. Patch #344 happens to be the version provided in the current Debian 10 (stable), so it was useful for comparison. Before Gohman's suggested changes, he submitted a bug-report which required some scrutiny of Kuhn's test-file, since it pointed out an additional case overlooked in the fixes for patch #268. Since this was before writing bad-utf8, with visual inspection alone, it was easy to overlook an additional problem introduced in patch #328. Both are fixed in patch #357, but patch #358 was the most recently published version of xterm when this page was created.

The history of other terminals' handling of ill-formed UTF-8 is just as complicated. Confining it to the story of how web-browser behavior came to be relevant to terminals, we have this:

Unresolved Issues

The bad-utf8 script works by printing a line from the test-file, finding where the cursor is after printing, and computing an adjustment. It attempts to handle line-wrapping, but in practice some of the terminals tested differ too much in their handling of wrapping to make that work reliably. Running the script on a terminal sized at least 90 columns wide gives good results. I chose the Arch Linux console for testing because it had been set up as a wide display.

Besides wrapping, one must be careful not to interrupt the terminal while it is processing the test-cases, because that can interfere with the cursor-position report.

The bad-utf8 script updates a CSV-file for each completed test on a given terminal, while also writing a copy of the adjusted test-file. There are two CSV-files, chosen according to whether an option is used to tell it to report test successes or the amount of adjustment needed. The adjusted test-files in either case should be identical. I ran the bad-utf8 script more than once, ensuring that those files were in fact identical (indicating that no data was lost or corrupted due to timing or inadvertent interference with the cursor-position reports).

Scripts