D. J. Bernstein
Internet publication
djbdns

Internationalized domain names

People want to use internationalized domain names (IDNs): domain names such as αβγ.com. (That's ``alpha beta gamma dot com''; your browser should display the name with Greek letters.)

You should be able to register an IDN, set up computers under the name, connect to those computers by name, set up web pages under the name, set up links to those web pages, browse those web pages given the name or a link, send email from an address under the name, receive email at that address, etc.

Unfortunately, as of 2002.11, for most practical purposes, domain names are limited to the following ASCII characters: abcdefghijklmnopqrstuvwxyz, ABCDEFGHIJKLMNOPQRSTUVWXYZ, 0123456789, hyphen, and dot. There are several different reasons for this limitation.

This page discusses two proposals for allowing IDNs: first, a proposal of mine called IDNC3 (``clean, careful, conservative IDNs''); second, an extremely damaging proposal called IDNA (``IDNs in applications''). I have a separate page discussing some of the relevant software bugs.

Objectives of IDNC3

Primum non nocere. ---Hippocrates, as translated by Galen

IDNC3 is a set of changes to the Internet protocol suite and to some networking programs. The goal of IDNC3 is to let people use IDNs. IDNC3 addresses all the reasons that domain names are limited to ASCII.

IDNC3 explicitly recognizes that the value of a character-set expansion comes entirely from the visibility of the additional characters to users. There is no point in merely expanding the set of bytes allowed inside the computer; the internationalized domain name αβγ.com must be displayed with Greek letters on a typical user's screen.

Damage caused by IDNA

A careless character-set expansion, such as IDNA, will hurt Internet users in several ways. IDNC3 is designed to proceed cautiously, avoiding all of these problems.

Interoperability failures. Imagine a user registering an IDN and then encountering disasters that he would not have encountered with an ASCII domain name: people sending him email and having the email bounce, for example, or people clicking links to his web page and seeing ``page not found.''

The IDNA proposal will trigger failures of this type. IDNA tells programmers to translate 7-bit byte strings to character strings according to ``PunyCode'' rather than ASCII; PunyCode is the same as ASCII for most strings, but changes the interpretation of some special 7-bit strings (which, presumably, are not in use anywhere) to allow characters such as Greek alpha. The problem is that the resulting character strings will often be copied (through pipes, copy-and-paste, and other cross-program data-transfer mechanisms) to programs that don't know anything about PunyCode. Subsequent domain-name lookups will fail. These failures do not occur with ASCII domain names.
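The copy-and-paste failure can be sketched in Python, whose standard `idna` codec implements the IDNA conversions in the form later published as RFC 3490. That choice of codec is an assumption of this sketch, not something stated above. The helpers `naive_display` and `aware_display` are hypothetical stand-ins for a program that knows only ASCII and a program that knows PunyCode.

```python
# Sketch: one program in the chain understands PunyCode, the next does not.
name = "\u03b1\u03b2\u03b3.com"  # alpha beta gamma dot com

# An IDNA-aware program turns the name into a special 7-bit string.
wire = name.encode("idna")

def naive_display(raw: bytes) -> str:
    """Hypothetical legacy program: reads the bytes as plain ASCII."""
    return raw.decode("ascii")

def aware_display(raw: bytes) -> str:
    """Hypothetical upgraded program: reads the bytes via PunyCode."""
    return raw.decode("idna")

print(naive_display(wire))  # an ASCII string, not the Greek letters
print(aware_display(wire))  # the Greek name again

# The reverse direction is where lookups fail: a program that receives the
# Greek character string by copy-and-paste, and knows only ASCII, has no
# 7-bit name to look up.
try:
    name.encode("ascii")
except UnicodeEncodeError:
    print("legacy program cannot look up the pasted name")
```

The same two byte strings therefore mean different things depending on which program happens to be holding them.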

In contrast, IDNC3 does not allow IDN registration until all relevant software has been upgraded and tested.

Inconsistent displays of the same name. Imagine a user registering an IDN and then finding that the name is displayed to typical users around the Internet as something completely different from what he expected. This problem does not occur with ASCII domain names.

The IDNA proposal creates this problem by deliberately allowing registration of special 7-bit strings long before most programs have learned how to display PunyCode.

In contrast, as noted above, IDNC3 does not allow IDN registration until IDNs are handled properly by software.

Unnecessary implementation and deployment costs. Imagine requiring thousands of programmers to spend time writing code, and requiring millions of users to upgrade their software, for no good reason.

The IDNA proposal uses the special-purpose ``PunyCode'' encoding, which has no other applications and no previous software support. IDNA also requires that programmers engage in a complicated conversion of uppercase non-ASCII characters to lowercase. It will be impossible to correct this mistake later if any users start relying on uppercase characters.

In contrast, IDNC3 uses UTF-8, which is by far the most widely supported ASCII-compatible encoding of Unicode. RFC 2277 already requires UTF-8 in Internet protocols. Many programs already support UTF-8 and do not need to be upgraded.
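A minimal Python sketch of what ``ASCII-compatible'' means here: an all-ASCII name has identical bytes in ASCII and in UTF-8, and non-ASCII characters use only bytes 128 through 255. The name `example.com` is an arbitrary illustration, not taken from the proposal.

```python
# Sketch: UTF-8 leaves ASCII names untouched and encodes everything else
# with bytes in the range 128-255.
ascii_name = "example.com"          # arbitrary ASCII name for illustration
greek_label = "\u03b1\u03b2\u03b3"  # alpha beta gamma

# Same bytes either way, so existing ASCII names are unaffected.
same_bytes = ascii_name.encode("utf-8") == ascii_name.encode("ascii")
print(same_bytes)

# Each Greek letter becomes two bytes, all of them above 127.
utf8_bytes = greek_label.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # ce b1 ce b2 ce b3
```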

IDNC3 also prohibits registration of uppercase non-ASCII characters, and neither requires nor encourages case conversion for non-ASCII characters. This decision can be safely changed later if it turns out that users really need uppercase characters.
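One way a registrar might enforce this rule is sketched below in Python. The function name `has_uppercase_non_ascii` is hypothetical, and relying on Python's `str.isupper` as the definition of ``uppercase'' is an assumption of the sketch. Note that the check only rejects names; it performs no case conversion.

```python
def has_uppercase_non_ascii(name: str) -> bool:
    """Return True if any non-ASCII character in the name is uppercase.

    ASCII uppercase letters are not affected by this check.
    """
    return any(ord(ch) > 127 and ch.isupper() for ch in name)

print(has_uppercase_non_ascii("\u0391.com"))  # uppercase Greek Alpha: rejected
print(has_uppercase_non_ascii("\u03b1.com"))  # lowercase Greek alpha: accepted
print(has_uppercase_non_ascii("AOL.COM"))     # ASCII only: not this rule's concern
```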

Multiple semantically similar names. German ö (o umlaut) means the same thing as oe. Hebrew vowels may be omitted. I am told that typical Chinese words can be written in many different ways. Imagine being forced to register 2, or 8, or 256 separate domain names for one small company.
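The growth from 2 to 8 to 256 names can be illustrated with a small Python sketch. The table `EQUIVALENTS` is a hypothetical one-entry example covering only the German case mentioned above; each additional ambiguous character doubles the number of spellings.

```python
from itertools import product

# Hypothetical equivalence table; a real one would be far larger.
EQUIVALENTS = {"\u00f6": ["\u00f6", "oe"]}  # o umlaut means the same as oe

def spellings(label: str) -> list:
    """List every spelling of the label under the equivalence table."""
    choices = [EQUIVALENTS.get(ch, [ch]) for ch in label]
    return ["".join(parts) for parts in product(*choices)]

print(len(spellings("sch\u00f6n")))  # one ambiguous character: 2 names
print(len(spellings("\u00f6" * 3)))  # three ambiguous characters: 8 names
print(len(spellings("\u00f6" * 8)))  # eight ambiguous characters: 256 names
```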

IDNA ignores this problem, even though these semantic similarities are all much closer than the similarity of uppercase and lowercase. IDNA allows practically all Unicode characters. It will be impossible to correct this mistake later if any users start relying on separate registrations of semantically similar strings.

(On 2002.10.24, the IESG labelled the IDNA documents as ``Proposed Standards,'' claiming that ``There was WG rough concensus to advance these documents.'' In fact, the IDNA proponents have received, and are ignoring, public objections from hundreds of people. The single largest source of objections is IDNA's mishandling of semantic similarity.)

In contrast, IDNC3 allows a limited set of semantically dissimilar characters, and prohibits registration of other characters. Programs around the Internet are required to handle all characters, so that expanding the allowed character set later will not cause any problems.

Identical displays of different names. Imagine registration of wellsfargo.com with the first o replaced by a Greek omicron: wellsfargο.com. In proper typesetting, and in common Unicode fonts for computers, the omicron and o have the same appearance.

Users frequently check whether domain names are the same by reading the names. This is a safe procedure with ASCII domain names, if the reader is careful. However, if Greek omicron is allowed, the same procedure becomes inherently unreliable, no matter how careful the reader is. In some situations the procedure can be exploited by attackers to violate security.

IDNA ignores this problem too. In fact, it exacerbates the problem, by treating α.com (lowercase Greek alpha dot com) equivalently to Α.com (uppercase Greek Alpha dot com), so that the confusion of Α.com and A.com extends to confusion of α.com and a.com. It will be impossible to correct this mistake later if any users start relying on separate registrations of Α (Alpha) and A.
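Both effects can be checked with Python's standard `unicodedata` module. This is only a sketch; the helper name `differences` is hypothetical, and Python's `str.lower` stands in for IDNA's case conversion.

```python
import unicodedata

real = "wellsfargo.com"
fake = "wellsfarg\u03bf.com"  # first o replaced by Greek omicron

def differences(a: str, b: str) -> list:
    """Return the Unicode names of characters that differ, position by position."""
    return [(unicodedata.name(x), unicodedata.name(y))
            for x, y in zip(a, b) if x != y]

# The names compare unequal inside the computer even though they print alike.
print(real == fake)  # False
print(differences(real, fake))
# [('LATIN SMALL LETTER O', 'GREEK SMALL LETTER OMICRON')]

# Case conversion ties uppercase Greek Alpha to lowercase Greek alpha,
# while uppercase Greek Alpha itself is still a different character from A.
upper_alpha = "\u0391"
print(upper_alpha.lower() == "\u03b1")  # True
print(upper_alpha == "A")               # False
```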

In contrast, IDNC3 allows a limited set of visually distinct characters, and prohibits registration of other characters. Programs around the Internet are required to handle all characters, so that expanding the allowed character set later will not cause any problems.

Typing failures. Most domain names inside the computer are produced as copies of domain names already inside the computer: users click on links in browsers, reply to email, etc. However, users sometimes type domain names manually. Imagine being faced with a domain name on a business card and being unable to type it into your computer.

As noted above, IDNA allows ΑOL.COM (uppercase Greek Alpha O L dot com) and AOL.COM (uppercase A O L dot com) as separate domain names, even though these names are, when properly printed, identical. Some IDNA proponents say that typing is an important issue, but IDNA gives no explanation of how the user is supposed to distinguish Α (Alpha) from A, let alone figure out how to type Α (Alpha).

In contrast, IDNC3 (1) prohibits visually identical characters, as noted above, and (2) requires that systems support ISO 14755. Users should be able to type Unicode character 222E, for example, by holding Shift and Ctrl while typing the hexadecimal digits 222E. Then Unicode numbers can be placed on business cards, giving users worldwide a reliable way to type domain names.
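The mapping between Unicode numbers and characters is simple enough to sketch in Python. The function names `from_hex_entry` and `unicode_numbers` are hypothetical, and the four-digit uppercase format for printed numbers is an assumption of this sketch rather than something the proposal specifies.

```python
def from_hex_entry(digits: str) -> str:
    """Character produced by typing the given hexadecimal digits."""
    return chr(int(digits, 16))

def unicode_numbers(name: str) -> list:
    """Unicode numbers of the non-ASCII characters in a name, in order,
    in a form that could be printed next to the name on a business card."""
    return [f"{ord(ch):04X}" for ch in name if ord(ch) > 127]

print(from_hex_entry("222E") == "\u222e")        # True
print(unicode_numbers("\u03b1\u03b2\u03b3.com"))  # ['03B1', '03B2', '03B3']
```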

IDNC3 phase 1: fixing bugs

IDNC3 has two phases. The first phase of IDNC3 makes all software changes necessary for the second phase: specifically, all failures to handle UTF-8 IDNs are identified and fixed.

During the first phase of IDNC3, IDNs are permitted in the domain name system only for tests. Registrars are not permitted to collect money for IDNs. Users who rely on IDNs are doing so entirely at their own risk. IDNC3 can be terminated if necessary at any time during the first phase.

IDNC3 replaces all ASCII domain names in the Internet protocol suite with UTF-8 domain names. Implementations are required to treat bytes 128 through 255 just as nicely as ASCII letters, and to interpret those bytes according to the UTF-8 standard for purposes of display. Failure to support UTF-8 is considered a bug. (Registrars are a special exception to this rule.)
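As a rough illustration, here is a Python sketch of a program that treats bytes 128 through 255 like any other bytes when building a DNS name, and interprets them as UTF-8 only for display. The length-prefixed label layout is the standard DNS wire format, which is background knowledge rather than part of the text above; the function names `encode_name` and `display_label` are hypothetical.

```python
def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels of UTF-8 bytes."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("utf-8")
        if not 1 <= len(raw) <= 63:  # DNS labels are 1 to 63 bytes long
            raise ValueError("label must be 1 to 63 bytes")
        out.append(len(raw))
        out += raw                   # bytes 128-255 are copied unchanged
    out.append(0)                    # zero-length root label ends the name
    return bytes(out)

def display_label(raw: bytes) -> str:
    """Interpret label bytes as UTF-8 for display; malformed bytes are shown
    as replacement characters instead of causing a failure."""
    return raw.decode("utf-8", errors="replace")

wire = encode_name("\u03b1\u03b2\u03b3.com")
print(wire)
print(display_label(wire[1:7]))  # the Greek label, ready for the screen
```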

Here are some examples of bugs to be fixed during the first phase of IDNC3, i.e., as soon as possible:

The costs of fixing these bugs and deploying the upgraded software are consistent with the conservative IDNC3 approach: these are bugs that should be fixed anyway.

It will be helpful for implementors if protocols updated by IDNC3 have their central specifications modified accordingly. For example, the SMTP specification should be revised to drop the current ASCII requirement.

The first phase of IDNC3 also requires ISO 14755 support.

Note that, although IDNC3 is technically limited in scope to domain names, implementors are also expected to take this opportunity to fix any UTF-8 bugs in mailbox names, login names, etc.

IDNC3 phase 2: registration

The second phase of IDNC3 allows registration of UTF-8 domain names. Registrars are permitted to collect money for names with characters in a selected character set.

The selected character set will be established during the first phase of IDNC3. It will contain a reasonably broad spectrum of useful characters, subject to the visual and semantic constraints discussed above. In case of doubt, the selectors will err on the side of caution, prohibiting risky characters.

Names may be selected by criteria more complicated than separate evaluation of every character. For example, traditional Chinese characters may be allowed in a Chinese .traditional top-level domain, but prohibited in a Chinese .simplified top-level domain.
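A registration check of this kind might look like the following Python sketch. Everything in it is hypothetical: the selected character set has not been established, so the `POLICY` table uses one placeholder character per top-level domain (U+570B, a traditional form, and U+56FD, its simplified counterpart) purely to show the shape of the rule.

```python
# Hypothetical policy table; the real selected character sets would be
# established during the first phase.
ASCII_LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")
POLICY = {
    "traditional": ASCII_LDH | {"\u570b"},  # placeholder traditional character
    "simplified": ASCII_LDH | {"\u56fd"},   # placeholder simplified character
}

def may_register(name: str) -> bool:
    """Check every character of a name against its top-level domain's set."""
    labels = name.split(".")
    allowed = POLICY.get(labels[-1])
    if allowed is None or len(labels) < 2 or "" in labels:
        return False
    return all(ch in allowed for label in labels[:-1] for ch in label)

print(may_register("\u570b.traditional"))  # True
print(may_register("\u570b.simplified"))   # False: wrong domain for this form
print(may_register("\u56fd.simplified"))   # True
```

This restricts registration only. Programs elsewhere on the Internet would still be expected to handle every character.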

The second phase of IDNC3 should achieve the IDNC3 objectives. User satisfaction with the system will be monitored during the second phase; any changes, such as additions to the selected character set, can be considered at that time.