Proposal for extending the use of Unicode in ECMAScript identifiers

Written by Michael Ang <mang@subcarrier.org>
Comments to Norris Boyd <norris@netscape.com>

I. Background

ECMAScript identifiers are currently specified as being Unicode. However, only the first 128 Unicode characters are allowed, effectively restricting identifiers to ASCII.

Implementations of ECMAScript are currently in use around the world. Developers whose native language is not English should be able to have identifiers that make sense to them. Although arbitrary strings can be used for named property lookup, allowing ideographs and other Unicode characters in identifiers will make it easier for global developers to write scripts.
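The difference between named property lookup and identifiers can be illustrated with a short sketch. The object name prices and the Japanese key below are hypothetical, chosen only for illustration, and the final line assumes an engine that already accepts non-ASCII letters in identifiers, as this proposal recommends.

```javascript
// Hypothetical example data; "りんご" is Japanese for "apple".
const prices = {};

// A string key works for property lookup regardless of identifier rules.
prices["りんご"] = 120;
console.log(prices["りんご"]);

// The dot form requires the property name to be a valid identifier,
// which is what the proposed extension would permit.
console.log(prices.りんご);
```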

Since implementations must currently accept Unicode characters, extending the range of characters allowed to that of the Unicode identifier class should not be an undue burden.

Java guarantees that escaped Unicode characters occurring in source code (in the form \uNNNN) will be unescaped before compilation. This can lead to problems in dynamic languages, for example when a newline character is escaped:

Program 1 (note that \u000A is the newline character):
int foo = 5;\u000Aint bar = 6;

Program 2 (equivalent in Java, but not ECMAScript):
int foo = 5; int bar = 6;

Because allowing Unicode escapes in identifiers would complicate interpreter implementations, this is forbidden. Note that Unicode escapes are still allowed in comments and literal strings, but they are not decoded in a pass over the source text before lexical analysis, as they are in Java.
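The contrast with Java can be probed from script. The sketch below assumes an engine in which the Function constructor is available; the helper name parses is hypothetical. It checks whether the text of Program 1, rewritten with var declarations, is accepted when \u000A appears outside a string literal.

```javascript
// Hypothetical helper: reports whether a source text parses as a function body.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else (for example a policy error) is not our concern
  }
}

// A real line feed separates the two statements.
const withNewline = "var foo = 5;\nvar bar = 6;";

// The six characters \u000A appear literally in the source text.
const withEscape = "var foo = 5;\\u000Avar bar = 6;";

console.log(parses(withNewline));
console.log(parses(withEscape));
```

If the escape were decoded before tokenizing, as in Java, both texts would parse identically.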

Section 5.14 of the Unicode Standard v2.0 gives implementation guidelines for identifiers. Most identifiers legal under these guidelines are legal in ECMAScript. ECMAScript differs in that no provisions are given for ignoring formatting characters (which are forbidden).

II. Recommendations

These recommendations are made against the April 22 ECMAScript draft. Specific changes to the document appear in bold type.

§6 Source Text

Amend the first section as follows:
"However, non-ASCII Unicode characters may appear only within identifiers, comments, and string literals. In identifiers, the exact set of Unicode characters allowed is specified in Section 7.5 and corresponds to those Unicode characters with the property of alphabetic, decimal digit, combining mark, or ideographic. In string literals, any Unicode character may also be expressed as a Unicode escape sequence consisting of six ASCII characters, namely \u plus four hexadecimal digits. Within a comment, such an escape sequence is effectively ignored as part of the comment. Within a string literal, the Unicode escape sequence contributes one character to the string value of the literal."
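As a minimal sketch of the string literal rule above, the following shows the six-character escape contributing a single character to the string value. The variable names are illustrative, and String.raw is used only to display the unescaped source characters for comparison.

```javascript
// The escape \u000A contributes one character (a line feed) to the value.
const decoded = "a\u000Ab";
console.log(decoded.length);                    // 3
console.log(decoded.charCodeAt(1) === 0x000a);  // true

// For comparison, the same characters kept as written in the source:
// a, \, u, 0, 0, 0, A, b.
const asWritten = String.raw`a\u000Ab`;
console.log(asWritten.length);                  // 8

// Within a comment, \u000A is simply part of the comment text.
```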

§7.5 Identifiers

Amend the first section as follows:
"An identifier is a character sequence of unlimited length, where each character in the sequence must be a Unicode character with the property of alphabetic (category "L"), decimal digit (category "Nd"), ideographic, or combining. For historical reasons, the underscore (_) character and dollar sign ($) are also supported. The first character may not be a Unicode decimal digit.

Two ECMAScript identifiers are the same only if they have the same sequence of Unicode characters (as defined by their Unicode code points). This means that two identifiers with the same external appearance may not be identical. Composite Unicode characters are treated as distinct from their decomposed equivalents. For example, LATIN CAPITAL LETTER A (\u0041) followed by COMBINING RING ABOVE (\u030A) is distinct from LATIN CAPITAL LETTER A WITH RING ABOVE (\u00C5)."
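The code point comparison can be sketched with strings used as property names. The variable names below are illustrative. The normalize call is not part of this proposal; it is shown only to make clear that the two spellings are canonically equivalent in Unicode yet remain distinct names under the rule above.

```javascript
// LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE.
const decomposed = "\u0041\u030A";
// LATIN CAPITAL LETTER A WITH RING ABOVE.
const composed = "\u00C5";

// Compared code point by code point, the two names differ.
console.log(decomposed === composed);                   // false

// Used as names, they therefore denote two separate properties.
const scope = {};
scope[decomposed] = 1;
scope[composed] = 2;
console.log(Object.keys(scope).length);                 // 2

// Only after explicit normalization (not performed by the proposed
// identifier rules) do they compare equal.
console.log(decomposed.normalize("NFC") === composed);  // true
```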

The Unicode Standard v2.0 specifies implementation guidelines for identifiers (§5.14 Identifiers). The following significant differences between ECMAScript and these guidelines should be noted:

Since identifiers are compared based on the sequence of their code points, identifiers that appear identical may not be identical.
No provision is made for ignoring layout and format control characters.

Amend the BNF as follows:

"IdentifierName ::

IdeographicCharacter

IdentifierName CombiningCharacter

IdentifierName Extender

IdentifierName IdeographicCharacter

CombiningCharacter

A CombiningCharacter is a Unicode character with the normative combining property.

Extender

An Extender is a Unicode character in a set defined in §5.14 of the Unicode Standard 2.0. (XXX should expand this reference.)

IdentifierLetter :: one of

[ASCII table with _ and $]

Additionally, an IdentifierLetter may be a member of the Unicode letter class (those Unicode characters in category "L"), or the Unicode character FULLWIDTH LOW LINE (U+FF3F).

IdeographicCharacter ::

An IdeographicCharacter may be a Unicode character with the ideographic property. The ideographic property is an informative property of the Compatibility Han characters, the Unified Han Set, the Hangzhou-style numerals, and the IDEOGRAPHIC NUMBER ZERO.

DecimalDigit :: one of

0 1 2 3 4 5 6 7 8 9

Additionally, a DecimalDigit may be a member of the Unicode decimal number class (those Unicode characters in category "Nd")."
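The grammar above can be approximated by a single regular expression. This is a sketch under stated assumptions, not a definitive implementation: it relies on Unicode property escapes in a modern engine, whose character tables are newer than Unicode 2.0; it uses the general category M and the Extender property as stand-ins for CombiningCharacter and Extender; and the names PROPOSED_IDENTIFIER and isProposedIdentifier are hypothetical.

```javascript
// First character: IdentifierLetter (category L, _, $, FULLWIDTH LOW LINE)
// or IdeographicCharacter. Later characters may also be decimal digits (Nd),
// combining marks (M, an assumed stand-in), or extenders.
const PROPOSED_IDENTIFIER =
  /^[\p{L}\p{Ideographic}_$\uFF3F][\p{L}\p{Ideographic}_$\uFF3F\p{Nd}\p{M}\p{Extender}]*$/u;

// Hypothetical helper: tests a candidate name against the approximation.
function isProposedIdentifier(name) {
  return PROPOSED_IDENTIFIER.test(name);
}

console.log(isProposedIdentifier("foo"));       // ASCII letters
console.log(isProposedIdentifier("変数"));      // ideographs
console.log(isProposedIdentifier("a\u030A"));   // letter + combining mark
console.log(isProposedIdentifier("x\u0661"));   // letter + ARABIC-INDIC DIGIT ONE
console.log(isProposedIdentifier("9lives"));    // leading decimal digit
console.log(isProposedIdentifier("a\u200Db"));  // contains a format character
```

The last two calls are rejected by this approximation: the first because the name begins with a decimal digit, the second because format control characters are neither ignored nor allowed.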

§15.9.1 Regular Expression Pattern Matching

The textual descriptions of the \w and \W character classes do not match the character ranges given. The ranges given are what is intended (for historical reasons).

Amend the descriptions of \w and \W character classes:

\w ASCII letters, digits, and underscore; equivalent to "[a-zA-Z0-9_]".

\W Any character not an ASCII letter, digit, or underscore; equivalent to "[^a-zA-Z0-9_]".
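The stated equivalence can be checked with a short sketch. The sample characters and variable names are illustrative; the check only confirms that \w and the explicit ASCII range agree on each sample, including a non-ASCII letter.

```javascript
const wordClass = /^\w$/;
const explicitRange = /^[a-zA-Z0-9_]$/;

// Illustrative samples: ASCII letters, a digit, underscore,
// a non-ASCII letter, punctuation, and a space.
const samples = ["a", "Z", "7", "_", "é", "-", " "];

// True when both patterns classify every sample the same way.
const agree = samples.every(
  (ch) => wordClass.test(ch) === explicitRange.test(ch)
);
console.log(agree);

// A non-ASCII letter falls under \W, not \w.
console.log(/\w/.test("é"), /\W/.test("é"));
```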

Last modified: Fri Dec 18 18:58:34 PST 1998