Proposal for extending the use of Unicode in ECMAScript identifiers
Written by Michael Ang <mang@subcarrier.org>
Comments to Norris Boyd <norris@netscape.com>

I. Background

ECMAScript identifiers are currently specified as being Unicode. However, only the first 128 Unicode characters are allowed, effectively restricting identifiers to ASCII.

Implementations of ECMAScript are currently in use around the world. Developers whose native language is not English should be able to have identifiers that make sense to them. Although arbitrary strings can be used for named property lookup, allowing ideographs and other Unicode characters in identifiers will make it easier for global developers to write scripts. Since implementations must currently accept Unicode characters, extending the range of characters allowed to that of the Unicode identifier class should not be an undue burden.

Java guarantees that escaped Unicode characters occurring in source code (in the form \uNNNN) will be unescaped before compilation. This can lead to problems in dynamic languages, for example when a newline character is escaped:
Program 1 (note that \u000A is the newline character):
Program 2 (equivalent in Java, but not ECMAScript):

Because allowing Unicode escapes in identifiers would complicate interpreter implementations, this is forbidden. Note that Unicode escapes are still allowed in comments and literal strings, but are not decoded.

Section 5.14 of the Unicode Standard v2.0 gives implementation guidelines for identifiers. Most identifiers legal under these guidelines are legal in ECMAScript. ECMAScript differs in that no provisions are given for ignoring formatting characters (which are forbidden).

II. Recommendations

These recommendations are made against the April 22 ECMAScript draft. Specific changes to the document appear in bold type.

§6 Source Text

Amend the first section as follows:

§7.5 Identifiers
Amend the first section as follows:

"Two ECMAScript identifiers are the same only if they have the same sequence of Unicode characters (as defined by their Unicode code points). This means that two identifiers with the same external appearance may not be identical. Composite Unicode characters are treated as distinct from their decomposed equivalents. For example, LATIN CAPITAL LETTER A (\u0041) followed by COMBINING RING ABOVE (\u030A) is distinct from LATIN CAPITAL LETTER A WITH RING ABOVE (\u00C5)."
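
The identity rule above can be illustrated with a minimal sketch in present-day JavaScript. The helper name sameIdentifier and the lookup table are hypothetical and exist only for this illustration; the sketch compares names code point by code point and performs no normalization, which is what the amended text calls for.

```javascript
// Two spellings that render identically as "Å".
const composed = "\u00C5";          // LATIN CAPITAL LETTER A WITH RING ABOVE
const decomposed = "\u0041\u030A";  // LATIN CAPITAL LETTER A + COMBINING RING ABOVE

// Hypothetical helper: two names match only if their code point
// sequences are identical. No normalization is applied.
function sameIdentifier(a, b) {
  const pa = Array.from(a, ch => ch.codePointAt(0));
  const pb = Array.from(b, ch => ch.codePointAt(0));
  return pa.length === pb.length && pa.every((cp, i) => cp === pb[i]);
}

// Used as property names, the two spellings address different slots.
const table = {};
table[composed] = "precomposed";
table[decomposed] = "decomposed";

console.log(sameIdentifier(composed, decomposed)); // false
console.log(Object.keys(table).length);            // 2
```

A script author who wants the two spellings to be interchangeable would have to normalize names before use (for example with String.prototype.normalize); the proposal does not ask implementations to do so.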
The Unicode Standard v2.0 specifies implementation guidelines for identifiers
(§5.14 Identifiers). The following significant differences between ECMAScript
and these guidelines should be noted:
Amend the BNF as follows:
"IdentifierName ::
    IdeographicCharacter
    IdentifierName CombiningCharacter
    IdentifierName Extender
    IdentifierName IdeographicCharacter
    IdentifierName IdentifierLetter
    IdentifierName DecimalDigit
CombiningCharacter
Extender
IdentifierLetter :: one of
    [ASCII table with _ and $]

Additionally, an IdentifierLetter may be a member of the Unicode letter class (those Unicode characters in category "L"), or the Unicode character FULLWIDTH LOW LINE (U+FF3F).

IdeographicCharacter ::
DecimalDigit :: one of
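
As a rough sketch of the IdentifierLetter rule, the check below uses a Unicode property escape in a present-day JavaScript regular expression. The function name isIdentifierLetter is hypothetical, and the sketch assumes the ASCII table referred to above consists of A-Z, a-z, "_" and "$". It does not cover IdeographicCharacter, CombiningCharacter, or Extender.

```javascript
// Hypothetical helper, not part of the proposal text.
// Assumption: the ASCII table is A-Z, a-z, "_" and "$".
function isIdentifierLetter(ch) {
  // \p{L} requires the "u" flag and matches any code point whose
  // Unicode general category is "L" (Lu, Ll, Lt, Lm, Lo).
  // \uFF3F is FULLWIDTH LOW LINE.
  return /^(?:[A-Za-z_$]|\p{L}|\uFF3F)$/u.test(ch);
}

console.log(isIdentifierLetter("a"));      // true  (ASCII letter)
console.log(isIdentifierLetter("$"));      // true  (ASCII table)
console.log(isIdentifierLetter("\u00E9")); // true  (é, category Ll)
console.log(isIdentifierLetter("\uFF3F")); // true  (FULLWIDTH LOW LINE)
console.log(isIdentifierLetter("7"));      // false (a DecimalDigit, not a letter)
console.log(isIdentifierLetter("\u030A")); // false (combining mark, category Mn)
```

The pattern is anchored at both ends, so it accepts exactly one code point per call; a full IdentifierName check would apply it to the first character and then walk the remaining productions.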
§15.9.1 Regular Expression Pattern Matching

The textual descriptions of the \w and \W character classes do not match with the character ranges given. The ranges given are what is intended (for historical reasons).
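
A small sketch of what range-based \w means in practice, assuming the ranges in question are the ASCII ones (A-Z, a-z, 0-9 and "_") that present-day ECMAScript engines implement. The helper isWordChar is a hypothetical name used only here.

```javascript
// Hypothetical helper: does a single character match \w?
// Assumption: \w is the ASCII range set [A-Za-z0-9_].
const isWordChar = ch => /^\w$/.test(ch);

const samples = ["a", "Z", "7", "_", "\u00E9", "\u044F", "$"];
const results = samples.map(ch => [ch, isWordChar(ch)]);

// Letters outside ASCII (é, я) are identifier letters under this
// proposal but are still not matched by \w.
console.log(results);
```

Under that assumption, "$" is also excluded from \w even though it is a legal identifier character, so \w and the identifier grammar should not be treated as interchangeable.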
Amend the descriptions of \w and \W character classes:
Last modified: Fri Dec 18 18:58:34 PST 1998
Copyright © 1998-1999 The Mozilla Organization.