Character Set Converters
by Catalin Rotaru <cata@netscape.com>
Last Modified: 14/Dec/1998
Introduction
From a user's point of view, a human-readable string is an array of characters.
But in order to store this text in a computer, an encoding (character
set) must be used. Internally, NGLayout uses Unicode. However, the page
author may use a different character set, and the font author may use yet
another one. So our system must be able to first convert data
from the input character set into the internal encoding (Unicode), and
then into the output character set in order to do the rendering. This is
what the Character Set Converters are for: converting data between various
encodings. One thing to keep in mind is that a character set is not a converter.
A character set is a name, a label for an encoding. A type, if you want.
A converter is a piece of code able to convert data between two different
encodings.
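To make the two-stage pipeline concrete, here is a minimal sketch (not NGLayout code) that decodes ISO-8859-1 bytes into Unicode and then encodes the result as UTF-8. The names DecodeLatin1, EncodeUtf8 and Latin1ToUtf8 are hypothetical, std::u16string stands in for the internal Unicode string, and the encoder assumes text without surrogate pairs.

```cpp
#include <cassert>
#include <string>

// Hypothetical decoder: ISO-8859-1 bytes -> Unicode (UTF-16 code units).
// Every Latin-1 byte maps to the code point with the same numeric value.
std::u16string DecodeLatin1(const std::string& bytes) {
  std::u16string out;
  out.reserve(bytes.size());
  for (unsigned char b : bytes) {
    out.push_back(static_cast<char16_t>(b));
  }
  return out;
}

// Hypothetical encoder: Unicode (UTF-16, no surrogate pairs) -> UTF-8.
std::string EncodeUtf8(const std::u16string& text) {
  std::string out;
  for (char16_t c : text) {
    assert(c < 0xD800 || c > 0xDFFF);  // surrogates are not handled in this sketch
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else if (c < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xE0 | (c >> 12)));
      out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// Input charset -> internal Unicode -> output charset.
std::string Latin1ToUtf8(const std::string& bytes) {
  return EncodeUtf8(DecodeLatin1(bytes));
}
```

Note that "ISO-8859-1" and "UTF-8" appear here only as labels for the two encodings; the functions are the converters.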
Design & Architecture
The Character Set Converter module contains two main components:
- The ConverterManager, implementing nsICharsetConverterManager
  - It is responsible for managing all the converters.
  - It will: resolve charset aliases into canonical names, maintain a mapping
    between converters and the charsets they convert from and into, return
    a list of all the encodings for which we have a converter, and so on.
- The Converter(s), implementing nsICharsetConverter,
  and its factory, implementing nsICharsetConverterInfo
  - The converter converts between two character sets.
  - The Charset Converter Info is a short description of the converter: which
    charsets it converts between.
Extensibility
Our main goal for the new model is to have full drop-in extensibility for
the converters and their corresponding charsets. That means that if a
user adds a plugin Converter(FooCharset => Unicode), that charset will
have full rights: for example, it will appear in the [View.Character Set]
menu and the converter will be used to decode incoming data encoded in
the FooCharset.
The reason for this goal is that encodings are usually
grouped on a per-language basis. Instead of gathering all the known
converters and shipping a converter library containing all known charsets
(this can get quite big in time!), we'd rather offer a basic distribution
containing the most used converters, and per-language support through SmartUpdate
or Plugins. This also gives users the possibility to add converters for
the Foo legacy encoding, which is not known or used enough to be included
in a Netscape distribution.
XXX Further documentation to be added here as the extensibility mechanisms
are resolved at the XPCOM level.
High-level API
This API is expected to be used by most of the users. It should give very
easy access to the most common converter functions. It should be at the
stream level: for example, something like new UnicodeInputStream(String
* aCharset), or new String(byte * aBuffer, String * aCharset). You get
the idea: type safety and all, simplicity. The Converter Manager is well
hidden under the hood, and you can very well ignore it if you don't need the
extra functionality. Hell, you don't even know you are using a Converter!
XXX Further documentation to be added here as the high level API is
designed.
Low-level API
This API is the most powerful and the most general one. It gives you direct
access to the converters. The downside is that you must be extra careful
here with the data types, and you have to manage more complexity.
First you get a character set converter from the converter manager using
the following API:
/**
 * Interface for a Manager of Charset Converters.
 *
 * @created         17/Nov/1998
 * @author          Catalin Rotaru [CATA]
 */
class nsICharsetConverterManager : public nsISupports
{
public:

  /**
   * Finds a Converter between the source and the destination character
   * sets.
   *
   * @param aSrc    [IN] the known name/alias of the source character set
   * @param aDest   [IN] the known name/alias of the destination character set
   * @param aResult [OUT] the character set converter
   * @return        NS_CONVERTER_NOT_FOUND if no converter was found for
   *                these charsets
   */
  NS_IMETHOD GetConverter(const nsString * aSrc, const nsString * aDest,
      nsICharsetConverter ** aResult) = 0;

  /**
   * Returns a list of character sets for which we have converters (from the
   * given charset into them).
   *
   * @param aCharset [IN] the name/alias of the source character set
   * @param aResult  [OUT] a NULL-terminated array of pointers to Strings
   */
  NS_IMETHOD GetCharsetsConvertedFrom(const nsString * aCharset,
      nsString ** aResult) = 0;

  /**
   * Returns a list of character sets for which we have converters (from them
   * into the given charset).
   *
   * @param aCharset [IN] the name/alias of the destination character set
   * @param aResult  [OUT] a NULL-terminated array of pointers to Strings
   */
  NS_IMETHOD GetCharsetsConvertedInto(const nsString * aCharset,
      nsString ** aResult) = 0;

  /**
   * Resolves the canonical name of a character set. If the given name is
   * unknown to the resolver, a new identical string will be returned! This
   * way, new & unknown charsets are not rejected and they are treated as
   * no-aliases charsets.
   *
   * @param aCharset [IN] the known name/alias of the character set
   * @param aResult  [OUT] the canonical name of the character set
   */
  NS_IMETHOD GetCharsetName(const nsString * aCharset,
      nsString ** aResult) = 0;
};
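The pass-through behaviour described for GetCharsetName can be sketched as follows. This is only an illustration under stated assumptions: ToyAliasResolver, AddAlias and Resolve are hypothetical names, std::string replaces nsString, and the case-insensitive matching is an assumption rather than something this document specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Hypothetical alias table: maps known aliases to a canonical charset name.
class ToyAliasResolver {
 public:
  void AddAlias(const std::string& alias, const std::string& canonical) {
    assert(!alias.empty() && !canonical.empty());
    mAliases[Lower(alias)] = canonical;
  }

  // Returns the canonical name for a known alias. An unknown name is returned
  // unchanged, so new charsets are treated as charsets without aliases.
  std::string Resolve(const std::string& name) const {
    std::map<std::string, std::string>::const_iterator it =
        mAliases.find(Lower(name));
    return it == mAliases.end() ? name : it->second;
  }

 private:
  // Assumption: alias lookup ignores ASCII case.
  static std::string Lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
      return static_cast<char>(std::tolower(c));
    });
    return s;
  }

  std::map<std::string, std::string> mAliases;
};
```

With this shape, a plugin charset such as FooCharset needs no entry in the table: resolving it simply yields "FooCharset" again.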
Then you use the Converter with the following
API:
/**
 * Interface for a Charset Converter.
 *
 * XXX Compare this interface with the one from the C++ standard
 *
 * @created         23/Nov/1998
 * @author          Catalin Rotaru [CATA]
 */
class nsICharsetConverter : public nsISupports
{
public:

  /**
   * Converts the data from one character set to another.
   *
   * @param aDest       [IN/OUT] the destination data buffer
   * @param aDestOffset [IN] the offset in the destination data buffer
   * @param aDestLength [IN/OUT] the length of destination data buffer; after
   *                    conversion will contain the number of bytes written
   * @param aSrc        [IN] the source data buffer
   * @param aSrcOffset  [IN] the offset in the source data buffer
   * @param aSrcLength  [IN/OUT] the length of source data buffer; after
   *                    conversion will contain the number of bytes read
   * @param finish      [IN] if this is the last buffer in this conversion;
   *                    the converter has the possibility to write some extra
   *                    data, flush its final state (but only if success!)
   * @return            error code
   */
  NS_IMETHOD Convert(char * aDest, PRInt32 aDestOffset, PRInt32 * aDestLength,
      const char * aSrc, PRInt32 aSrcOffset, PRInt32 * aSrcLength,
      PRBool finish) = 0;

  /**
   * Resets the charset converter so it may be reused on a different buffer.
   */
  NS_IMETHOD Reset() = 0;
};
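The IN/OUT length parameters of Convert are easiest to understand from a caller's loop. Below is a minimal sketch with a toy ISO-8859-1 to UTF-8 converter that mimics the parameter conventions above. Everything named Toy*, the ConvertAll helper and the two result codes are hypothetical. The typedefs stand in for the NSPR types, and the sketch assumes that each length counts the bytes available starting at the corresponding offset, which this document does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-ins for the NSPR types used in the interface above.
typedef std::int32_t PRInt32;
typedef bool PRBool;

// Hypothetical result codes; the real error codes are not listed here.
enum ToyResult { TOY_OK = 0, TOY_NEED_MORE_OUTPUT = 1 };

// Toy ISO-8859-1 -> UTF-8 converter (stateless, so finish is ignored).
class ToyLatin1ToUtf8 {
 public:
  ToyResult Convert(char* aDest, PRInt32 aDestOffset, PRInt32* aDestLength,
                    const char* aSrc, PRInt32 aSrcOffset, PRInt32* aSrcLength,
                    PRBool /* finish */) {
    assert(aDest && aDestLength && aSrc && aSrcLength);
    char* dest = aDest + aDestOffset;
    const char* src = aSrc + aSrcOffset;
    PRInt32 written = 0;
    PRInt32 read = 0;
    ToyResult result = TOY_OK;
    while (read < *aSrcLength) {
      unsigned char b = static_cast<unsigned char>(src[read]);
      PRInt32 needed = (b < 0x80) ? 1 : 2;
      if (written + needed > *aDestLength) {
        result = TOY_NEED_MORE_OUTPUT;  // stop before a partial character
        break;
      }
      if (b < 0x80) {
        dest[written++] = static_cast<char>(b);
      } else {
        dest[written++] = static_cast<char>(0xC0 | (b >> 6));
        dest[written++] = static_cast<char>(0x80 | (b & 0x3F));
      }
      ++read;
    }
    *aDestLength = written;  // OUT: bytes actually written
    *aSrcLength = read;      // OUT: bytes actually read
    return result;
  }

  void Reset() {}  // nothing to reset in a stateless converter
};

// Caller loop: pushes the whole input through a small fixed-size buffer,
// advancing by the number of bytes the converter reports as read.
std::string ConvertAll(ToyLatin1ToUtf8& conv, const std::string& input) {
  std::string out;
  char buffer[4];
  const PRInt32 total = static_cast<PRInt32>(input.size());
  PRInt32 srcPos = 0;
  while (srcPos < total) {
    PRInt32 destLen = static_cast<PRInt32>(sizeof(buffer));
    PRInt32 srcLen = total - srcPos;
    conv.Convert(buffer, 0, &destLen, input.data(), srcPos, &srcLen, true);
    out.append(buffer, static_cast<std::string::size_type>(destLen));
    srcPos += srcLen;
  }
  return out;
}
```

The loop terminates because the four-byte buffer always has room for at least one converted character, so every call reads at least one source byte.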
The converter discovery mechanism uses the following
description API, which is implemented by the Converter factory:
/**
 * Interface for getting the Charset Converter information.
 *
 * @created         08/Dec/1998
 * @author          Catalin Rotaru [CATA]
 */
class nsICharsetConverterInfo : public nsISupports
{
public:

  /**
   * Returns the character set this converter is converting from.
   *
   * @param aCharset [OUT] a name/alias for the source charset
   */
  NS_IMETHOD GetCharsetSrc(nsString ** aCharset) = 0;

  /**
   * Returns the character set this converter is converting into.
   *
   * @param aCharset [OUT] a name/alias for the destination charset
   */
  NS_IMETHOD GetCharsetDest(nsString ** aCharset) = 0;
};
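One way the manager could use this description API for discovery is to ask every converter factory for its source and destination charsets and index the answers. The sketch below illustrates that idea only; it is not the actual discovery mechanism, which depends on XPCOM. ToyConverterInfo, SimpleInfo and ToyRegistry are hypothetical names, and std::string replaces nsString.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for nsICharsetConverterInfo.
struct ToyConverterInfo {
  virtual ~ToyConverterInfo() {}
  virtual std::string GetCharsetSrc() const = 0;
  virtual std::string GetCharsetDest() const = 0;
};

// A factory description that simply stores the two charset names.
class SimpleInfo : public ToyConverterInfo {
 public:
  SimpleInfo(const std::string& src, const std::string& dest)
      : mSrc(src), mDest(dest) {}
  std::string GetCharsetSrc() const override { return mSrc; }
  std::string GetCharsetDest() const override { return mDest; }

 private:
  std::string mSrc;
  std::string mDest;
};

// Hypothetical registry: records which (source, destination) pairs exist.
class ToyRegistry {
 public:
  void Register(const ToyConverterInfo& info) {
    assert(!info.GetCharsetSrc().empty() && !info.GetCharsetDest().empty());
    mDestsBySrc[info.GetCharsetSrc()].insert(info.GetCharsetDest());
  }

  bool HasConverter(const std::string& src, const std::string& dest) const {
    auto it = mDestsBySrc.find(src);
    return it != mDestsBySrc.end() && it->second.count(dest) > 0;
  }

  // Charsets we can convert into, starting from the given source charset.
  std::vector<std::string> CharsetsConvertedFrom(const std::string& src) const {
    auto it = mDestsBySrc.find(src);
    if (it == mDestsBySrc.end()) return {};
    return std::vector<std::string>(it->second.begin(), it->second.end());
  }

 private:
  std::map<std::string, std::set<std::string>> mDestsBySrc;
};
```

A drop-in converter then only has to publish its description: once Converter(FooCharset => Unicode) is registered, FooCharset shows up in the lookups like any built-in charset.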
How to write and add a new Character Set Converter
XXX Further documentation to be added here once the API is frozen. Until
then, if you want to write a new converter, you can get almost all the
data you need from the source code! For the rest, please contact me; I'd
be more than happy to help and assist you.
Issues
1) Right now a charset is a string, a label. Should this be an interface
(ICharset)?
2) Right now the alias resolution service is done by the CharsetConverterManager.
Should this be in a different, independent service (CharsetManager)?