New Layout: Parsing Engine
Author: Rick Gessner
Last update: 1May98
The parser is the first stage in the sequence of systems that interact to let a browser display HTML documents. For NGLayout to be successful, the parser must be fast, extensible, and, above all, robust in its error handling.
The parsing engine in NGLayout has a modular design that permits the system to parse almost any kind of data (though the engine is, of course, optimized for HTML).
Conceptually speaking, a parsing "engine"
is used to transform a source document from one form into another. In the case
of HTML, the parser transforms the hierarchy of HTML tags (the source form)
into a form that the underlying layout and display engine requires (the target
form).
The parsing engine provides a set of components which serve in the transformation process as a document moves from source to target form. We refer to these objects as components because they are combined dynamically at runtime to achieve the transformation. By substituting a different set of components, you can perform alternate transformations.
Scanner Component
The first major component in the parsing engine is the Scanner. The Scanner provides an incremental "push-based" API that offers methods for accessing characters in the input stream (usually a URL), finding particular sequences, collating input data, and skipping over unwanted data. Our experience has shown that a fairly simple scanner can be used effectively to parse everything from HTML and XML to C++.
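To make the scanner's role concrete, here is a minimal sketch of a push-based scanning class along the lines described above. The class name, method names, and signatures are illustrative assumptions, not the actual NGLayout declarations.

    // Hypothetical sketch of a push-based scanner; names and signatures
    // are assumptions, not the actual NGLayout declarations.
    #include <cctype>
    #include <cstddef>
    #include <string>

    class CScannerSketch {
    public:
        // Fresh network data is pushed into the buffer as it arrives.
        void Append(const std::string& aChunk) { mBuffer += aChunk; }

        // Read the next character; returns false when the scanner is
        // blocked waiting for more input.
        bool GetChar(char& aChar) {
            if (mOffset >= mBuffer.size()) return false;
            aChar = mBuffer[mOffset++];
            return true;
        }

        // Collate characters into aResult until the terminator is seen.
        bool ReadUntil(std::string& aResult, char aTerminator) {
            char c;
            while (GetChar(c)) {
                if (c == aTerminator) return true;
                aResult += c;
            }
            return false;  // terminator not reached yet; more data needed
        }

        // Skip over unwanted characters such as whitespace.
        void SkipWhitespace() {
            while (mOffset < mBuffer.size() &&
                   std::isspace(static_cast<unsigned char>(mBuffer[mOffset])))
                ++mOffset;
        }

    private:
        std::string mBuffer;
        std::size_t mOffset = 0;
    };

Note how GetChar() simply reports failure when the buffer is exhausted; this is what lets the rest of the engine suspend itself while it waits for more network data.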
Parser Component
The second major element in the system is the parser component itself. The parser component controls and coordinates the activities of the other components in the system. This approach relies upon the fact that regardless of the form of the source document, the transformation process remains the same (as we'll explain later). While other components of the system are meant to be dynamically substituted according to the source document type, it is rarely necessary to alter the parser component.
The parser also drives tokenization. Tokenization refers to the process of collating atomic units (characters) in the input stream into higher-level structures called tokens. So, for example, the HTML tokenizer converts a raw input stream of characters into HTML tags. For maximum flexibility, the tokenizer makes no assumptions about the underlying grammar. Instead, the details of the actual grammar being parsed are left to the DTD object, which understands the constructs that comprise the grammar. The importance of this design decision is that it allows the engine to dynamically vary the language it is tokenizing without changing the tokenizer itself.
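As a rough illustration of this split, the sketch below shows a grammar-neutral token base class with an HTML-specific subclass. The class names echo the CToken and CHTMLToken classes mentioned later in this document, but the members and the token-type enumeration are assumptions made for the example.

    #include <string>

    // Grammar-neutral token base class with an HTML-specific subclass.
    // The names echo CToken/CHTMLToken from this document, but the
    // members shown here are illustrative assumptions.
    class CToken {
    public:
        explicit CToken(const std::string& aText) : mText(aText) {}
        virtual ~CToken() = default;
        virtual int GetTokenType() const = 0;  // meaning is grammar-specific
        const std::string& GetText() const { return mText; }
    protected:
        std::string mText;
    };

    enum eHTMLTokenTypes { eToken_start, eToken_end, eToken_text };

    class CHTMLToken : public CToken {
    public:
        CHTMLToken(const std::string& aText, eHTMLTokenTypes aType)
            : CToken(aText), mType(aType) {}
        int GetTokenType() const override { return mType; }
    private:
        eHTMLTokenTypes mType;
    };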
DTD Component
The final component in the parser engine is the DTD, which describes the rules for well-formed and/or valid documents in the target grammar. In HTML, the DTD declares and defines the tag set, the associated set of attributes, and the hierarchical (nesting) rules of the HTML tags. Once again, by separating the DTD component from the other components in the parser engine it becomes possible to use the same system to parse a much wider range of document types. Simply put, this means that the same parser can provide input to the browser, biased (via the DTD) to behave like Navigator, IE, or any other HTML browser. The same can be said for XML.
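A hedged sketch of what such a pluggable DTD interface could look like follows; the method names are assumptions suggested by the description above (tag set, attributes, nesting rules), not the real iDTD declaration.

    // Sketch of a pluggable DTD interface. The method names are
    // assumptions suggested by the prose above, not the actual iDTD.
    class CToken;          // token type (see the tokenization sketch)
    class nsIContentSink;  // sink type (see the next section)

    class IDTDSketch {
    public:
        virtual ~IDTDSketch() = default;

        // Is aChildTag allowed to nest directly inside aParentTag?
        virtual bool CanContain(int aParentTag, int aChildTag) const = 0;

        // Consume one token, validating it and driving the sink so that
        // the proper content model is constructed.
        virtual int HandleToken(CToken* aToken, nsIContentSink* aSink) = 0;
    };

Swapping in a different implementation of an interface like this is what biases the engine toward Navigator-style, IE-style, or other behavior without touching the rest of the system.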
Sink Component
Once the tokenization process is complete, the parse engine needs to emit its content (tokens). Since the parser doesn't know anything about the document model, the containing application must provide a "content sink". The sink is a simple API that accepts container, leaf, and text nodes, and constructs the underlying document model accordingly. The DTD interacts with the sink to cause the proper content model to be constructed based on the input set of tokens.
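The sink API described above might look something like this minimal sketch; the exact method set used by NGLayout will differ, so treat these signatures as assumptions.

    #include <string>

    // Sketch of a content sink that accepts container, leaf, and text
    // nodes. The exact NGLayout method set differs; these are assumptions.
    class nsIContentSinkSketch {
    public:
        virtual ~nsIContentSinkSketch() = default;
        virtual void OpenContainer(const std::string& aTag) = 0;   // e.g. <body>
        virtual void CloseContainer(const std::string& aTag) = 0;  // e.g. </body>
        virtual void AddLeaf(const std::string& aTag) = 0;         // e.g. <br>
        virtual void AddText(const std::string& aText) = 0;        // character data
    };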
While these objects may seem confusing at first, this simple diagram illustrates the runtime relationships between them:
<insert parser image here>
Phase 1 -- Object Construction
Parsing a document is a straightforward operation. The containing application initiates the parse by creating an nsIURL object, an nsTokenizer object, and an nsHTMLParse object. The parser is assigned a sink and a DTD (remember: the DTD understands the grammar of the document being parsed, while the sink interface allows the DTD to properly build a content model).
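As a rough, self-contained illustration of the construction step, the fragment below wires stand-in objects together in the same way. The structs are assumptions that merely stand in for the nsIURL, nsTokenizer, nsHTMLParse, sink, and DTD objects named above.

    #include <memory>
    #include <string>

    // Stand-ins (assumptions) for the nsIURL, nsTokenizer, nsHTMLParse,
    // sink, and DTD objects named in the text; the wiring is the point.
    struct URLSketch       { std::string spec; };
    struct TokenizerSketch {};
    struct DTDSketch       {};
    struct SinkSketch      {};

    struct ParserSketch {
        TokenizerSketch* tokenizer = nullptr;
        DTDSketch*       dtd       = nullptr;  // understands the grammar
        SinkSketch*      sink      = nullptr;  // builds the content model
    };

    int main() {
        // Phase 1: the containing application creates the URL, tokenizer
        // and parser, then assigns a sink and a DTD to the parser.
        URLSketch url{"http://example.org/index.html"};
        (void)url;  // the URL would be opened in phase 2
        auto tokenizer = std::make_unique<TokenizerSketch>();
        auto dtd       = std::make_unique<DTDSketch>();
        auto sink      = std::make_unique<SinkSketch>();

        ParserSketch parser;
        parser.tokenizer = tokenizer.get();
        parser.dtd       = dtd.get();
        parser.sink      = sink.get();
        return 0;
    }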
Phase 2 -- Opening an Input Stream
The parse process begins when the URL is opened and content is provided in the form of a network input stream. The stream is given to the scanner, which controls all access to it. The parse engine then instructs the tokenizer to initiate the tokenization phase. Tokenization is an incremental process, and it can be interrupted when the scanner is blocked awaiting network data.
Phase 3 -- Tokenization
The tokenizer controls and coordinates the
tokenization of the input stream into a collection of CTokens. (Different grammars
will have their own subclasses of CToken, as we've done to create CHTMLToken,
as well as their own iDTD). As the tokenizer runs, it repeatedly calls the method
GetToken(). This continues until EOF occurs on the input stream, or an
unrecoverable error occurs.
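The loop below sketches this phase under stated assumptions: the result codes and the three-token stand-in tokenizer are invented for the example, but the shape of the loop (call GetToken() until EOF or an unrecoverable error) follows the description above.

    #include <vector>

    // Sketch of the phase-3 tokenization loop; result codes and the
    // stand-in tokenizer are assumptions made for the example.
    enum Result { kOK, kEOF, kUnrecoverableError };

    struct Token {};  // stand-in for CToken

    struct Tokenizer {
        int remaining = 3;  // pretend the stream holds three tokens
        Result GetToken(Token*& aToken) {
            if (remaining == 0) return kEOF;  // end of the input stream
            --remaining;
            aToken = new Token();
            return kOK;
        }
    };

    int main() {
        Tokenizer tokenizer;
        std::vector<Token*> tokens;
        Token* token = nullptr;
        Result result;
        // Repeatedly fetch tokens until EOF or an unrecoverable error.
        while ((result = tokenizer.GetToken(token)) == kOK)
            tokens.push_back(token);
        for (Token* t : tokens) delete t;  // cleaned up in phase 5
        return 0;
    }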
Phase 4 -- Token Iteration/Document Construction
After the tokenization phase completes, the parser enters the token iteration phase, which validates the document and causes a content model to be constructed. Token iteration proceeds until an unrecoverable error occurs, or the parser has visited each token. The tokens are collected into related groups of information according to the rules provided by the nsDTD class. The DTD controls the order in which tokens can appear in relation to each other. At well-defined times during this process, the parser notifies the content sink about the parse context, instructing the sink to construct the document according to the state of the parser.
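A condensed sketch of this hand-off follows. The particular HandleToken/OpenContainer split is an assumption made for the example; the point is simply that the DTD applies the grammar rules while the sink builds the content model.

    #include <iostream>
    #include <string>
    #include <vector>

    // Condensed sketch of phase-4 iteration. The division of labour shown
    // here is an assumption meant only to mirror the prose above.
    struct Token { std::string text; };

    struct Sink {  // builds the content model
        void OpenContainer(const std::string& aTag)  { std::cout << "open "  << aTag << "\n"; }
        void CloseContainer(const std::string& aTag) { std::cout << "close " << aTag << "\n"; }
    };

    struct DTD {   // knows the grammar (nesting) rules
        void HandleToken(const Token& aToken, Sink& aSink) {
            if (!aToken.text.empty() && aToken.text[0] == '/')
                aSink.CloseContainer(aToken.text.substr(1));  // end tag
            else
                aSink.OpenContainer(aToken.text);             // start tag
        }
    };

    int main() {
        std::vector<Token> tokens = { {"html"}, {"body"}, {"/body"}, {"/html"} };
        Sink sink;
        DTD dtd;
        // The parser visits every token, letting the DTD drive the sink.
        for (const Token& t : tokens) dtd.HandleToken(t, sink);
        return 0;
    }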
Phase 5 -- Object Destruction
Once tokenization and iteration have concluded,
the objects in the parse system are destroyed to conserve memory.
In addition to parsing of documents and dynamic DTD support, the parse engine also offers support for data I/O observers. The intention of these interfaces is to allow secure objects to hook into the I/O system at runtime, transforming the underlying stream before the parser sees it. This can be useful in cases where preprocessing needs to occur, or where transforms from foreign document types into HTML should occur.
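One way such an observer hook could be shaped is sketched below. Since this document does not spell out the observer API, both the interface and the example transform are assumptions.

    #include <algorithm>
    #include <cctype>
    #include <string>

    // Sketch of a stream-observer hook that transforms data before the
    // parser sees it; the interface itself is an assumption.
    class IStreamObserverSketch {
    public:
        virtual ~IStreamObserverSketch() = default;
        // Given a raw chunk from the network, return the (possibly
        // transformed) chunk that should actually reach the scanner.
        virtual std::string OnStreamData(const std::string& aRawChunk) = 0;
    };

    // Example observer: upper-cases the data, standing in for a real
    // transform such as converting a foreign document type into HTML.
    class UppercaseObserver : public IStreamObserverSketch {
    public:
        std::string OnStreamData(const std::string& aRawChunk) override {
            std::string out = aRawChunk;
            std::transform(out.begin(), out.end(), out.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::toupper(c));
                           });
            return out;
        }
    };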
The parse engine is dependent upon the following classes/systems:
- nsString
- nsCore.h (and prtypes.h)
- The XP_COM system
- Netlib (for URLs and input streams)
The next major improvements in the parser will focus on the following areas:
- Support for well-formed and/or valid XML documents.
- Support for document "processors" such as XSL and others.
- Backward compatibility -- HTML DTD improvements.
- Performance tuning.
At this time, the DTDs are still a work in progress (WIP). They are expected to improve incrementally over the next few months.