New Layout: Parsing Engine
Author: Rick Gessner
Last update: 1May98
The parser is the first stage in the sequence of systems that interact to let a browser display HTML documents. For NGLayout to be successful, the parser must be fast, extensible, and, above all, robust in its error handling.
The parsing engine in NGLayout has a modular design that permits the system to parse almost any kind of data (though the engine is, of course, optimized for HTML).
Conceptually speaking, a parsing "engine"
is used to transform a source document from one form into another. In the case
of HTML, the parser transforms the hierarchy of HTML tags (the source form)
into a form that the underlying layout and display engine requires (the target
form).
The parsing engine provides a set of components which serve in the transformation process as a document moves from source to target form. We refer to these objects as components because they are combined dynamically at runtime to achieve the transformation. By substituting a different set of components, you can perform alternate transformations.
Scanner Component
The first major component in the parsing engine is the Scanner. The Scanner provides an incremental "push-based" API that offers methods for accessing characters in the input stream (usually a URL), finding particular sequences, collating input data, and skipping over unwanted data. Our experience has shown that a fairly simple scanner can be used effectively to parse everything from HTML and XML to C++.
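To make the scanner's role concrete, here is a minimal sketch of a push-based scanning class along the lines described above. The class name, method names, and signatures are illustrative assumptions, not the actual NGLayout declarations.

    // Hypothetical sketch of a push-based scanner; names and signatures
    // are assumptions, not the actual NGLayout declarations.
    #include <cctype>
    #include <cstddef>
    #include <string>

    class CScannerSketch {
    public:
        // Fresh network data is pushed into the buffer as it arrives.
        void Append(const std::string& aChunk) { mBuffer += aChunk; }

        // Read the next character; returns false when the scanner is
        // blocked waiting for more input.
        bool GetChar(char& aChar) {
            if (mOffset >= mBuffer.size()) return false;
            aChar = mBuffer[mOffset++];
            return true;
        }

        // Collate characters into aResult until the terminator is seen.
        bool ReadUntil(std::string& aResult, char aTerminator) {
            char c;
            while (GetChar(c)) {
                if (c == aTerminator) return true;
                aResult += c;
            }
            return false;  // terminator not reached yet; more data needed
        }

        // Skip over unwanted characters such as whitespace.
        void SkipWhitespace() {
            while (mOffset < mBuffer.size() &&
                   std::isspace(static_cast<unsigned char>(mBuffer[mOffset])))
                ++mOffset;
        }

    private:
        std::string mBuffer;
        std::size_t mOffset = 0;
    };

Note how GetChar() simply reports failure when the buffer is exhausted; this is what lets the rest of the engine suspend itself while it waits for more network data.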
Parser Component
The second major element in the system is the parser component itself. The parser component controls and coordinates the activities of the other components in the system. This approach relies upon the fact that regardless of the form of the source document, the transformation process remains the same (as we'll explain later). While other components of the system are meant to be dynamically substituted according to the source document type, it is rarely necessary to alter the parser component.
The parser also drives tokenization. Tokenization refers to the process of collating atomic units (characters) in the input stream into higher-level structures called tokens. So, for example, the HTML tokenizer converts a raw input stream of characters into HTML tags. For maximum flexibility, the tokenizer makes no assumptions about the underlying grammar. Instead, the details of the actual grammar being parsed are left to the DTD object, which understands the constructs that comprise the grammar. The importance of this design decision is that it allows the engine to dynamically vary the language it is tokenizing without changing the tokenizer itself.
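As a rough illustration of this split, the sketch below shows a grammar-neutral token base class with an HTML-specific subclass. The class names echo the CToken and CHTMLToken classes mentioned later in this document, but the members and the token-type enumeration are assumptions made for the example.

    #include <string>

    // Grammar-neutral token base class with an HTML-specific subclass.
    // The names echo CToken/CHTMLToken from this document, but the
    // members shown here are illustrative assumptions.
    class CToken {
    public:
        explicit CToken(const std::string& aText) : mText(aText) {}
        virtual ~CToken() = default;
        virtual int GetTokenType() const = 0;  // meaning is grammar-specific
        const std::string& GetText() const { return mText; }
    protected:
        std::string mText;
    };

    enum eHTMLTokenTypes { eToken_start, eToken_end, eToken_text };

    class CHTMLToken : public CToken {
    public:
        CHTMLToken(const std::string& aText, eHTMLTokenTypes aType)
            : CToken(aText), mType(aType) {}
        int GetTokenType() const override { return mType; }
    private:
        eHTMLTokenTypes mType;
    };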
DTD Component
The final component in the parser engine is the DTD, which describes the rules for well-formed and/or valid documents in the target grammar. In HTML, the DTD declares and defines the tag set, the associated set of attributes, and the hierarchical (nesting) rules of the HTML tags. Once again, by separating the DTD component from the other components in the parser engine it becomes possible to use the same system to parse a much wider range of document types. Simply put, this means that the same parser can provide input to the browser, biased (via the DTD) to behave like Navigator, IE, or any other HTML browser. The same can be said for XML.
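A hedged sketch of what such a pluggable DTD interface could look like follows; the method names are assumptions suggested by the description above (tag set, attributes, nesting rules), not the real iDTD declaration.

    // Sketch of a pluggable DTD interface. The method names are
    // assumptions suggested by the prose above, not the actual iDTD.
    class CToken;          // token type (see the tokenization sketch)
    class nsIContentSink;  // sink type (see the next section)

    class IDTDSketch {
    public:
        virtual ~IDTDSketch() = default;

        // Is aChildTag allowed to nest directly inside aParentTag?
        virtual bool CanContain(int aParentTag, int aChildTag) const = 0;

        // Consume one token, validating it and driving the sink so that
        // the proper content model is constructed.
        virtual int HandleToken(CToken* aToken, nsIContentSink* aSink) = 0;
    };

Swapping in a different implementation of an interface like this is what biases the engine toward Navigator-style, IE-style, or other behavior without touching the rest of the system.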
Sink Component
Once the tokenization process is complete, the parse engine needs to emit its content (tokens). Since the parser doesn't know anything about the document model, the containing application must provide a "content sink". The sink is a simple API that accepts container, leaf, and text nodes, and constructs the underlying document model accordingly. The DTD interacts with the sink to cause the proper content model to be constructed based on the input set of tokens.
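The sink API described above might look something like this minimal sketch; the exact method set used by NGLayout will differ, so treat these signatures as assumptions.

    #include <string>

    // Sketch of a content sink that accepts container, leaf, and text
    // nodes. The exact NGLayout method set differs; these are assumptions.
    class nsIContentSinkSketch {
    public:
        virtual ~nsIContentSinkSketch() = default;
        virtual void OpenContainer(const std::string& aTag) = 0;   // e.g. <body>
        virtual void CloseContainer(const std::string& aTag) = 0;  // e.g. </body>
        virtual void AddLeaf(const std::string& aTag) = 0;         // e.g. <br>
        virtual void AddText(const std::string& aText) = 0;        // character data
    };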
While these objects may seem confusing at first, this simple diagram illustrates the runtime relationships between them:
<insert parser image here>
Phase 1 -- Object Construction
Parsing a document is a straightforward operation. The containing application initiates the parse by creating an nsIURL object, an nsTokenizer object, and an nsHTMLParse object. The parser is assigned a sink and a DTD (remember: the DTD understands the grammar of the document being parsed, while the sink interface allows the DTD to properly build a content model).
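As a rough, self-contained illustration of the construction step, the fragment below wires stand-in objects together in the same way. The structs are assumptions that merely stand in for the nsIURL, nsTokenizer, nsHTMLParse, sink, and DTD objects named above.

    #include <memory>
    #include <string>

    // Stand-ins (assumptions) for the nsIURL, nsTokenizer, nsHTMLParse,
    // sink, and DTD objects named in the text; the wiring is the point.
    struct URLSketch       { std::string spec; };
    struct TokenizerSketch {};
    struct DTDSketch       {};
    struct SinkSketch      {};

    struct ParserSketch {
        TokenizerSketch* tokenizer = nullptr;
        DTDSketch*       dtd       = nullptr;  // understands the grammar
        SinkSketch*      sink      = nullptr;  // builds the content model
    };

    int main() {
        // Phase 1: the containing application creates the URL, tokenizer
        // and parser, then assigns a sink and a DTD to the parser.
        URLSketch url{"http://example.org/index.html"};
        (void)url;  // the URL would be opened in phase 2
        auto tokenizer = std::make_unique<TokenizerSketch>();
        auto dtd       = std::make_unique<DTDSketch>();
        auto sink      = std::make_unique<SinkSketch>();

        ParserSketch parser;
        parser.tokenizer = tokenizer.get();
        parser.dtd       = dtd.get();
        parser.sink      = sink.get();
        return 0;
    }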
Phase 2 -- Opening an Input Stream
The parse process begins when the URL is opened and content is provided in the form of a network input stream. The stream is given to the scanner, which controls all access to it. The parse engine then instructs the tokenizer to initiate the tokenization phase. Tokenization is an incremental process, and it can be interrupted when the scanner is blocked awaiting network data.
Phase 3 -- Tokenization
The tokenizer controls and coordinates the
tokenization of the input stream into a collection of CTokens. (Different grammars
will have their own subclasses of CToken, as we've done to create CHTMLToken,
as well as their own iDTD). As the tokenizer runs, it repeatedly calls the method
GetToken(). This continues until EOF occurs on the input stream, or an
unrecoverable error occurs.
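The loop below sketches this phase under stated assumptions: the result codes and the three-token stand-in tokenizer are invented for the example, but the shape of the loop (call GetToken() until EOF or an unrecoverable error) follows the description above.

    #include <vector>

    // Sketch of the phase-3 tokenization loop; result codes and the
    // stand-in tokenizer are assumptions made for the example.
    enum Result { kOK, kEOF, kUnrecoverableError };

    struct Token {};  // stand-in for CToken

    struct Tokenizer {
        int remaining = 3;  // pretend the stream holds three tokens
        Result GetToken(Token*& aToken) {
            if (remaining == 0) return kEOF;  // end of the input stream
            --remaining;
            aToken = new Token();
            return kOK;
        }
    };

    int main() {
        Tokenizer tokenizer;
        std::vector<Token*> tokens;
        Token* token = nullptr;
        Result result;
        // Repeatedly fetch tokens until EOF or an unrecoverable error.
        while ((result = tokenizer.GetToken(token)) == kOK)
            tokens.push_back(token);
        for (Token* t : tokens) delete t;  // cleaned up in phase 5
        return 0;
    }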
Phase 4 -- Token Iteration/Document Construction
After the tokenization phase completes, the parser enters the token iteration phase, which validates the document and causes a content model to be constructed. Token iteration proceeds until an unrecoverable error occurs, or the parser has visited each token. The tokens are collected into related groups of information according to the rules provided by the nsDTD class. The DTD controls the order in which tokens can appear in relation to each other. At well-defined times during this process, the parser notifies the content sink about the parse context, instructing the sink to construct the document according to the state of the parser.
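A condensed sketch of this hand-off follows. The particular HandleToken/OpenContainer split is an assumption made for the example; the point is simply that the DTD applies the grammar rules while the sink builds the content model.

    #include <iostream>
    #include <string>
    #include <vector>

    // Condensed sketch of phase-4 iteration. The division of labour shown
    // here is an assumption meant only to mirror the prose above.
    struct Token { std::string text; };

    struct Sink {  // builds the content model
        void OpenContainer(const std::string& aTag)  { std::cout << "open "  << aTag << "\n"; }
        void CloseContainer(const std::string& aTag) { std::cout << "close " << aTag << "\n"; }
    };

    struct DTD {   // knows the grammar (nesting) rules
        void HandleToken(const Token& aToken, Sink& aSink) {
            if (!aToken.text.empty() && aToken.text[0] == '/')
                aSink.CloseContainer(aToken.text.substr(1));  // end tag
            else
                aSink.OpenContainer(aToken.text);             // start tag
        }
    };

    int main() {
        std::vector<Token> tokens = { {"html"}, {"body"}, {"/body"}, {"/html"} };
        Sink sink;
        DTD dtd;
        // The parser visits every token, letting the DTD drive the sink.
        for (const Token& t : tokens) dtd.HandleToken(t, sink);
        return 0;
    }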
Phase 5 -- Object Destruction
Once tokenization and iteration have concluded,
the objects in the parse system are destroyed to conserve memory.
In addition to parsing of documents and dynamic DTD support, the parse engine also offers support for data I/O observers. The intention of these interfaces is to allow secure objects to hook into the I/O system at runtime, transforming the underlying stream before the parser sees it. This can be useful in cases where preprocessing needs to occur, or where transforms from foreign document types into HTML should occur.
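One way such an observer hook could be shaped is sketched below. Since this document does not spell out the observer API, both the interface and the example transform are assumptions.

    #include <algorithm>
    #include <cctype>
    #include <string>

    // Sketch of a stream-observer hook that transforms data before the
    // parser sees it; the interface itself is an assumption.
    class IStreamObserverSketch {
    public:
        virtual ~IStreamObserverSketch() = default;
        // Given a raw chunk from the network, return the (possibly
        // transformed) chunk that should actually reach the scanner.
        virtual std::string OnStreamData(const std::string& aRawChunk) = 0;
    };

    // Example observer: upper-cases the data, standing in for a real
    // transform such as converting a foreign document type into HTML.
    class UppercaseObserver : public IStreamObserverSketch {
    public:
        std::string OnStreamData(const std::string& aRawChunk) override {
            std::string out = aRawChunk;
            std::transform(out.begin(), out.end(), out.begin(),
                           [](unsigned char c) {
                               return static_cast<char>(std::toupper(c));
                           });
            return out;
        }
    };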
The parse engine is dependent upon the following classes/systems:
- nsString
- nsCore.h (and prtypes.h)
- The XP_COM system
- Netlib (for URLs and input streams)
The next major improvements in the parser will focus on the following areas:
- Support for well-formed and/or valid XML documents.
- Support for document "processors" such as XSL and others.
- Backward compatibility -- HTML DTD improvements.
- Performance tuning.
At this time, the DTDs are still a work in progress (WIP). They are expected to improve incrementally over the next few months.