The RDF data model is a directed, labeled graph; hence, the architecture for using and extending RDF within Mozilla is primarily based around manipulating, composing, and extending graphs. This document is an overview of the Mozilla implementation of the RDF core engine, and will hopefully give you insight into how to use the RDF system as a client, as well as how to implement your own pluggable data sources.

Overview

Datatypes and Interfaces

How It Fits Together

Manipulating the RDF Back End

Querying the RDF Back End

Pluggable Data Sources

Disclaimer. This is a work in progress, placed up here under the assumption that something is better than nothing. Use with caution!

Overview

This section will give a brief but utilitarian overview of the RDF data model, primarily to establish some working terminology and relate these to their implementation in Mozilla. Beware that this is the "poor man's version" of the RDF data model: For a complete description of the RDF data model, please see the RDF Model and Syntax Specification and the RDF Schema Specification which are soon-to-be W3C standards.

Like any graph, the graph upon which RDF is based consists of nodes connected by arcs. An individual node in the RDF graph can either be an Internet resource that is identified by a Uniform Resource Identifier (URI), or a simple string value. An individual arc in an RDF graph is called a property: each property is itself a URI that is taken from a restricted set of URIs called a vocabulary. This is probably most easily illustrated with an example.

The following is a simple XML-serialized version of an RDF graph:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
  xmlns:sm="http://www.mozilla.org/smart-mail/schema#">
 <rdf:Description
   about="http://www.mozilla.org/smart-mail/get-mail.cgi?\
user=waterson&folder=inbox">
   <sm:message resource="#402"
      sm:recipient="Chris Waterson \"waterson@netscape.com\""
      sm:sender="Aunt Mabel \"mabel@netcenter.net\""
      sm:subject="Great new recipe for Yam Soup">
    <sm:body>
     http://www.mozilla.org/smart-mail/get-body.cgi?id=402
    </sm:body>
   </sm:message>
 </rdf:Description>
</rdf:RDF>

The resulting graph looks something like this (forgive the ASCII art):

(http://www.mozilla.org/smart-mail/get-mail.cgi?\
 user=waterson&folder=inbox)
 |
 |
[http://www.mozilla.org/smart-mail/schema#message]
 |
 v
(http://www.mozilla.org/smart-mail/get-mail.cgi?\
 user=waterson&folder=inbox#402)
 |
 |
 +--[http://www.mozilla.org/smart-mail/schema#recipient]--+
 |                                                        |
 |                                                        v
 |                    ("Chris Waterson \"waterson@netscape.com\"")
 |
 +--[http://www.mozilla.org/smart-mail/schema#sender]--+
 |                                                     |
 |                                                     v
 |                          ("Aunt Mabel \"mabel@netcenter.net\"")
 |
 +--[http://www.mozilla.org/smart-mail/schema#subject]--+
 |                                                      |
 |                                                      v
 |                               ("Great new recipe for Yam Soup")
 |
 +--[http://www.mozilla.org/smart-mail/schema#body]--+
                                                     |
                                                     v
           (http://www.mozilla.org/smart-mail/get-body.cgi?id=402)

In the above diagram, nodes are enclosed in (parentheses), and properties are enclosed in [square brackets]. Nodes that contain simple string values are shown with double-quotes surrounding their values; e.g., ("Great new recipe for Yam Soup"). Nodes that are Internet resources show their URI values; e.g., (http://.../id=402).

There are a couple of important takeaways here:

Properties are URIs. I know I said that above, but it's easy to forget (and actually, the old Mozilla implementation of the RDF parser did forget this fact).
Properties come from a vocabulary, which is defined by a schema in an XML namespace. You can see this from the xmlns:sm="..." attribute. This allows different applications to agree on the semantics of a property; e.g., we can all agree that http://www.mozilla.org/smart-mail/schema#body refers to the body of a SmartMail message (whatever that may be).
The graph allows arbitrary properties to be ascribed to URIs. These properties may just be string values, or they may actually be a relationship to another URI. Presumably, in the above example, the body refers to a CGI script that could actually fetch the body of the message for us.
There is a difference between the graph and the XML serialization of the graph. That this is important will become apparent in the following sections.
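To make the graph-versus-serialization distinction concrete, here is a minimal sketch that holds the example graph above as a flat list of (source, property, target) arcs, with no XML in sight. The Triple struct and the BuildExampleGraph and TargetsOf helpers are hypothetical names invented for this illustration; they are not part of the Mozilla RDF API.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration only: one arc of the RDF graph.
struct Triple {
    std::string source;    // URI of the node the arc leads out of
    std::string property;  // URI that labels the arc
    std::string target;    // URI or string value the arc leads in to
    bool targetIsLiteral;  // true if target is a simple string value
};

// Builds the five arcs shown in the diagram above.
std::vector<Triple> BuildExampleGraph() {
    const std::string sm = "http://www.mozilla.org/smart-mail/schema#";
    const std::string inbox =
        "http://www.mozilla.org/smart-mail/get-mail.cgi?user=waterson&folder=inbox";
    const std::string msg = inbox + "#402";

    std::vector<Triple> g;
    g.push_back(Triple{inbox, sm + "message", msg, false});
    g.push_back(Triple{msg, sm + "recipient",
                       "Chris Waterson \"waterson@netscape.com\"", true});
    g.push_back(Triple{msg, sm + "sender",
                       "Aunt Mabel \"mabel@netcenter.net\"", true});
    g.push_back(Triple{msg, sm + "subject",
                       "Great new recipe for Yam Soup", true});
    g.push_back(Triple{msg, sm + "body",
                       "http://www.mozilla.org/smart-mail/get-body.cgi?id=402",
                       false});
    return g;
}

// Returns every target reached from `source` along arcs labeled `property`.
std::vector<std::string> TargetsOf(const std::vector<Triple>& graph,
                                   const std::string& source,
                                   const std::string& property) {
    std::vector<std::string> out;
    for (const Triple& t : graph) {
        if (t.source == source && t.property == property)
            out.push_back(t.target);
    }
    return out;
}
```

Note that the property is compared as a full URI, not as the sm:subject shorthand: the namespace prefix belongs to the XML serialization, not to the graph.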

Datatypes and Interfaces

Okay, so where does the rubber meet the road? In Mozilla, there are several data structures that work together to present an RDF graph as a tenable data structure to the outside world. Here's an overview.

nsIRDFNode. This is an interface for a node in the RDF graph. One interface is used for both resource and literal nodes: nodes simply contain string values. You generally acquire a node from a resource manager using a string value.

nsIRDFResourceManager. A resource manager is responsible for keeping track of nsIRDFNode objects. It allows you to ask for a node using a string value, and must guarantee that the same node will always be returned for strings of equal value (to say this more precisely, if strcmp(s1, s2) == 0, then it must be the case that n1 == n2, where n1 and n2 are nsIRDFNode objects acquired from an nsIRDFResourceManager).
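The uniqueness guarantee is essentially string interning. Here is a minimal sketch of the idea, assuming a simple map from string value to node; the Node and ResourceManager classes below are hypothetical stand-ins and say nothing about how the real nsIRDFResourceManager is implemented.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for an RDF node: it simply wraps a string value.
struct Node {
    explicit Node(const std::string& v) : value(v) {}
    std::string value;
};

// Hypothetical stand-in for the resource manager.
class ResourceManager {
public:
    // Returns the one node for `value`, creating it on the first request.
    // Equal strings always yield the same pointer.
    Node* GetNode(const std::string& value) {
        std::unique_ptr<Node>& slot = mNodes[value];
        if (!slot)
            slot.reset(new Node(value));
        return slot.get();
    }

private:
    // Owns every node handed out so far, keyed by string value.
    std::map<std::string, std::unique_ptr<Node>> mNodes;
};
```

The payoff is that clients can compare nodes by pointer rather than by string comparison.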

nsIRDFDataSource. This interface is implemented by a data source provider. A data source (like POP) collaborates in the creation of an "illusion of a graph" by implementing this interface. The interface includes methods that allow testing for the presence of an assertion, enumerating all of the properties (arcs) that are associated with (lead out of or into) a resource, and so on.

Note that the nsIRDFDataSource interface is not directly exposed to the clients of the RDF subsystem: rather, individual data sources are aggregated together and presented via the nsIRDFDataBase interface, described below.

nsIRDFDataBase. This interface is used directly by clients of RDF to manipulate and query the RDF graph. Specifically, it is a strategy for composing the subgraphs that are presented by individual nsIRDFDataSource objects. For example, a "bookmarks" RDF database may consist of a local data source of personal bookmarks, a collection of sports bookmarks provided remotely by ESPN, and some standard, company-wide bookmarks that are loaded from your company's intranet server.

Note that the nsIRDFDataBase interface is derived from the nsIRDFDataSource interface: this means that an nsIRDFDataBase can be queried just like an individual data source.
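The composition idea can be sketched in a few lines. This is a simplified model, assuming nodes are plain strings and that a database answers a query by asking each of its data sources in order; the DataSource, SimpleSource, and DataBase classes are hypothetical and far smaller than the real interfaces.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, stripped-down data source: it can only test for an assertion.
class DataSource {
public:
    virtual ~DataSource() {}
    virtual bool HasAssertion(const std::string& source,
                              const std::string& property,
                              const std::string& target) const = 0;
};

// A trivial source that stores its assertions in a set.
class SimpleSource : public DataSource {
public:
    void Assert(const std::string& s, const std::string& p,
                const std::string& t) {
        mFacts.insert(std::make_tuple(s, p, t));
    }
    bool HasAssertion(const std::string& s, const std::string& p,
                      const std::string& t) const override {
        return mFacts.count(std::make_tuple(s, p, t)) != 0;
    }

private:
    std::set<std::tuple<std::string, std::string, std::string>> mFacts;
};

// The database is itself a data source: it delegates to its members in order.
class DataBase : public DataSource {
public:
    void AddDataSource(const DataSource* src) { mSources.push_back(src); }
    bool HasAssertion(const std::string& s, const std::string& p,
                      const std::string& t) const override {
        for (const DataSource* src : mSources) {
            if (src->HasAssertion(s, p, t))
                return true;
        }
        return false;
    }

private:
    std::vector<const DataSource*> mSources;  // not owned
};
```

Because DataBase derives from DataSource, a database could even be added to another database as one of its sources.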

nsIRDFCursor. This is an iterator interface that is used to maintain result sets from certain nsIRDFDataSource and nsIRDFDataBase methods. Individual data sources may choose to implement nsIRDFCursor as an "eager" iterator that collects the results of a query "up front", or as a "lazy" iterator that evaluates the iteration as needed.
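Here is a rough sketch of the eager-versus-lazy distinction. The Cursor base class and both implementations are hypothetical simplifications that iterate over strings; the real nsIRDFCursor returns nodes and truth values through out parameters.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified cursor interface.
class Cursor {
public:
    virtual ~Cursor() {}
    virtual bool HasMoreElements() = 0;
    virtual std::string GetNext() = 0;
};

// Eager: the whole result set is collected before iteration starts.
class EagerCursor : public Cursor {
public:
    explicit EagerCursor(const std::vector<std::string>& results)
        : mResults(results), mIndex(0) {}
    bool HasMoreElements() override { return mIndex < mResults.size(); }
    // Callers are expected to check HasMoreElements() first.
    std::string GetNext() override { return mResults[mIndex++]; }

private:
    std::vector<std::string> mResults;
    std::size_t mIndex;
};

// Lazy: each result is produced only when the client asks for it.
class LazyCursor : public Cursor {
public:
    // `produce` fills its argument and returns true, or returns false
    // once there is nothing left.
    explicit LazyCursor(std::function<bool(std::string&)> produce)
        : mProduce(produce), mHasPending(false) {}
    bool HasMoreElements() override {
        if (!mHasPending)
            mHasPending = mProduce(mPending);
        return mHasPending;
    }
    std::string GetNext() override {
        if (!HasMoreElements())
            return std::string();  // exhausted
        mHasPending = false;
        return mPending;
    }

private:
    std::function<bool(std::string&)> mProduce;
    std::string mPending;
    bool mHasPending;
};

// Client code is identical for both kinds of cursor.
std::vector<std::string> Drain(Cursor& cur) {
    std::vector<std::string> out;
    while (cur.HasMoreElements())
        out.push_back(cur.GetNext());
    return out;
}
```

The lazy form suits data sources where fetching each result is expensive (a network round trip, say), since a client that stops early never pays for the rest.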

nsIRDFObserver. This is an interface that an RDF client implements. The interface allows a client to be notified when changes occur in the RDF graph.
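The notification pattern might look something like the sketch below. Everything here is an assumption made for illustration: the method name OnAssert, the string-based arguments, and the ObservableGraph class are hypothetical, since this document does not spell out the methods of nsIRDFObserver.

```cpp
#include <string>
#include <vector>

// Hypothetical observer interface: called whenever an assertion is made.
class Observer {
public:
    virtual ~Observer() {}
    virtual void OnAssert(const std::string& source,
                          const std::string& property,
                          const std::string& target) = 0;
};

// Hypothetical graph that tells its observers about each new assertion.
class ObservableGraph {
public:
    void AddObserver(Observer* obs) { mObservers.push_back(obs); }
    void Assert(const std::string& s, const std::string& p,
                const std::string& t) {
        // (Storing the assertion is omitted; only notification is shown.)
        for (Observer* obs : mObservers)
            obs->OnAssert(s, p, t);
    }

private:
    std::vector<Observer*> mObservers;  // not owned
};

// A client-side observer that just records what it has seen.
class CountingObserver : public Observer {
public:
    CountingObserver() : count(0) {}
    void OnAssert(const std::string&, const std::string&,
                  const std::string& target) override {
        ++count;
        lastTarget = target;
    }
    int count;
    std::string lastTarget;
};
```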

How It Fits Together

Here's the 50,000-foot view of how it all hangs together. A client locates a particular RDF database using the XPCOM factory system: the result is an object that implements the nsIRDFDataBase interface. The implementor of this interface is responsible for correctly initializing and ordering the objects that implement the nsIRDFDataSource interface.

For example, the bookmarks "database" might consist of a local data source that contains personal bookmarks, a remote data source that contains general company-wide bookmarks, and another remote data source that collects the top stories from ESPN's web site.

Each individual data source "knows" how to talk with the real implementation to get things done. The data sources that we've implemented to date include*:

Appletalk
Bookmarks
Browser History
Cookies
FTP
LDAP
POP mail folders
Sitemaps
XML-serialized RDF

*Disclaimer: the data sources above have all been implemented using a C-based API, and are still undergoing surgery to bring them up to speed with the XPCOM interfaces described in this document.

"Whoa!", I heard you just say. "XML-serialized RDF is just a data source?" Yep. XML-serialized RDF (as described in the RDF Model and Syntax Specification) is dealt with using "just another" data source that knows how to translate XML-ized RDF into a graph model.

Here are some of the data sources that we'd still like to implement:

Local address book
IMAP mail folders
NNTP newsgroups and articles
Search engine interfaces

If you'd like to help work on these, or have other ideas, let us know.

Manipulating the RDF Back End

You should usually manipulate the RDF back end using the nsIRDFDataBase interface. This allows the database to handle delegation of requests to the appropriate data source.

Manipulating RDF involves the following steps:

1. Acquire a database to manipulate
2. Acquire the resource manager
3. Acquire RDF resource nodes
4. Use the RDF nodes to make assertions in the database
5. Clean up

The code below illustrates this process.

Acquire a database to manipulate. To acquire an nsIRDFDataBase, use the XPCOM repository:

#include "nsRDFCID.h"
#include "nsRDFDataBase.h"
#include "nsRepository.h"

static NS_DEFINE_IID(kIRDFDataBaseIID,
                     NS_IRDFDATABASE_IID);

static NS_DEFINE_CID(kSimpleDataBaseCID,
                     NS_SIMPLEDATABASE_CID);

nsIRDFDataBase* db;
nsresult rv =
    nsRepository::CreateInstance(
        kSimpleDataBaseCID,
        NULL,
        kIRDFDataBaseIID,
        (void**) &db);

if (NS_FAILED(rv)) {
    // deal with error
}

// db now holds a valid RDF database.

Acquire the resource manager. To use the nsIRDFDataSource methods on db in the above example, you'll need to first acquire nsIRDFNode objects using the resource manager, which can be acquired from the XPCOM service manager, as illustrated below:

#include "nsServiceManager.h"
#include "nsIRDFResourceManager.h"

static NS_DEFINE_IID(kIRDFResourceManagerIID,
                     NS_IRDFRESOURCEMANAGER_IID);

static NS_DEFINE_CID(kRDFResourceManagerCID,
                     NS_RDFRESOURCEMANAGER_CID);

nsIRDFResourceManager* mgr;
nsresult rv =
    nsServiceManager::GetService(
        kRDFResourceManagerCID,
        kIRDFResourceManagerIID,
        (nsISupports**) &mgr);

if (NS_FAILED(rv)) {
    // deal with error
}

// mgr now holds the RDF resource manager

Acquire RDF resource nodes. Using the resource manager, you can acquire individual RDF resources and literals as nsIRDFNode objects. These are what you use to perform a query on the RDF database:

#include "nsIRDFNode.h"

// Get the RDF resources necessary to assert that I am
// the author of this page, using the Dublin Core schema
nsresult rv;
nsIRDFNode* page;
rv = mgr->GetNode("back-end-architecture.html", &page);
if (NS_FAILED(rv)) {
    // deal with error
}

nsIRDFNode* author;
rv = mgr->GetNode("http://purl.oclc.org/dc#Author", &author);
if (NS_FAILED(rv)) {
    // deal with error
}

nsIRDFNode* me;
rv = mgr->GetNode("Chris Waterson", &me);
if (NS_FAILED(rv)) {
    // deal with error
}

Use the RDF nodes to make assertions in the database. And finally, we "do the deed" using the Assert method of the nsIRDFDataBase interface (actually, this method is inherited from the nsIRDFDataSource interface, for those who are groveling through source code):

// Finally, make the assertion:
if (NS_FAILED(db->Assert(page, author, me))) {
    // Uh oh, an error!
}

Clean up. Of course, you'll need to follow good XPCOM programming style and release all the resources that you've acquired once you're finished with them.
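In real XPCOM code, cleaning up means calling Release on each interface pointer you hold. The sketch below models that discipline with a hypothetical RefCounted class and a hypothetical ReleaseAndNull helper, so that it compiles on its own; neither is a Mozilla type, and the live-object counter exists only to make the effect visible.

```cpp
// Hypothetical stand-in for a reference-counted XPCOM object.
class RefCounted {
public:
    RefCounted() : mRefCnt(1) { ++LiveCount(); }  // creator holds one reference
    void AddRef() { ++mRefCnt; }
    void Release() {
        if (--mRefCnt == 0)
            delete this;  // last reference gone
    }
    // Number of objects currently alive (for illustration only).
    static int& LiveCount() {
        static int count = 0;
        return count;
    }

private:
    ~RefCounted() { --LiveCount(); }  // destroyed only through Release()
    int mRefCnt;
};

// Releases a reference and nulls the pointer so it cannot be reused by accident.
template <class T>
void ReleaseAndNull(T*& ptr) {
    if (ptr) {
        ptr->Release();
        ptr = nullptr;
    }
}
```

In the example above, that means one release each for page, author, me, the resource manager, and the database, in whatever order suits your error handling.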

Querying the RDF Back End

Much of the code required to query the RDF back end is identical to that required to manipulate it. Specifically, you need to:

1. Acquire a database to manipulate
2. Acquire the resource manager
3. Acquire RDF resource nodes
4. Use the RDF nodes to create a cursor
5. Iterate results from the cursor
6. Clean up

Steps 1, 2, 3, and 6 are described in detail above. Steps 4 and 5 are illustrated in detail below.

Use the RDF nodes to create a cursor. The code below illustrates construction of a cursor that can be used to iterate through the arcs of an RDF graph.

#include "nsIRDFCursor.h"

// assuming that we've acquired appropriate
// nsIRDFNode objects, we'll use them to
// create a cursor that iterates the authors of
// the current web page.
nsIRDFCursor* cur;
nsresult rv =
    db->GetTargets(page, author, PR_TRUE, &cur);

if (NS_FAILED(rv)) {
    // deal with error
}

The above use of the GetTargets method returns a cursor that enumerates all of the nodes that appear at the end of arcs labeled with the author property that lead out of the page node.

Iterate results from the cursor. Once you've acquired a cursor, iterating through the values is easy.

while (1) {
    PRBool more;
    nsresult rv = cur->HasMoreElements(more);
    if (NS_FAILED(rv)) {
        // deal with error
        break;
    }

    if (! more)
        break; // we're done!

    // get the next element
    nsIRDFNode* result;
    PRBool tv;
    rv = cur->GetNext(result, tv);
    if (NS_FAILED(rv)) {
        // deal with error
        break;
    }

    // now result contains the next
    // node, and tv contains the truth-
    // value of the arc that leads to
    // result.
}

Clean up. There are two clean-up tasks that are easy to forget with cursors. First, release each object that you enumerate with the cursor using GetNext. Second, release the cursor object itself.

Pluggable Data Sources

The RDF back end was designed with extensibility in mind. This section describes pluggable data sources. Should this go into a separate doc?

There are two basic strategies you can take in designing your pluggable data source. The first strategy involves constructing an in-memory copy of the data using an nsMemoryDataSource object, and then delegating your implementation of the nsIRDFDataSource interface to the memory data source object. This approach works well if your data source is relatively small, can be manipulated only via the front end, and you don't require fine-grained control over what happens "behind the scenes" when changes are made to the RDF graph. This strategy is relatively easy to implement, and involves constructing two translators: one to import data into the graph from the data source, and one to export data from the graph back to the data source. The "bookmarks" data source is implemented this way.
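As a rough sketch of this first strategy, assume a toy bookmarks-like format with one "name url" pair per line. The MemoryStore class stands in for the memory data source, and LineFileSource, its method names, and the "bookmarks:root" node are all hypothetical; the real bookmarks data source is certainly more involved.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Assertion {
    std::string source, property, target;
};

// Hypothetical stand-in for the in-memory data source.
class MemoryStore {
public:
    void Assert(const std::string& s, const std::string& p,
                const std::string& t) {
        Assertion a = {s, p, t};
        mFacts.push_back(a);
    }
    bool HasAssertion(const std::string& s, const std::string& p,
                      const std::string& t) const {
        for (const Assertion& a : mFacts) {
            if (a.source == s && a.property == p && a.target == t)
                return true;
        }
        return false;
    }
    const std::vector<Assertion>& All() const { return mFacts; }

private:
    std::vector<Assertion> mFacts;
};

// Hypothetical pluggable source: two translators plus pure delegation.
class LineFileSource {
public:
    // Import translator: external format -> graph.
    void Import(const std::string& text) {
        std::istringstream in(text);
        std::string name, url;
        while (in >> name >> url)
            mStore.Assert("bookmarks:root", name, url);
    }
    // Export translator: graph -> external format.
    std::string Export() const {
        std::string out;
        for (const Assertion& a : mStore.All())
            out += a.property + " " + a.target + "\n";
        return out;
    }
    // Data source methods simply forward to the memory store.
    void Assert(const std::string& s, const std::string& p,
                const std::string& t) {
        mStore.Assert(s, p, t);
    }
    bool HasAssertion(const std::string& s, const std::string& p,
                      const std::string& t) const {
        return mStore.HasAssertion(s, p, t);
    }

private:
    MemoryStore mStore;
};
```

All of the graph logic lives in the memory store; the pluggable source contributes only the two translators.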

The second strategy involves directly implementing the nsIRDFDataSource interface. This tends to work better if your data source is too large to fit in memory, changes are frequently made to the contents of the data source "behind the scenes", or you need to control exactly what happens when an assertion is added to or removed from the graph. This strategy requires more care to implement, as it requires you to create the "graph illusion" yourself. This is probably how the IMAP data source will be implemented.
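By contrast, a sketch of the second strategy answers each query by consulting the underlying store at the moment of the call, so that changes made behind the scenes show up immediately and nothing is copied into memory. The MailStore typedef, the MailSubjectSource class, and the "subject" property name are hypothetical; a real IMAP data source would talk to a server rather than a std::map.

```cpp
#include <map>
#include <string>

// Hypothetical external store: message id -> subject line.
// Imagine that something else (the mail server) updates it at any time.
typedef std::map<std::string, std::string> MailStore;

// Hypothetical data source that creates the "graph illusion" on demand.
class MailSubjectSource {
public:
    explicit MailSubjectSource(const MailStore& store) : mStore(store) {}

    // Answers "does message `source` have subject `target`?" by looking
    // at the live store; no assertions are cached.
    bool HasAssertion(const std::string& source,
                      const std::string& property,
                      const std::string& target) const {
        if (property != "subject")
            return false;  // this source only knows about one arc label
        MailStore::const_iterator it = mStore.find(source);
        return it != mStore.end() && it->second == target;
    }

private:
    const MailStore& mStore;  // not owned; always consulted live
};
```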

Open Issues

There are some open issues here. How do we register pluggable data sources? How does a database "know" that it needs to import a pluggable data source's subgraph?