grendel database requirements
by Terry Weissman
16-Oct-1997
Note: this does not include address book needs! Lester Schueler is working on a separate document
to cover our needs there, which is very different.
what we need to store.
Usually, when I hear people talking about using databases in Xena, they
seem to be talking about a convenient place to store a few dozen or a few
hundred objects. But Grendel is different in three ways:
- We already have a place to store the messages, and that's largely
outside of our control. Messages live on IMAP or NNTP servers. Or they
live in Berkeley mail folders. We could conceivably replace the Berkeley
mail folders with something else, but there are sound interoperability
reasons why we want to keep using them. Anyway, we can't replace the IMAP
or NNTP servers. So, we're not looking for a database to store the
messages themselves.
- What we need is a way to index the messages. That is,
given a set of criteria, we need to find the set of messages that satisfy
the criteria. We would expect the returned messages to be in the form of
pointers into the actual message store; we can then chase the pointers to
get the messages themselves. The database would contain enough
information to build a summary line for that message.
- We need to be able to store hundreds of thousands of messages. (If
you think that number is too high, then you aren't thinking about
news.)
Most traditional databases that I'm familiar with let you store records
of data. Each record consists of several fields. A few of the fields are
special "key" fields, that you can do fast searches on.
We can use this kind of traditional database, but it turns out every
field we store will need to be a key field, because we want to do sorting
and searching based on any of these fields.
And we have a lot of fields to store:
- Message-id
- Folder this message is stored in
- Date
- From
- List of "To" recipients
- List of "Cc" recipients
- List of "Bcc" recipients
- Subject
- References
- Priority
- Size
- Flags (read/unread, flagged, replied, forwarded, etc.)
- And I'm sure I'm missing some
Actually, the set of headers to store should probably be
user-customizable; some users (like jwz) will want to store every possible
header.
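The record model above can be sketched as a plain Java class. The point is
that every field is a potential sort or search key, so views are driven by
arbitrary comparators rather than one fixed primary key. All names here are
illustrative, not from any actual Grendel source:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of one index record: a pointer into the real
// message store plus the cached header fields needed for summary
// lines, sorting, and searching.  Field and class names are invented.
public class MessageIndexEntry {
    public String messageId;
    public String folderUrl;       // pointer to where the message lives
    public long   date;            // epoch millis, cheap to sort on
    public String from;
    public List<String> to;
    public List<String> cc;
    public List<String> bcc;
    public String subject;
    public List<String> references;
    public int    priority;
    public long   size;
    public int    flags;           // bitmask: read, flagged, replied, ...

    public static final int FLAG_READ    = 1 << 0;
    public static final int FLAG_FLAGGED = 1 << 1;
    public static final int FLAG_REPLIED = 1 << 2;

    // Because any field can drive a view, sort order is expressed as
    // an arbitrary Comparator rather than a fixed key.
    public static final Comparator<MessageIndexEntry> BY_DATE =
        Comparator.comparingLong(e -> e.date);
    public static final Comparator<MessageIndexEntry> BY_SUBJECT =
        Comparator.comparing(e -> e.subject);
}
```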
reliability needs.
Much or all of what we'll store in the database is just cached
information from the messages themselves. So, theoretically, if the
database blows up, we can recreate it. Practically, though, this would suck
a lot, as many users will have enough data that recreating it would take
hours or days.
performance needs.
It's gotta fly. Both reading and writing have to be pretty much
instantaneous for the common cases. And the common cases are pretty broad.
We don't believe it is possible to write a database that has all the
indexing we need, and all the reliability we need, and still get all the
performance we need. So, we've figured out two dodges to help:
- Use 3.0-style summary files for the common case.
We won't even use a spiffy database for the usual case of "show me
all the messages in this folder." Instead, we'll maintain a file for
each folder that contains all the info we need about that folder;
whenever the user opens up a folder, we inhale into memory the entire
contents of the relevant file. This is a proven technique that covers the
vast majority of common cases really well. It has some scaling problems,
though, and it definitely doesn't allow for the nifty cross-folder views
and searches that we really want to do. So, we still want a database to
handle those new features.
We would love to be proven wrong, and to just throw away the summary
file code and use the database for everything. But we are not yet
convinced that this will ever work, and so we're not prepared to count
on it.
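A minimal sketch of the summary-file approach: one record per message, the
whole file inhaled into memory when the folder is opened, no database touched
at all. The tab-separated format and all names here are made up for
illustration; they are not the actual 3.0 summary-file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-folder summary file: one tab-separated line per
// message, read wholesale into memory on folder open.
public class FolderSummary {
    public static class Line {
        public final String messageId, from, subject;
        public final long date;
        Line(String messageId, long date, String from, String subject) {
            this.messageId = messageId; this.date = date;
            this.from = from; this.subject = subject;
        }
    }

    public static List<Line> inhale(Reader in) throws IOException {
        List<Line> lines = new ArrayList<>();
        BufferedReader r = new BufferedReader(in);
        String row;
        while ((row = r.readLine()) != null) {
            String[] f = row.split("\t", 4);
            lines.add(new Line(f[0], Long.parseLong(f[1]), f[2], f[3]));
        }
        return lines;  // everything now lives in memory; no database hit
    }
}
```

Once inhaled, sorting and displaying the folder is pure in-memory work, which
is why this covers the common case so well and why it scales poorly past a
certain folder size.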
- Update the database in a background thread.
(Any database gurus will probably laugh as I clumsily describe what
has got to be a standard database technique. If it's not a standard
technique, I deserve a patent.) Whenever changes are made that need to be
stored in the database, don't immediately commit them to the database,
because this is slow and will block the user from doing anything else
that requires the database until it completes. Instead, note them down in
a log file, and have a background thread incrementally commit the
changes. Any database queries would check for entries in the log file,
and merge in results from there. This technique makes database changes
appear to happen instantaneously; the cost is that any immediate queries
will run slightly slower as they merge in the uncommitted changes from the
log file.
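The background-commit dodge is essentially a write-ahead log with lazy
apply. A toy sketch of the idea, with an in-memory map standing in for the
slow database and all names invented:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// "Log now, commit later": writes append to a pending log and return
// immediately; a background thread drains the log into the slow
// committed store; reads overlay pending entries on committed state.
public class LazyIndex {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();
    private final Queue<String> log = new ConcurrentLinkedQueue<>();

    // Fast path: note the change in the log and return at once.
    public synchronized void put(String messageId, String folder) {
        pending.put(messageId, folder);
        log.add(messageId);
    }

    // Queries merge uncommitted log entries with committed state.
    public synchronized String get(String messageId) {
        String v = pending.get(messageId);
        return v != null ? v : committed.get(messageId);
    }

    // Called repeatedly from a background thread; each call stands in
    // for one slow database commit.
    public void commitOne() {
        String messageId = log.poll();
        if (messageId == null) return;
        synchronized (this) {
            String v = pending.remove(messageId);
            if (v != null) committed.put(messageId, v);
        }
    }
}
```

The change is visible to readers the instant put() returns, whether or not
the background thread has gotten around to committing it yet.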
The first dodge helps us avoid the need for fast queries for the
database, but we'll still want it as fast as possible for the new features
that aren't handled well by summary files.
The second dodge helps us avoid the need for fast updates to the
database, but a slow database will still definitely suck in a lot of ways,
especially when the user does things like move thousands of messages from
one folder to another, or receives a ton of new mail, or imports an entire
new folder, or needs to rescan a whole folder (see below).
other gotchas.
We have to be able to quickly throw out and rebuild large chunks of
data, because at any time we may detect that everything we once knew about
a folder is suddenly invalid. If another application has changed an IMAP
folder or a Berkeley mail folder, we can detect the fact that a change
happened, but we can't know what changed. We have no choice but to throw
out everything in the database that relates to the folder, and recreate
it. Just the "throwing out" part can be a really expensive operation in
many databases.
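One way to make the "throwing out" step cheap, sketched here with invented
names, is to partition the index by folder, so that invalidating a folder
drops a whole partition at once instead of deleting record by record:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: partition the index by folder so "throw out everything we
// know about this folder" is a single map removal, not a
// scan-and-delete over every record in the database.
public class FolderPartitionedIndex {
    private final Map<String, Set<String>> byFolder = new HashMap<>();

    public void add(String folder, String messageId) {
        byFolder.computeIfAbsent(folder, f -> new HashSet<>()).add(messageId);
    }

    public int count(String folder) {
        Set<String> s = byFolder.get(folder);
        return s == null ? 0 : s.size();
    }

    // Another application changed the folder out from under us: drop
    // the whole partition cheaply and let a rescan rebuild it.
    public void invalidate(String folder) {
        byFolder.remove(folder);
    }
}
```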
Another nasty consequence of this is that it means the database is
probably not a good place to write down any extra information about the
messages. It's tempting to put annotations and extra status information
solely in the database, without writing it anywhere in the real message
itself. But because folders can be changed out from under us without
warning, it is also tempting to treat the entire database as just a
cache, where anything can be thrown out and recreated at will. These two
goals conflict.
fitting in with RDF.
One bit of good news is that RDF's view of data is a model that works
well for our needs. A really fast database that directly implements the RDF
model could be used by our stuff as-is. The main gap would be the ability
to sort query results; as near as I can tell, RDF does not support sorting
of results. But I think we can live without the database sorting its
results.
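Living without database-side sorting just means ordering the returned
pointers ourselves once the query comes back. A small sketch, with all names
illustrative:

```java
import java.util.Comparator;
import java.util.List;

// Sketch: take the unordered result set from the database (a pointer
// into the message store plus the cached field we want to order by)
// and sort it in memory after the query returns.
public class ClientSideSort {
    public static class Result {
        public final String pointer;   // where the real message lives
        public final long date;
        public Result(String pointer, long date) {
            this.pointer = pointer; this.date = date;
        }
    }

    public static List<Result> sortByDate(List<Result> unordered) {
        unordered.sort(Comparator.comparingLong(r -> r.date));
        return unordered;
    }
}
```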
truly ambitious stuff.
We haven't thought a lot about it, but we'd love a database that would
support full body text indexing on messages. Yow.
|
|
 |