random pile of notes, mostly out of date for now...
------------------------------------------------------------
When walking the store, how do we tell whether a message is already in the
archive? Is a digest enough, given that the same message (same digest)
could be in two separate folders?
We really need a foreign id:
IMAP: UIDVALIDITY+":"+UID
Exchange: ?
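
A minimal sketch of the IMAP foreign id idea, in Java. The class and method
names are hypothetical, and the in-memory set stands in for whatever lookup
the archive would really use. Scoping the id by folder is an assumption added
here, since IMAP UIDs are only unique within a single mailbox (folder).

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks which foreign ids have already been archived (in-memory stand-in for a DB lookup). */
final class ForeignIdIndex {
    private final Set<String> seen = new HashSet<>();

    /** IMAP foreign id as proposed in the notes: UIDVALIDITY + ":" + UID. */
    static String imapForeignId(long uidValidity, long uid) {
        return uidValidity + ":" + uid;
    }

    /**
     * Records the id for a folder and reports whether it was new.
     * Prefixing the folder is an assumption, not something the notes specify.
     */
    boolean markArchived(String folder, String foreignId) {
        return seen.add(folder + "/" + foreignId);
    }
}
```

With this keying, the same message filed in two folders is treated as two
archive entries even when the digests match, which is the case the digest
alone cannot distinguish.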
------------------------------------------------------------
High Level
Architecture
Server Components
Master/Event Queuer/Archiver/Indexer/Searcher
Technology
Linux/Java/Apache/Tomcat/Lucene/MySQL
Master Directory
contains system-wide information, such as where users' mailboxes
are located, global configuration parameters, etc. Replicated
between boxes.
Event Queuer
queues incoming events as quickly as possible to stable storage
Archiver
takes the queued events and adds them to the store
also queues up an event for the Indexer
looks at policy to determine if something is supposed to be
archived?
Indexer
applies filter rules? only on initial indexing or always?
indexes blobs of data
MimeTyper - determines the mime type of a blob of data.
given a mime type, locates a TextExtractor
TextExtractor - one for each mime type, knows how to convert a particular
mime type into text.
ObjectTyper - one for each object type
ObjectExtractor - extracts "objects" from text, by running through all
the enabled ObjectTypers
MimeHandler - one for each mime type. Uses knowledge of mime type
to get text and metadata about document, extract
objects, and index the document.
Searcher
given a query, searches the archive and comes up with
a list of matching blob ids, limited to the blob ids the
requester is authorized to see. Can also retrieve metadata
about a blob (which might be stored in MySQL or the index).
Distributed Searcher
Maintenance
Roller - determines (via Policy) when stuff needs to be rolled off
the archive (to other storage or just deleted)
Reaper - garbage collects data that has no references: blobs, index entries,
etc.
Checker - checks consistency of DB and store? Combined with Reaper or
separate more detailed pass?
Balancer/Mover - moves a mailbox between boxes
Backup/Restore - backs up all the archived mail and metadata
to other storage in case of total failure of a box.
Monitoring/Logging/Reporting
WebServices - XML SOAP interface into box
Event Queuing
Archive Admin Management
Box Admin Management
Searching/Browsing
Box-2-Box
Database Schema
Blob Store Layout
Exchange/AD Integration
Appliance Infrastructure
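
The ObjectTyper/ObjectExtractor split listed under Indexer could look roughly
like the Java sketch below. The interface shape, the regex-backed typer, and
the patterns used with it are all assumptions for illustration. The output
follows the "{type}=nnn:{data};" form from the l.objects field notes, taking
nnn to be the length of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One typer per object type (hypothetical interface; the name follows the notes). */
interface ObjectTyper {
    String type();
    List<String> find(String text);
}

/** Regex-backed typer; any pattern passed in is illustrative only. */
final class RegexObjectTyper implements ObjectTyper {
    private final String type;
    private final Pattern pattern;

    RegexObjectTyper(String type, String regex) {
        this.type = type;
        this.pattern = Pattern.compile(regex);
    }

    @Override public String type() { return type; }

    @Override public List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) hits.add(m.group());
        return hits;
    }
}

/** Runs every enabled typer over the text and emits "{type}=nnn:{data};" entries. */
final class ObjectExtractor {
    private final List<ObjectTyper> enabled;

    ObjectExtractor(List<ObjectTyper> enabled) { this.enabled = enabled; }

    String extract(String text) {
        StringBuilder out = new StringBuilder();
        for (ObjectTyper typer : enabled) {
            for (String data : typer.find(text)) {
                // nnn is written as the character length of the data (assumption).
                out.append(typer.type()).append('=').append(data.length())
                   .append(':').append(data).append(';');
            }
        }
        return out.toString();
    }
}
```

Keeping each typer behind one small interface means enabling or disabling an
object type is just a matter of which typers are handed to the extractor.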
----------
BlobManager - stores and retrieves blobs of data
responsible for taking incoming messages and storing them.
Handles sharing, attachment extraction, storing in a file bucket, etc.
MailboxManager - associates blobs with a mailbox and a folder attribute
Authenticator/Enforcer
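
A rough Java sketch of the BlobManager sharing idea: identical content is
stored once and tracked by reference count, so unreferenced blobs can be
dropped (the Reaper's job in the notes). Everything here is hypothetical: the
in-memory maps stand in for the file buckets, and SHA-256 is an assumed choice
of digest.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of a blob store that shares identical content by digest. */
final class BlobManager {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Hex SHA-256 of the content; the choice of SHA-256 is an assumption. */
    static String digest(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Stores the blob once and returns its id; repeat stores only add a reference. */
    String store(byte[] data) {
        String id = digest(data);
        blobs.putIfAbsent(id, data.clone());
        refCounts.merge(id, 1, Integer::sum);
        return id;
    }

    /** Returns a copy of the blob, or null if the id is unknown. */
    byte[] retrieve(String id) {
        byte[] data = blobs.get(id);
        return data == null ? null : data.clone();
    }

    /** Drops one reference; the blob is removed once nothing refers to it. */
    void release(String id) {
        Integer left = refCounts.computeIfPresent(id, (k, n) -> n > 1 ? n - 1 : null);
        if (left == null) blobs.remove(id);
    }

    int blobCount() { return blobs.size(); }
}
```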
------------------------------------------------------------
1. message comes in via web service
   a. data gets immediately queued to a queue file, which contains:
      - the message content
      - meta-data (which mailbox it is for, etc)
2. separate process runs through the queue files
   a. hash of message is compared against message store to see if message
      is already in store
   b. message gets parsed (attachments identified, decoded)
   c. system-wide filters get applied to message before it is added to store
3. message gets added to the message store
   a. attachments (if policy match) are broken out and stored
      as blobs (identify mime-type of blobs)
   b. journal file gets updated, entry contains:
      - destination mailbox id
      - foreign message id/url
      - message ref/blob refs
4. message ref gets added to destination mailbox
5. process blob:
   - get set of parts
   - identify mime-type of blobs
   - identify objects in blobs
   - index message content/attachments in global index
     (senders, type of attachments, size, date, objects)
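
Step 1a (queue the message content plus meta-data as fast as possible) could
be sketched in Java as a length-prefixed record. The record layout and the
field set (mailbox, foreign id, content) are assumptions, and the in-memory
round trip stands in for a real queue file that would be flushed and synced
to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** One queued event: meta-data plus raw message content (field set is an assumption). */
final class QueueRecord {
    final String mailbox;
    final String foreignId;
    final byte[] content;

    QueueRecord(String mailbox, String foreignId, byte[] content) {
        this.mailbox = mailbox;
        this.foreignId = foreignId;
        this.content = content;
    }

    /** Appends the record; a real queuer would flush and sync after this. */
    void writeTo(DataOutputStream out) throws IOException {
        out.writeUTF(mailbox);
        out.writeUTF(foreignId);
        out.writeInt(content.length); // length prefix so records can be read back to back
        out.write(content);
    }

    static QueueRecord readFrom(DataInputStream in) throws IOException {
        String mailbox = in.readUTF();
        String foreignId = in.readUTF();
        byte[] content = new byte[in.readInt()];
        in.readFully(content);
        return new QueueRecord(mailbox, foreignId, content);
    }

    /** Serializes and re-reads a record in memory, standing in for the queue file. */
    static QueueRecord roundTrip(QueueRecord r) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                r.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing is parsed at queue time; parsing, filtering and hashing are left to
the separate process in step 2, which keeps the enqueue path short.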
------------------------------------------------------------
if message is deleted instead of added:
- mark as deleted in mailbox
if message is moved to another folder instead of added:
- mark as deleted in mailbox
- add new message/folder to mailbox
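
The delete and move rules above can be sketched in Java as an append-only
list of mailbox entries, where a move is a delete mark plus a new entry. The
class, its fields and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Entries are never rewritten in place: a move marks the old entry deleted and appends a new one. */
final class Mailbox {
    static final class Entry {
        final String blobId;
        final String folder;
        boolean deleted;

        Entry(String blobId, String folder) {
            this.blobId = blobId;
            this.folder = folder;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String blobId, String folder) {
        entries.add(new Entry(blobId, folder));
    }

    /** Marks the live entry for blobId in folder as deleted; returns false if none is found. */
    boolean delete(String blobId, String folder) {
        for (Entry e : entries) {
            if (!e.deleted && e.blobId.equals(blobId) && e.folder.equals(folder)) {
                e.deleted = true;
                return true;
            }
        }
        return false;
    }

    /** Move = mark as deleted in the old folder, then add to the new folder. */
    boolean move(String blobId, String from, String to) {
        if (!delete(blobId, from)) return false;
        add(blobId, to);
        return true;
    }

    /** Folders in which the blob still has a non-deleted entry. */
    List<String> liveFolders(String blobId) {
        List<String> folders = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.deleted && e.blobId.equals(blobId)) folders.add(e.folder);
        }
        return folders;
    }

    int size() { return entries.size(); }
}
```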
----------------------------------------------------------------------
Messages
PATH
Lucene Fields
from               from header
to                 to header
cc                 cc header
subject            subject header
date               date header (for summary only? needed along with l.date?)
l.content          concatenation of all the text parts + subject
l.date             lucene-ized date for searching
l.size             size (with leading 0's for range searches, or convert to hex?)
*mail.attachments  unique set of all attachment content types, or "none" if no attachments
l.type             mime type (e.g., message/rfc822)
l.blob_id          database blob_id
*l.domains         stored compressed domain tree with special TokenStream that expands it into tokens?
                   reverse the domain tokens for prefix searching? e.g., edu.stanford.lists to
                   allow searching for "*stanford.edu" (maps to edu.stanford*)
*l.objects         list of objects like: "{type}=nnn:{data};...". Special analyzer will
                   index only the unique list of object types. For example:
                   phone=8:123-4567;url=20:http://slashdot.org/;phone=3:911;
                   will get tokenized to "phone,url"
l.mbox_id          mailbox ID
l.mbox_blob_id     mailbox_blob.ID in database
l.partname         hierarchical dotted-number name for MIME part (e.g. 2.1.3)
l.thread_id        which message thread does this message belong to?
message-id         "Message-ID" message header
references         "References" message header (non-indexed, non-tokenized, and only stored)
                   any message ID info from In-Reply-To header is merged into this header
* = special Analyzer to handle this field
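
Three of the field encodings above (l.size, l.domains, l.objects) can be
tried out without Lucene. The Java helpers below are a sketch only: the names
are hypothetical, the padding width of 10 is an assumption, and nnn in the
objects list is taken to be the length of the data that follows the colon.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class FieldEncoders {
    private FieldEncoders() {}

    /** Zero-pads a size so that string order matches numeric order (width 10 is an assumption). */
    static String paddedSize(long size) {
        if (size < 0) throw new IllegalArgumentException("size must be non-negative");
        return String.format(Locale.ROOT, "%010d", size);
    }

    /** Reverses the labels of a domain so a suffix search becomes a prefix search. */
    static String reverseDomain(String domain) {
        String[] labels = domain.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder out = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            out.append(labels[i]);
            if (i > 0) out.append('.');
        }
        return out.toString();
    }

    /** Returns the unique object types from a "{type}=nnn:{data};..." list, in first-seen order. */
    static Set<String> objectTypes(String objects) {
        Set<String> types = new LinkedHashSet<>();
        int pos = 0;
        while (pos < objects.length()) {
            int eq = objects.indexOf('=', pos);
            if (eq < 0) break;
            int colon = objects.indexOf(':', eq + 1);
            if (colon < 0) break;
            int len = Integer.parseInt(objects.substring(eq + 1, colon));
            types.add(objects.substring(pos, eq));
            // Skip the data by its declared length, so ';' or ':' inside the data is harmless.
            pos = colon + 1 + len + 1;
        }
        return types;
    }
}
```

With reversed domains, a query for "*stanford.edu" turns into a prefix match
on "edu.stanford", which is the mapping the l.domains note describes.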
create a lucene document only for each attachment that is pulled out?
content
size (with leading 0's for range searches?)
contentType (list of all enclosed attachment types)
id
DATABASE
NOTES
------------------------------------------------------------
multipart/mixed
  1 text/plain
  2 text/plain
  3 application/vnd.ms-powerpoint
  4 application/octet-stream
  5 application/msword
multipart/mixed
  1 text/plain
  2 message/rfc822
    2.1 multipart/mixed
      2.1.1 text/plain
      2.1.2 image/jpeg
      2.1.3 image/jpeg
multipart/mixed
  1 multipart/alternative
    1.1 text/plain
    1.2 text/html
  2 application/vnd.ms-excel
  3 application/vnd.ms-powerpoint
  4 application/msword
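
The part numbering in the examples above follows one rule: the root gets no
name, and each child is named after its parent plus its 1-based position. A
small Java sketch of that rule, using a hypothetical MimePart tree class
rather than a real MIME parser:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Bare-bones part tree; a real implementation would come from the MIME parser. */
final class MimePart {
    final String contentType;
    final List<MimePart> children;

    MimePart(String contentType, MimePart... children) {
        this.contentType = contentType;
        this.children = Arrays.asList(children);
    }

    /** Lists "partname contentType" for every part below the root, depth first. */
    static List<String> partNames(MimePart root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(MimePart part, String prefix, List<String> out) {
        for (int i = 0; i < part.children.size(); i++) {
            MimePart child = part.children.get(i);
            // Child name is the parent's name plus the 1-based index, dot separated.
            String name = prefix.isEmpty() ? String.valueOf(i + 1) : prefix + "." + (i + 1);
            out.add(name + " " + child.contentType);
            walk(child, name, out);
        }
    }
}
```

Here a message/rfc822 part is modeled as having a single child (the body of
the enclosed message), which is what yields names such as 2.1.3 in the second
example.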
------------------------------------------------------------
event generator sends:
mailbox name
*hopefully mailbox foreign_id
folder_name
blob foreign_id
blob name
blob mime type?
------------------------------------------------------------
#------------------------------- SOAP 1.2 request
POST /calc-service/soap/ HTTP/1.1
Host: localhost:8080
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 243
[body: SOAP envelope, XML markup not preserved; surviving values: 1 and 2]
#------------------------------- SOAP 1.2 response
HTTP/1.1 200 OK
content-length: 215
content-type: application/soap+xml;charset=utf-8
server: Apache-Coyote/1.1
date: Wed, 26 May 2004 01:51:43 GMT
[body: SOAP envelope, XML markup not preserved; surviving value: 3]
#------------------------------- SOAP 1.1 request
POST /calc-service/soap/ HTTP/1.1
Host: localhost:8080
SOAPAction: http://localhost:8080/calc-service/soap/
Content-Type: text/xml; charset=utf-8
Content-Length: 260
[body: SOAP envelope, XML markup not preserved; surviving values: 1 and 2]
#------------------------------- SOAP 1.1 response
HTTP/1.1 200 OK
content-length: 232
content-type: text/xml;charset=utf-8
server: Apache-Coyote/1.1
date: Wed, 26 May 2004 01:54:12 GMT
[body: SOAP envelope, XML markup not preserved; surviving value: 3]
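
The header differences between the two captures can be summarized in a small
Java sketch: SOAP 1.1 sends text/xml plus a SOAPAction header, while SOAP 1.2
sends application/soap+xml and no SOAPAction header. The class and method
names are hypothetical, and the caller supplies the envelope body, since its
exact markup is not part of these notes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class SoapHeaders {
    private SoapHeaders() {}

    /** Builds the HTTP request headers for a SOAP POST; soapAction is only sent for SOAP 1.1. */
    static Map<String, String> forRequest(String version, String host, String soapAction, String body) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Host", host);
        if ("1.1".equals(version)) {
            headers.put("SOAPAction", soapAction);
            headers.put("Content-Type", "text/xml; charset=utf-8");
        } else if ("1.2".equals(version)) {
            headers.put("Content-Type", "application/soap+xml; charset=utf-8");
        } else {
            throw new IllegalArgumentException("unsupported SOAP version: " + version);
        }
        // Content-Length counts bytes of the encoded body, not characters.
        headers.put("Content-Length", String.valueOf(body.getBytes(StandardCharsets.UTF_8).length));
        return headers;
    }
}
```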