random pile of notes, mostly out of date for now... ------------------------------------------------------------ When walking the store, how do we tell if a message is already in the archive or not? Is using a digest enough, since there could be a message in two seperate folders with the same digest. We really need a foreign id: IMAP: UIDVALIDITY+":"+UID Exchange: ? ------------------------------------------------------------ High Level Architecture Server Components Master/Event Queuer/Archiver/Indexer/Searcher Technology Linux/Java/Apache/Tomcat/Lucene/MySQL Master Directory contains system-wide information, like where user's mailboxes are located, global configuration parameters, etc. replicated between boxes Event Queuer queues incoming events as quickly as possible to stable storage Archiver takes the queued events and adds them to the store also queues up an event for the Indexer looks at policy to determine if something is supposed to be archived? Indexer applies filter rules? only on initial indexing or always? indexes blobs of data MimeTyper - determines the mime type of a blob of data. given a mime type, locates a TextConvert TextExtractor - one for each mime type, knows how to convert a particular mime type into text. ObjectTyper - one for each object type ObjectExtractor - extracts "objects" from text, by running throgh all the enabled ObjectTypers MimeHandler - one for each mime type. Uses knowledge of mime type to get text and metadata about document, extract objects, and index the document. Searcher given a query, searches the archive and come with a list of matching blob ids, limited to only blob ids someone is authorized to see. Can also retrieve metadata about a blob (which might be stored in mysql or the index). Distributed Searcher Maintenance Roller - determines (via Policy) when stuff needs to be rolled off the archive (to other storage or just deleted) Reaper - garbage collects data that has no references: blobs, index entries, etc. Checker - checks consistency of DB and store? Combined with Reaper or separate more detailed pass? Balancer/Mover - moves a mailbox between boxes Backup/Restore - backups up all the archived mail and metadata to other storage in case of total failure of a box. Monitoring/Logging/Reporting WebServices - XML SOAP interface into box Event Queuing Archive Admin Management Box Admin Management Searching/Browsing Box-2-Box Database Schema Blob Store Layout Exchange/AD Integeration Appliance Infrastructure ---------- BlobManager - stores and retrieves blobs of data responsible for taking incoming messages and storing them. Handles sharing, attachment extraction, storing in a file bucket, etc. MailboxManager - associates blobs with a mailbox and a folder attribute Authenticator/Enforcer ------------------------------------------------------------ ------------------------------------------------------------ --------------------- 1. message comes in via web service a. data gets immediately queued to a queue file, which contains: the message content meta-data (which mailbox it is for, etc) 2. separate process runs through the queue files a. hash of message is compared against message store to see if message is already in store b. message gets parsed (attachments identified, decoded) c. system-wide filters get applied to message before it is added to store 3. message gets added to the message store, a. attachments (if policy match) are broken out and stored as blobs (identify mime-type of blobs) b. journal file gets updated, entry contains: destination mailbox id foreign message id/url message ref/blob refs 4. message ref gets added destination mailbox 5. process blob: get set of parts identify mime-type of blobs identify objects in blobs index message content/attachments in global index (senders, type of attachments, size, date, objects) ------------------------------------------------------------ if message is deleted instead of added: - mark as deleted in mailbox if message is moved to another folder instead of added: - mark as deleted in mailbox - add new message/folder to mailbox ---------------------------------------------------------------------- ---------------------------------------------------------------------- Messages PATH Lucene Fields from from header to to header cc cc header subject subject header date date header (for summary only?, needed along with l.date?) l.content concat of all the text parts + subject l.date lucene-ized date for searching l.size size (with leading 0's for range seraches or convert to hex?) *mail.attachments unique set of all attachement content types, or "none" if no attachments. l.type mime type (i.e., message/rfc822) l.blob_id database blob_id *l.domains stored compressed domain tree with special TokenStream that expands it into tokens? reverse the domain tokens for prefix searching? i.e., edu.stanford.lists to allow searching for "*stanford.edu" (maps to edu.stanford*) *l.objects list of objects like: "{type}=nnn:{data};...". Special analyzer will index only the unique list of object types. For example: phone=8:123-456;url=20:http://slashdot.org/;phone=3:911; will get tokenized to "phone,url" l.mbox_id mailbox ID l.mbox_blob_id mailbox_blob.ID in database l.partname hierarchical dotted-number name for MIME part (e.g. 2.1.3) l.thread_id which message thread does this message belong to? message-id "Message-ID" message header references "References" message header (non-indexed, non-tokenized, and only stored) any message ID info from In-Reply-To header is merged into this header * = special Analyzer to handle this field create a lucene document only for each attachment that is pulled out? content size (with leading 0's for range searches?) contentType (list of all enclosed attachment types) id DATABASE NOTES ------------------------------------------------------------ multipart/mixed 1 text/plain 2 text/plain 3 application/vnd.ms-powerpoint 4 application/octet-stream 5 application/msword multipart/mixed 1 text/plain 2 message/rfc822 2.1 multipart/mixed 2.1.1 text/plain 2.1.2 image/jpeg 2.1.3 image/jpeg multipart/mixed 1 multipart/alternative 1.1 text/plain 1.2 text/html 2 application/vnd.ms-excel 3 application/vnd.ms-powerpoint 4 application/msword ------------------------------------------------------------ event generator sends: mailbox name *hopefully mailbox foreign_id folder_name blob foreign_id blob name blob mime type? ------------------------------------------------------------ #------------------------------- SOAP 1.2 request POST /calc-service/soap/ HTTP/1.1 Host: localhost:8080 Content-Type: application/soap+xml; charset=utf-8 Content-Length: 243 1 2 #------------------------------- SOAP 1.2 response HTTP/1.1 200 OK content-length: 215 content-type: application/soap+xml;charset=utf-8 server: Apache-Coyote/1.1 date: Wed, 26 May 2004 01:51:43 GMT 3 #------------------------------- SOAP 1.1 request POST /calc-service/soap/ HTTP/1.1 Host: localhost:8080 SOAPAction: http://localhost:8080/calc-service/soap/ Content-Type: text/xml; charset=utf-8 Content-Length: 260 1 2 #------------------------------- SOAP 1.1 response HTTP/1.1 200 OK content-length: 232 content-type: text/xml;charset=utf-8 server: Apache-Coyote/1.1 date: Wed, 26 May 2004 01:54:12 GMT 3