~daisy-pluckers/oops-repository/trunk : contents of DESIGN.txt at revision 131

~daisy-pluckers/oops-repository/trunk : (revision 131)
============================
OOPS Repository design notes
============================

Design goals
============

OOPS Repository is intended to scale up to 1 million OOPS reports a day (and
possibly further). This is based on a 1% soft failre rate needing collection.

It needs to supports an extensible model, aggregation, automated garbage
collection, emitting messages for trend and fault detection systems and finally
realtime insertion and display of individual OOPSes.

Components
==========

Cassandra
---------

Cassandra was chosen because of the drop-dead simple method for increasing
write and read bandwidth available in the system.

OOPS Model
==========

An OOPS is an abstract server fault report. An OOPS has a mandatory identifier
assigned by the creator of the OOPS. An OOPS also has a json collection of
attributes, all of which are stored in the repostory, and some of which are
treated specially by front ends and reports. See the Schema for the attributes
known to the repository. While OOPSes are indexed by time in the repository,
and may have a datestamp, the system always indexes by the time the OOPS is
received rather than when it [may] have been generated. This is to simplify
the requirements for clients (they don't need to generate a datestamp or have
syncronised clocks).

The OOPS ID must be unique - one suggested way to generate them is to hash the
json of the fault report.

Schema
======

OOPS : Individual OOPSes are in this column family.
  row key : the oops ID supplied by the inserter
  mandatory columns:
    'date': LONG Used to build an index for garbage collection.
  optional known columns (all strings):
    'bug.*': Maps to bugs.
    'HTTP.*': HTTP variables. e.g. HTTP.method is PUT/POST/GET etc. 
    'REQUEST.*': arbitrary request variables.
    'context': The context for the fault report. E.g. a page template,
               particular API call - that sort of thing.
    'exception': The class of the exception causing the fault.
    'URL': The URL of the request.
    'username': the username.
    'userid': A database id for the user.
    'branch': Source code branch for the server.
    'revision': Revision of the server.
    'duration': The duration of the request in ms.
    'timeline': A json sequence describing the actions taken during the
        request. This may be split out to a separate CF in future. For now
        an example would be [{"start":"0", "length": "34", "database": "main",
            "statment":"SELECT ...", "callstack": "...."}, {....} ]

Summaries : various aggregates, kept up to date during inserts to permit live queries.
  row key : period - currently a iso8601 day - e.g. '20110227'
  columns: 'duration' | 'statement count' | 'volume.*' 
  For 'duration', the value is the N longest oopses inserted during the day. Writers write at Q and readback to ensure its consistent. [[duration, id], ...]
  For statement count, likewise [[count, id]].
  For volume., each column contains the occurences for one aggregate:
    frequency.context:exception -> count
    Once the period is over, the volumes are replaced with a single top-N list.

DayOOPS : day -> oops mappping
  row key : period - currently a iso8601 day - e.g. '20110227'
  columns : TimeUUID ->  OOPS ID, [future, suggested by mdennis: minuteinday,
            with a range index]