============================
OOPS Repository design notes
============================

Design goals
============

OOPS Repository is intended to scale up to 1 million OOPS reports a day (and
possibly further). This is based on a 1% soft failure rate needing collection.

It needs to support an extensible model, aggregation, automated garbage
collection, emitting messages for trend and fault detection systems and finally
realtime insertion and display of individual OOPSes.

Components
==========

Cassandra
---------

Cassandra was chosen because of the drop-dead simple method for increasing
write and read bandwidth available in the system.

OOPS Model
==========

An OOPS is an abstract server fault report. An OOPS has a mandatory identifier
assigned by the creator of the OOPS. An OOPS also has a JSON collection of
attributes, all of which are stored in the repository, and some of which are
treated specially by front ends and reports. See the Schema for the attributes
known to the repository.

While OOPSes are indexed by time in the repository, and may have a datestamp,
the system always indexes by the time the OOPS is received rather than when it
may have been generated. This is to simplify the requirements for clients (they
don't need to generate a datestamp or have synchronised clocks).

The OOPS ID must be unique - one suggested way to generate them is to hash the
JSON of the fault report.

Schema
======

OOPS : Individual OOPSes are in this column family.
  row key : the OOPS ID supplied by the inserter
  mandatory columns:
    'date': LONG. Used to build an index for garbage collection.
  optional known columns (all strings):
    'bug.*': Maps to bugs.
    'HTTP.*': HTTP variables, e.g. HTTP.method is PUT/POST/GET etc.
    'REQUEST.*': Arbitrary request variables.
    'context': The context for the fault report, e.g. a page template or a
      particular API call.
    'exception': The class of the exception causing the fault.
    'URL': The URL of the request.
    'username': The username.
    'userid': A database id for the user.
    'branch': Source code branch for the server.
    'revision': Revision of the server.
    'duration': The duration of the request in ms.
    'timeline': A JSON sequence describing the actions taken during the
      request. This may be split out to a separate CF in future. For now an
      example would be
      [{"start":"0", "length": "34", "database": "main",
        "statement":"SELECT ...", "callstack": "...."}, {....} ]

Summaries : various aggregates, kept up to date during inserts to permit live
  queries.
  row key : period - currently an ISO 8601 day - e.g. '20110227'
  columns: 'duration' | 'statement count' | 'volume.*'
    For 'duration', the value is the N longest OOPSes inserted during the day,
    stored as [[duration, id], ...]. Writers write at Q (quorum) and read back
    to ensure the list is consistent.
    For 'statement count', likewise: [[count, id]].
    For 'volume.*', each column contains the occurrences for one aggregate:
    frequency.context:exception -> count
    Once the period is over, the volumes are replaced with a single top-N list.

DayOOPS : day -> OOPS mapping
  row key : period - currently an ISO 8601 day - e.g. '20110227'
  columns : TimeUUID -> OOPS ID
    [future, suggested by mdennis: minuteinday, with a range index]
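
Illustrative sketch
===================

The model and schema above can be illustrated with a small Python sketch. It is
not the repository's implementation: the function names, the choice of SHA-256,
the use of UTC for the day key and the default N of 10 are all assumptions made
for the example. It shows one way to derive an OOPS ID by hashing the JSON of a
report, how a period row key such as '20110227' might be built from the receipt
time, and how a top-N [[duration, id], ...] list could be updated on insert.

```python
import hashlib
import heapq
import json
from datetime import datetime, timezone


def make_oops_id(report: dict) -> str:
    """Hypothetical helper: derive an OOPS ID by hashing the report's JSON."""
    # Sorting keys and fixing separators makes the serialisation deterministic,
    # so two equal reports always produce the same ID.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    # SHA-256 is an assumption; the notes only say "hash the JSON".
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def day_key(received: datetime) -> str:
    """Hypothetical helper: period row key for Summaries and DayOOPS."""
    # The notes index by receipt time; converting to UTC is an assumption.
    # `received` is expected to be a timezone-aware datetime.
    return received.astimezone(timezone.utc).strftime("%Y%m%d")


def update_top_n(current: list, value: int, oops_id: str, n: int = 10) -> list:
    """Return the n largest [value, id] pairs after adding one more pair."""
    candidates = [list(pair) for pair in current] + [[value, oops_id]]
    # Pairs compare by value first, then by ID, so the result is sorted
    # from the largest value downwards.
    return heapq.nlargest(n, candidates)


report = {"exception": "ValueError", "context": "page-template", "duration": "1250"}
rid = make_oops_id(report)
key = day_key(datetime(2011, 2, 27, 13, 45, tzinfo=timezone.utc))
top = update_top_n([[900, "a"], [300, "b"]], 1250, rid, n=2)
```

In a real writer the updated list would be written back to the 'duration'
column of the Summaries row for `key` and then read back, as described above;
that storage step is left out here because it depends on the Cassandra client
in use.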