============================
OOPS Repository design notes
============================

Design goals
============

OOPS Repository is intended to scale up to 1 million OOPS reports a day (and
possibly further). This is based on a 1% soft failure rate needing collection.

It needs to support an extensible model, aggregation, automated garbage
collection, emitting messages for trend and fault detection systems, and
realtime insertion and display of individual OOPSes.

Components
==========

Cassandra
---------

Cassandra was chosen because it offers a drop-dead simple method for
increasing the write and read bandwidth available in the system.

Schema
======

OOPS : Individual OOPSes are in this column family.
  row key : the oops ID supplied by the inserter
  mandatory columns:
    'date': LONG. Used to build a secondary index for garbage collection.
  optional known columns (all strings):
    'bug.*': Maps to bugs.
    'HTTP.*': HTTP variables, e.g. HTTP.method is PUT/POST/GET etc.
    'REQUEST.*': Arbitrary request variables.
    'context': The context for the fault report, e.g. a page template or a
      particular API call.
    'exception': The exception causing the fault.
    'URL': The URL of the request.
    'username': The username.
    'userid': A database id for the user.
    'branch': Source code branch for the server.
    'revision': Revision of the server.
    'duration': The duration of the request.
    'timeline': A json sequence describing the actions taken during the
      request. This may be split out to a separate CF in future. For now an
      example would be
      [{"start":"0", "length": "34", "database": "main",
      "statment":"SELECT ...", "callstack": "...."}, {....} ]
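
As an illustration of the OOPS column family layout described above, the
following is a minimal sketch of how an inserter might assemble the columns
for one row before writing it. It is not the repository's actual code:
the helper name ``build_oops_columns``, the example row key, and the choice of
seconds since the epoch for 'date' are all assumptions made for this example,
and no Cassandra client calls are shown.

```python
import json
import time


def build_oops_columns(report, timeline=None, now=None):
    """Return the column dict for one row in the OOPS column family.

    Hypothetical helper. `report` maps optional column names
    (e.g. 'HTTP.method', 'URL', 'duration') to their values.
    """
    columns = {}
    for name, value in report.items():
        # All optional known columns are stored as strings.
        columns[name] = str(value)
    if timeline is not None:
        # 'timeline' is a JSON sequence kept in a single string column.
        columns['timeline'] = json.dumps(timeline)
    # 'date' is the only mandatory column and is a LONG.
    # Assumption: the value is seconds since the epoch.
    columns['date'] = int(time.time() if now is None else now)
    return columns


# The row key is the oops ID supplied by the inserter (hypothetical value).
row_key = 'OOPS-example-1'
columns = build_oops_columns(
    {'HTTP.method': 'GET', 'URL': 'http://example.com/', 'duration': 0.25},
    timeline=[{'start': '0', 'length': '34', 'database': 'main'}],
    now=1300000000,
)
```

Keeping 'date' as an integer while coercing everything else to a string
mirrors the split between the mandatory and optional columns in the schema,
and leaves the 'date' value in a form suitable for the secondary index used
by garbage collection.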