~ara/autotest/automated-ubuntu-server-tests

Viewing all changes in revision 3877.

Committer: showard
Date: 2009-12-08 22:21:02 UTC
Revision ID: vcs-imports@canonical.com-20091208222102-ez7vs21i0o8obsv7

Make drone_manager track running process counts using only the information passed in from the scheduler.  Currently it also uses process counts derived from "ps", but that is an unreliable source of information.  This improves accuracy and consistency and gives us full control over how processes are counted.

This involves a few primary changes:
* made the drone_manager track process counts with each PidfileId
* added method declare_process_count() for the scheduler to indicate the process count of a pidfile ID during recovery (in other cases, the DroneManager gets that info in execute_process())
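
As a rough illustration of the bookkeeping described above, here is a minimal sketch.  It is not the actual drone_manager code: the class name ProcessCountTracker, the simplified method signatures, the unregister_pidfile() helper and the use of plain strings as pidfile IDs are all assumptions made for the example.  Only the method names execute_process() and declare_process_count() come from this change.

```python
class ProcessCountTracker(object):
    """Hypothetical stand-in for the process-count part of DroneManager."""

    def __init__(self):
        # Maps a pidfile ID to the process count the scheduler gave us.
        self._process_counts = {}

    def execute_process(self, pidfile_id, num_processes):
        # Normal path: the count arrives together with the launch request.
        # (Simplified signature; the real method takes more arguments.)
        self._process_counts[pidfile_id] = num_processes

    def declare_process_count(self, pidfile_id, num_processes):
        # Recovery path: the process is already running, so the scheduler
        # has to tell us how many processes this pidfile ID accounts for.
        self._process_counts[pidfile_id] = num_processes

    def unregister_pidfile(self, pidfile_id):
        # Hypothetical helper: stop counting a pidfile that is finished.
        self._process_counts.pop(pidfile_id, None)

    def total_running_processes(self):
        # Derived purely from scheduler-supplied counts, never from "ps".
        return sum(self._process_counts.values())
```

The point of the sketch is that both entry points write to the same structure, so the total never depends on anything observed on the drone itself.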

Doing this involved some extensive refactorings.  Because the scheduler now needs to declare process counts during recovery, and because the AgentTasks are the entities that know about process counts, it made sense to move the bulk of the recovery process to the AgentTasks.  Changes for this include:
* converted a bunch of AgentTask instance variables to abstract methods, and added overriding implementations in subclasses as necessary
* added methods register_necessary_pidfiles() and recover() to AgentTasks, allowing them to perform recovery for themselves.  Got rid of the recover_run_monitor() argument to AgentTasks as a result.
* changed recovery code to delegate most of the work to the AgentTasks.  The flow now looks like this: create all AgentTasks, call them to register pidfiles, call DroneManager to refresh pidfile contents, call AgentTasks to recover themselves, perform extra cleanup and error checking.  This simplified the Dispatcher somewhat, in my opinion, though there's room for more simplification.
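
The recovery flow listed above could be sketched roughly as follows.  This is an illustration under assumptions, not the Dispatcher code: SketchAgentTask, StubDroneManager, run_recovery() and the shared log list are hypothetical names invented for the example, and the refresh() method name is assumed.  Only register_necessary_pidfiles() and recover() are taken from this change.

```python
class SketchAgentTask(object):
    """Hypothetical AgentTask that records which steps were invoked."""

    def __init__(self, name, log):
        self.name = name
        self._log = log

    def register_necessary_pidfiles(self):
        # Tell the drone manager which pidfiles this task cares about.
        self._log.append(("register", self.name))

    def recover(self):
        # Re-attach to any process found via the refreshed pidfiles.
        self._log.append(("recover", self.name))


class StubDroneManager(object):
    """Hypothetical stand-in; refresh() is an assumed method name."""

    def __init__(self, log):
        self._log = log

    def refresh(self):
        # Read the contents of every registered pidfile in one pass.
        self._log.append(("refresh", None))


def run_recovery(tasks, drone_manager, log):
    # Step 1: every task registers its pidfiles before anything is read.
    for task in tasks:
        task.register_necessary_pidfiles()
    # Step 2: a single refresh picks up the contents of all pidfiles.
    drone_manager.refresh()
    # Step 3: each task recovers itself from the refreshed state.
    for task in tasks:
        task.recover()
    # Step 4: placeholder for the extra cleanup and error checking.
    log.append(("cleanup", None))
```

Registering all pidfiles before the refresh means one refresh serves every task, which is why the two per-task loops are kept separate in the sketch.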

Other changes include:
* removed DroneManager.get_process_for(), which was unused, as well as related code (including the DroneManager._processes structure)
* moved logic from HostQueueEntry.handle_host_failure to SpecialAgentTask._fail_queue_entry.  That was the only call site.

And some other bug fixes:
* eliminated some extra state from QueueTask
* fixed models.HostQueueEntry.execution_path(). It was returning the wrong value, but it was never used.
* eliminated some big chunks from monitor_db_unittest.  These broke from the refactorings described above and I deemed it not worthwhile to fix them up for the new code.  I checked and the total coverage was unaffected by deleting these chunks.

Signed-off-by: Steve Howard <showard@google.com>