In juju, some things have "lifecycles". We sometimes refer to these things as
"entities", but I think that term is becoming overloaded to mean "thing that
can connect to state", i.e. unit and machine agents (and then presumably, one
day, users). So we don't have a good name for the group, but happily there are
only 4: machines, units, services, and relations.

...and there are only 4 possible states for the above things:

There are two fundamental truths in this system:

* All such things start existence Alive.
* No such thing can ever change to an earlier state.
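
The two truths above can be sketched in a few lines. This is only an illustration in Python, not juju's actual code: the names `Life`, `Entity`, and `set_life` are hypothetical, and I'm assuming that skipping a state (Alive straight to Dead) is permitted, which the text above neither confirms nor rules out. Removal from state is deliberately left out of the enum.

```python
from enum import IntEnum


class Life(IntEnum):
    """Life states, ordered so that a later state compares greater."""
    ALIVE = 0
    DYING = 1
    DEAD = 2


def can_advance(current: Life, target: Life) -> bool:
    """Return True if moving from current to target never goes backwards."""
    return target >= current


class Entity:
    """Hypothetical stand-in for a machine, unit, service, or relation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.life = Life.ALIVE  # truth 1: everything starts existence Alive

    def set_life(self, target: Life) -> None:
        # truth 2: nothing can ever change to an earlier state
        if not can_advance(self.life, target):
            raise ValueError(
                f"{self.name}: cannot go from {self.life.name} to {target.name}"
            )
        self.life = target
```

An `Entity` that has been set to `Life.DYING` will accept `Life.DEAD` but reject `Life.ALIVE` with a `ValueError`.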

...and just about all else is tricky detail that is worth examining from a few
different angles. First of all, it's worth considering the things that happen to
each state object separately, insofar as that is possible:

Machines
--------

* Like everything else, a machine starts out Alive.
* While it is Alive, principal units can be assigned to it.
* When no principal units are assigned to it, it can be destroyed with
  `juju destroy-machine`. This will set the machine to Dying, but will
  fail when units are assigned. (Future plans: allow a machine to become
  Dying when it has principal units, so long as they are not Alive. For now
  it's extra complexity with little direct benefit.)
* Once a machine has been set to Dying, the corresponding Machine Agent (MA)
  is responsible for setting it to Dead. (Future plans: when Dying units are
  assigned, wait for them to become Dead and remove them completely before
  making the machine Dead; not an issue now because the machine can't yet
  become Dying with units assigned.)
* When the Provisioning Agent (PA) observes that the machine has become Dead,
  it releases the underlying instance back to the provider and removes the
  machine object from state. (Future uncertainty: should the PA provision an
  instance for a Dying machine? At the moment, no, because a Dying machine
  can't have any units in the first place; in the future, er, maybe, because
  those Dying units may be attached to persistent storage and should thus be
  allowed to continue to shut down cleanly as they would usually do. Maybe.)
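
To make the hand-offs between the client, the MA, and the PA concrete, here is a toy in-memory model of the machine rules above. Everything in it (`Machine`, `destroy_machine`, `machine_agent_step`, `provisioner_step`, the plain dict standing in for state) is a hypothetical name chosen for the sketch; it ignores transactions and watchers entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALIVE, DYING, DEAD = "alive", "dying", "dead"


class HasAssignedUnits(Exception):
    """Raised when a machine with assigned principals is destroyed."""


@dataclass
class Machine:
    id: str
    life: str = ALIVE
    principals: List[str] = field(default_factory=list)


def assign_principal(machine: Machine, unit: str) -> None:
    # Principal units can only be assigned while the machine is Alive.
    if machine.life != ALIVE:
        raise ValueError(f"machine {machine.id} is not alive")
    machine.principals.append(unit)


def destroy_machine(machine: Machine) -> None:
    # Models `juju destroy-machine`: fails when units are assigned.
    if machine.principals:
        raise HasAssignedUnits(machine.id)
    if machine.life == ALIVE:
        machine.life = DYING


def machine_agent_step(machine: Machine) -> None:
    # The MA is responsible for taking a Dying machine to Dead.
    if machine.life == DYING:
        machine.life = DEAD


def provisioner_step(state: Dict[str, Machine]) -> List[str]:
    # The PA releases the instances of Dead machines and removes them from state.
    released = [mid for mid, m in state.items() if m.life == DEAD]
    for mid in released:
        del state[mid]
    return released
```

Note that each step only ever looks at the life state left behind by the previous actor, which is what lets the three parties run independently.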

Units
-----

* While a principal unit is Alive, it can be assigned to machines and can have
  subordinate units added. When subordinate units are added, it is responsible
  for them in exactly the same
* A unit can become Dying at any time, but may not become Dead while any unit
  subordinate to it exists, or when the unit is in scope for any relation.
  (Future plans: `juju destroy-unit --force`, which will clean up a unit and
  all its subordinates and relations.)
* A unit can become Dying in one of two ways:
  * `juju destroy-unit`
  * `juju destroy-service` (This happens indirectly: the Unit Agents (UAs)
    for each unit of a service set their corresponding units to Dying when
    they detect service Dying; this is because we try to assume 100k-scale
    and we can't use mgo/txn to do a bulk update of 100k units because that
    makes for a transaction with at least 100k operations and that's just
* When a principal unit becomes Dying, all its subordinates must also become
  Dying.
* When a unit is Dying, its UA is responsible for removing impediments to the
  unit becoming Dead, and then making it so. To do so, the UA must:
  * Depart from all its relations in an orderly fashion.
  * Wait for all its subordinates to become Dead, and remove them from state.
  * Set its unit to Dead.
* As just noted, when a subordinate unit is Dead, it is removed from state by
  its principal's UA; similarly, when a principal unit is Dead, it is removed
  from state by the MA of the machine to which it is assigned.
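
The Dying-to-Dead sequence for a unit can be sketched roughly as follows. This is a simplification under loud assumptions: `Unit`, `destroy_unit`, and `unit_agent_step` are hypothetical names, departing a relation is reduced to clearing a set (no hooks are run), and "waiting" is modelled by the step function returning False so that the caller tries again later.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Unit:
    name: str
    life: str = "alive"
    relations_in_scope: Set[str] = field(default_factory=set)
    subordinates: List["Unit"] = field(default_factory=list)


def destroy_unit(unit: Unit) -> None:
    """Set a unit to Dying; its subordinates must become Dying too."""
    if unit.life == "alive":
        unit.life = "dying"
    for sub in unit.subordinates:
        destroy_unit(sub)


def unit_agent_step(unit: Unit) -> bool:
    """One pass of a unit's agent; returns True once the unit is Dead."""
    if unit.life != "dying":
        return unit.life == "dead"
    # 1. Depart from all relations (hook execution omitted in this sketch).
    unit.relations_in_scope.clear()
    # 2. Remove subordinates that are already Dead; keep waiting for the rest.
    unit.subordinates = [s for s in unit.subordinates if s.life != "dead"]
    if unit.subordinates:
        return False
    # 3. Nothing blocks us any more, so the unit may become Dead.
    unit.life = "dead"
    return True
```

A principal's step keeps returning False until its subordinate's own agent has run and made that subordinate Dead; removing the Dead principal itself is then the MA's job, not the UA's.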

Services
--------

* Unlike units and machines, services have no corresponding agent.
* In addition, services are never Dead: they can only be Alive or Dying (or
  removed, if you want to consider that a life state).
* When a service is Alive, units may be added to it, and the service's
  endpoints can be added to relations.
* A service can become Dying at any time, via `juju destroy-service`. This
  causes all the units to become Dying, as discussed above, and will also
  cause all relations in which the service is participating to become Dying.
  (Contentious? I *think* python blocks service removal until relations are
  removed.)
* If no associated units or relations exist when a service is destroyed, the
  service is removed from state immediately rather than being set to Dying.
* If no associated relations exist, the service is removed by the MA which
  removes the last unit of that service from state.
* If no units of the service remain, but its relations still exist, the
  responsibility for removing the service falls to the last UA to leave scope
  for that relation. (Yes, this is a UA for a unit of a totally different
  service.)

Relations
---------

* A relation, like a service, has no corresponding agent and cannot be Dead.
* While a relation is Alive, units of services in that relation can enter its
  scope.
* A relation can become Dying via either `juju destroy-relation` or
  `juju destroy-service`. (Maybe; see services, above.)
* When a relation is destroyed with no units in scope, it will be removed
  from state immediately rather than being set to Dying.
* When a relation becomes Dying, the UAs of units that have entered its scope
  are responsible for cleanly departing the relation by running hooks and then
  leaving relation scope.
* When the last unit leaves the scope of a Dying relation, it must remove the
  relation from state.
* As noted above, the Dying relation may be the only thing keeping a Dying
  service (different to that of the acting UA) from removal; so, relation
  removal may also imply service removal.
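
The interplay between relation removal and service removal is the fiddliest part of the above, so here is a small in-memory sketch of it. The `State` class and its method names are hypothetical, it is not how juju stores anything, and it skips the step where UAs set their own units to Dying.

```python
class State:
    """Toy model of services and relations; not juju's real state layer."""

    def __init__(self):
        self.services = {}   # name -> {"life": str, "units": set of unit names}
        self.relations = {}  # key -> {"life": str, "services": set, "scope": set}

    def add_service(self, name, units=()):
        self.services[name] = {"life": "alive", "units": set(units)}

    def add_relation(self, key, services, scope=()):
        self.relations[key] = {
            "life": "alive", "services": set(services), "scope": set(scope),
        }

    def _relations_of(self, service):
        return [k for k, r in self.relations.items() if service in r["services"]]

    def destroy_service(self, name):
        svc = self.services[name]
        keys = self._relations_of(name)
        if not svc["units"] and not keys:
            del self.services[name]  # nothing depends on it: remove at once
            return
        svc["life"] = "dying"
        for key in keys:
            self.destroy_relation(key)

    def destroy_relation(self, key):
        if not self.relations[key]["scope"]:
            self._remove_relation(key)  # no units in scope: remove at once
        else:
            self.relations[key]["life"] = "dying"

    def leave_scope(self, key, unit):
        rel = self.relations[key]
        rel["scope"].discard(unit)
        # The last unit to leave a Dying relation's scope removes the relation.
        if rel["life"] == "dying" and not rel["scope"]:
            self._remove_relation(key)

    def _remove_relation(self, key):
        rel = self.relations.pop(key)
        # The relation may have been all that kept a Dying service around.
        for name in rel["services"]:
            svc = self.services.get(name)
            if (svc and svc["life"] == "dying" and not svc["units"]
                    and not self._relations_of(name)):
                del self.services[name]
```

In this model, when a unit of one service is the last to leave the scope of a Dying relation, the same call may end up deleting a Dying service on the other end of that relation.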

OK, that was a bit of a hail of bullets, and the motivations for the above are
perhaps not always clear. To consider it from another angle:

* Subordinate units reference principal units.
* Principal units reference machines.
* All units reference their services.
* All units reference the relations whose scopes they have joined.
* All relations reference the services they are part of.

In every case above, where X references Y, the life state of an X may be
sufficient to prevent a change in the life state of a Y; and, conversely, a
life change in an X may be sufficient to cause a life change in a Y. (In only
one case does the reverse hold -- that is, setting a service to Dying will
cause its units' agents to individually set the units to Dying -- and this is
just an implementation detail.)
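
If it helps, the reference list can be written down as data and queried in the "who might be holding Y back?" direction. The dict layout and the `referrers` helper are purely illustrative names of my own.

```python
# Each key references every kind listed in its value (X -> Y).
REFERENCES = {
    "subordinate unit": ["principal unit", "service", "relation"],
    "principal unit": ["machine", "service", "relation"],
    "relation": ["service"],
}


def referrers(kind):
    """Return the kinds whose life state may affect the given kind."""
    return sorted(x for x, ys in REFERENCES.items() if kind in ys)
```

For example, `referrers("service")` lists both kinds of unit and relations, which matches the service removal rules described earlier.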

The following scrawl may help you to visualize the references in play:

        +-----------+       +---------+
    +-->| principal |------>| machine |
    |   +-----------+       +---------+
    |   +----------+       +---------+
    |   | relation |------>| service |
    |   +----------+       +---------+

...but it is important to remember that it's only one view of the relationships
involved, and that the user-centric view is quite different; from a user's
perspective the influences appear to travel in the opposite direction:

* (destroying a machine "would" destroy its principals but that's disallowed)
* destroying a principal destroys all its subordinates
* (destroying a subordinate directly is impossible)
* destroying a service destroys all its units and relations
* destroying a container relation destroys all subordinates in the relation
* (destroying a global relation destroys nothing else)

...and it takes a combination of these viewpoints to understand the detailed
interactions laid out above.

It may also be instructive to consider the responsibilities of the unit and
machine agents. The unit agent is responsible for:

* detecting its service's Dying state, and setting its own Dying state.
* detecting its own Dying state, and:
  * leaving relation scopes
    * possibly removing those relations
      * possibly removing some of those relations' services
  * waiting for its subordinates' Dead states and
    * removing the subordinates
      * possibly removing their services
  * finally setting its own Dead state

The machine agent, meanwhile, is responsible for:

* detecting its principals' Dead states and:
  * removing the principals
    * possibly removing their services
* detecting its machine's Dying state, and setting it to Dead.

Finally, the provisioning agent is responsible for:

* detecting machines' Dead states, and removing the machines.

(Oh: it should be noted explicitly that the PA is responsible for provisioning
Alive machines, that an MA is responsible for deploying Alive principals, that
a principal UA is responsible for creating and deploying its Alive subordinates,
and that all UAs are responsible for entering relevant relation scopes; but that
side is easy, and the what-do-we-do-with-a-Dying-object problem seems to me to
have the same tradeoffs regardless of object: the problem is still, and always,
about how we should treat Dying units backed by persistent storage, and the
answer remains easy... for now.)

We use the mgo/txn package for easy multi-document transactions against MongoDB.
Without this, it would be very difficult to enforce the various conditions
described above (consider, for example, the race between one client assigning a
new principal to a machine that another client is trying to destroy). However,
even with mgo/txn doing an awful lot of heavy lifting, actually landing a
consistent transaction occasionally demands a somewhat taxing level of attention
to detail.

At the moment, the only type whose lifecycle has been implemented approximately
correctly is relation... and even that's incomplete, because RelationUnit's
LeaveScope method fails to take into account the possibility that it might be
responsible for destroying a service as well as just a relation. But, I guess,
we haven't started looking at service destruction yet, so this is maybe not so

So -- IMO, none of this has yet been implemented to an adequate standard, and
we're not short of detailed work to do. I'm currently focused on subordinates,
which will demand correct implementations of:

* relation destruction (done)
* service destruction (not started)
* unit destruction (not started)

...and if we're doing all those we really should do machine destruction too,
especially considering it'll be easiest, but I'm not even going to think about
that until I've dealt with the rest.

Implementation issues
---------------------

As mentioned, we save ourselves some hassle by restricting machines to become
Dying only when they have no units (rather than when none of their units are
Alive). Fixing this will probably be trivial (we have to solve almost exactly
the same problem for subordinate units wrt their principal), but since we don't
*have* to worry about it yet I'm ignoring it.

But there's a trickier one: as you will no doubt have observed, the unit is
mired deep in dependencies, and a non-functioning unit agent can potentially
paralyse whole chunks of the system [0]. So, we decided a while ago that we should
support forcible unit destruction: that it should be possible to set a unit to

However, the number of interactions between units and other objects is a little
bit daunting. In the worst case, forcibly destroying a unit and keeping state
consistent may require:

* that the unit leave several relation scopes
* that the unit remove several relations it had been in scope of
* that the unit remove some services on the other end of removed relations
* that all the same things happen for all the unit's subordinates.

To take a pessimistic view of the matter, "just" setting a unit to Dead may
require hundreds of txn operations in a pathological case. However, none of
the alternatives fill me with great joy; I am starting to feel that a big
ugly bottleneck txn (specifically to deal with known-pathological cases) may
in fact be the most reliable and comprehensible way to express the desired
state change. More discussions on this topic will most assuredly come to pass.

[0] Heh. Frankly, so can a paralysed MA. `juju destroy-machine --force`..? If unit
force destruction seemed complex, the machine case is even worse because machines
can (in theory, if not currently in practice) have several principal units. No, I
haven't figured out how to deal with a paralysed PA either...