~rogpeppe/juju-core/axwalk-lp1300889-disable-mongo-keyfile : revision 773.7.1

Lifecycles
==========

In juju, some things have "lifecycles". We sometimes refer to these things as
"entities", but I think that term is becoming overloaded to mean "thing that
can connect to state", i.e. unit and machine agents (and then presumably, one
day, users). So we don't have a good name for the group, but happily there are
only 4:

* Machines
* Units
* Services
* Relations

...and there are only 4 possible states for the above things:

* Alive
* Dying
* Dead
* (removed)

There are two fundamental truths in this system:

* All such things start existence Alive.
* No such thing can ever change to an earlier state.
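
The two truths above can be sketched in a few lines of Go. This is only an illustration: the `Life` type, its constants and the `Advance` method are names made up for this example, not necessarily what the state package uses, and "removed" is modelled as the object simply no longer existing rather than as a value.

```go
package main

import (
	"errors"
	"fmt"
)

// Life is a hypothetical representation of the states listed above.
// The zero value is Alive, so every new object starts out Alive.
type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

func (l Life) String() string {
	switch l {
	case Alive:
		return "Alive"
	case Dying:
		return "Dying"
	case Dead:
		return "Dead"
	}
	return "unknown"
}

var errBackwards = errors.New("life cannot move to an earlier state")

// Advance returns next if it does not precede l, and an error otherwise.
func (l Life) Advance(next Life) (Life, error) {
	if next < l {
		return l, errBackwards
	}
	return next, nil
}

func main() {
	l := Alive
	l, _ = l.Advance(Dying)
	_, err := l.Advance(Alive) // rejected: Dying can never go back to Alive
	fmt.Println(l, err)
}
```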

...and just about all else is tricky detail that is worth examining from a few
different angles. First of all, it's worth considering the things that happen to
each state object separately, insofar as that is possible:

Machines
--------

* Like everything else, a machine starts out Alive.
* While it is Alive, principal units can be assigned to it.
* When no principal units are assigned to it, it can be destroyed with
  `juju destroy-machine`. This will set the machine to Dying, but will
  fail when units are assigned. (Future plans: allow a machine to become
  Dying when it has principal units, so long as they are not Alive. For now
  it's extra complexity with little direct benefit.)
* Once a machine has been set to Dying, the corresponding Machine Agent (MA)
  is responsible for setting it to Dead. (Future plans: when Dying units are
  assigned, wait for them to become Dead and remove them completely before
  making the machine Dead; not an issue now because the machine can't yet
  become Dying with units assigned.)
* When the Provisioning Agent (PA) observes that the machine has become Dead,
  it releases the underlying instance back to the provider and removes the
  machine object from state. (Future uncertainty: should the PA provision an
  instance for a Dying machine? At the moment, no, because a Dying machine
  can't have any units in the first place; in the future, er, maybe, because
  those Dying units may be attached to persistent storage and should thus be
  allowed to continue to shut down cleanly as they would usually do. Maybe.)
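
The machine rules above can be modelled in memory as follows. This is a rough sketch, not the state package: `Machine`, `AssignPrincipal`, `Destroy`, `EnsureDead` and `Reap` are hypothetical names, and the real thing has to express the same checks as transactions against MongoDB.

```go
package main

import (
	"errors"
	"fmt"
)

type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// Machine is a hypothetical in-memory stand-in for a machine document.
type Machine struct {
	Life       Life
	Principals []string // names of assigned principal units
}

var (
	errNotAlive = errors.New("machine is not Alive")
	errHasUnits = errors.New("machine has principal units assigned")
	errNotDying = errors.New("machine is not Dying")
)

// AssignPrincipal: units can only be assigned while the machine is Alive.
func (m *Machine) AssignPrincipal(unit string) error {
	if m.Life != Alive {
		return errNotAlive
	}
	m.Principals = append(m.Principals, unit)
	return nil
}

// Destroy models `juju destroy-machine`: it fails while units are assigned.
func (m *Machine) Destroy() error {
	if len(m.Principals) > 0 {
		return errHasUnits
	}
	if m.Life == Alive {
		m.Life = Dying
	}
	return nil
}

// EnsureDead models the MA's job: a Dying machine is set to Dead.
func (m *Machine) EnsureDead() error {
	if m.Life == Alive {
		return errNotDying
	}
	m.Life = Dead
	return nil
}

// Reap models the PA's job: a Dead machine is removed from state.
// (Releasing the instance back to the provider is left out.)
func Reap(machines map[string]*Machine, id string) bool {
	m, ok := machines[id]
	if !ok || m.Life != Dead {
		return false
	}
	delete(machines, id)
	return true
}

func main() {
	machines := map[string]*Machine{"0": &Machine{}}
	m := machines["0"]
	fmt.Println(m.AssignPrincipal("wordpress/0"))
	fmt.Println(m.Destroy()) // refused: a unit is still assigned
	m.Principals = nil       // pretend the unit has since been removed
	fmt.Println(m.Destroy(), m.EnsureDead(), Reap(machines, "0"))
}
```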

Units
-----

* While a principal unit is Alive, it can be assigned to machines and can have
  subordinate units added. When subordinate units are added, it is responsible
  for them in exactly the same way that a machine is responsible for its
  principal units.
* A unit can become Dying at any time, but may not become Dead while any unit
  subordinate to it exists, or when the unit is in scope for any relation.
  (Future plans: `juju destroy-unit --force`, which will clean up a unit and
  all its subordinates and relations.)
* A unit can become Dying in one of two ways:
  * `juju destroy-unit`
  * `juju destroy-service` (This happens indirectly: the Unit Agents (UAs)
    for each unit of a service set their corresponding units to Dying when
    they detect service Dying; this is because we try to assume 100k-scale
    and we can't use mgo/txn to do a bulk update of 100k units because that
    makes for a transaction with at least 100k operations and that's just
    crazy.)
* When a principal unit becomes Dying, all its subordinates must also become
  Dying.
* When a unit is Dying, its UA is responsible for removing impediments to the
  unit becoming Dead, and then making it so. To do so, the UA must:
  * Depart from all its relations in an orderly fashion.
  * Wait for all its subordinates to become Dead, and remove them from state.
  * Set its unit to Dead.
* As just noted, when a subordinate unit is Dead, it is removed from state by
  its principal's UA; similarly, when a principal unit is Dead, it is removed
  from state by the MA of the machine to which it is assigned.
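
As a minimal sketch of the UA's teardown duties described above, assuming a toy in-memory `Unit` type (every name below is invented for the example, and running hooks is reduced to a comment):

```go
package main

import "fmt"

type Life int

const (
	Alive Life = iota
	Dying
	Dead
)

// Unit is a hypothetical in-memory stand-in for a unit document.
type Unit struct {
	Name         string
	Life         Life
	Scopes       map[string]bool // relations whose scope the unit has entered
	Subordinates []*Unit
}

// Destroy sets the unit to Dying; its subordinates must become Dying too.
func (u *Unit) Destroy() {
	if u.Life == Alive {
		u.Life = Dying
	}
	for _, s := range u.Subordinates {
		s.Destroy()
	}
}

// Step is one pass of the unit agent's work. It reports whether the unit is
// Dead afterwards.
func (u *Unit) Step() bool {
	if u.Life != Dying {
		return u.Life == Dead
	}
	// 1. Depart from all relations (departure hooks would run here).
	for name := range u.Scopes {
		delete(u.Scopes, name)
	}
	// 2. Remove Dead subordinates from state; keep waiting for the others.
	var waiting []*Unit
	for _, s := range u.Subordinates {
		if s.Life != Dead {
			waiting = append(waiting, s)
		}
	}
	u.Subordinates = waiting
	if len(waiting) > 0 {
		return false
	}
	// 3. Nothing is in the way any more: set the unit to Dead.
	u.Life = Dead
	return true
}

func main() {
	sub := &Unit{Name: "logging/0", Scopes: map[string]bool{"info": true}}
	principal := &Unit{
		Name:         "wordpress/0",
		Scopes:       map[string]bool{"db": true},
		Subordinates: []*Unit{sub},
	}
	principal.Destroy()
	// The principal has to wait until the subordinate's own agent is done.
	fmt.Println(principal.Step(), sub.Step(), principal.Step())
}
```

Removing the Dead principal itself is not shown; per the list above, that falls to the MA of the machine it is assigned to.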

Services
--------

* Unlike units and machines, services have no corresponding agent.
* In addition, services are never Dead: they can only be Alive or Dying (or
  removed, if you want to consider that a life state).
* When a service is Alive, units may be added to it, and the service's
  endpoints can be added to relations.
* A service can become Dying at any time, via `juju destroy-service`. This
  causes all the units to become Dying, as discussed above, and will also
  cause all relations in which the service is participating to become Dying.
  (Contentious? I *think* python blocks service removal until relations are
  removed...)
* If no associated units or relations exist when a service is destroyed, the
  service is removed from state immediately rather than being set to Dying.
* If no associated relations exist, the service is removed by the MA which
  removes the last unit of that service from state.
* If no units of the service remain, but its relations still exist, the
  responsibility for removing the service falls to the last UA to leave scope
  for that relation. (Yes, this is a UA for a unit of a totally different
  service.)
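
One way to picture the service rules is as a pair of reference counts: whoever removes the last unit or relation of a Dying service must remove the service too. The sketch below assumes exactly that; the `Service` type, its counters and its methods are hypothetical and only illustrate the decision, not how it is stored.

```go
package main

import "fmt"

type Life int

const (
	Alive Life = iota
	Dying
	// Services are never Dead, so no Dead value is needed here.
)

// Service is a hypothetical in-memory stand-in for a service document.
type Service struct {
	Life          Life
	UnitCount     int
	RelationCount int
}

// Destroy models `juju destroy-service`. It reports whether the service
// should be removed from state immediately.
func (s *Service) Destroy() (removeNow bool) {
	if s.UnitCount == 0 && s.RelationCount == 0 {
		return true
	}
	s.Life = Dying
	return false
}

// removable reports whether a Dying service has nothing left referencing it.
func (s *Service) removable() bool {
	return s.Life == Dying && s.UnitCount == 0 && s.RelationCount == 0
}

// RemoveUnit is called by whoever removes a unit of the service from state.
// A true result means the caller must remove the service as well.
func (s *Service) RemoveUnit() bool {
	s.UnitCount--
	return s.removable()
}

// RemoveRelation is called by the UA that removes a relation from state.
// A true result means the caller must remove the service as well.
func (s *Service) RemoveRelation() bool {
	s.RelationCount--
	return s.removable()
}

func main() {
	svc := &Service{UnitCount: 1, RelationCount: 1}
	fmt.Println(svc.Destroy())        // false: set to Dying instead
	fmt.Println(svc.RemoveUnit())     // false: a relation still exists
	fmt.Println(svc.RemoveRelation()) // true: last reference is gone
}
```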

Relations
---------

* A relation, like a service, has no corresponding agent and cannot be Dead.
* While a relation is Alive, units of services in that relation can enter its
  scope.
* A relation can become Dying via either `juju destroy-relation` or
  `juju destroy-service`. (Maybe; see services, above.)
* When a relation is destroyed with no units in scope, it will be removed
  from state immediately rather than being set to Dying.
* When a relation becomes Dying, the UAs of units that have entered its scope
  are responsible for cleanly departing the relation by running hooks and then
  leaving relation scope.
* When the last unit leaves the scope of a Dying relation, it must remove the
  relation from state.
* As noted above, the Dying relation may be the only thing keeping a Dying
  service (different to that of the acting UA) from removal; so, relation
  removal may also imply service removal.
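
The scope rules above might be sketched like this. The `Relation` type and its methods are invented for the example (the real RelationUnit API may look different), and the follow-on check for a removable Dying service is only mentioned in a comment.

```go
package main

import (
	"errors"
	"fmt"
)

type Life int

const (
	Alive Life = iota
	Dying
	// Like services, relations are never Dead.
)

// Relation is a hypothetical in-memory stand-in for a relation document.
type Relation struct {
	Life    Life
	InScope map[string]bool // names of units that have entered scope
}

var errNotAlive = errors.New("relation is not Alive")

// EnterScope: units may only enter the scope of an Alive relation.
func (r *Relation) EnterScope(unit string) error {
	if r.Life != Alive {
		return errNotAlive
	}
	if r.InScope == nil {
		r.InScope = make(map[string]bool)
	}
	r.InScope[unit] = true
	return nil
}

// Destroy reports whether the relation should be removed immediately, which
// is the case when no units are in scope. Otherwise it becomes Dying.
func (r *Relation) Destroy() (removeNow bool) {
	if len(r.InScope) == 0 {
		return true
	}
	r.Life = Dying
	return false
}

// LeaveScope reports whether the departing unit was the last one in scope of
// a Dying relation, in which case its UA must remove the relation (and then
// check whether a Dying service has become removable as a result).
func (r *Relation) LeaveScope(unit string) (removeRelation bool) {
	delete(r.InScope, unit)
	return r.Life == Dying && len(r.InScope) == 0
}

func main() {
	rel := &Relation{}
	rel.EnterScope("wordpress/0")
	rel.EnterScope("mysql/0")
	fmt.Println(rel.Destroy())                 // false: units are in scope
	fmt.Println(rel.EnterScope("wordpress/1")) // refused: relation is Dying
	fmt.Println(rel.LeaveScope("wordpress/0")) // false: mysql/0 remains
	fmt.Println(rel.LeaveScope("mysql/0"))     // true: last one out
}
```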

References
----------

OK, that was a bit of a hail of bullets, and the motivations for the above are
perhaps not always clear. To consider it from another angle:

* Subordinate units reference principal units.
* Principal units reference machines.
* All units reference their services.
* All units reference the relations whose scopes they have joined.
* All relations reference the services they are part of.

In every case above, where X references Y, the life state of an X may be
sufficient to prevent a change in the life state of a Y; and, conversely, a
life change in an X may be sufficient to cause a life change in a Y. (In only
one case does the reverse hold -- that is, setting a service to Dying will
cause its units' agents to individually set the units to Dying -- and this is
just an implementation detail.)

The following scrawl may help you to visualize the references in play:

        +-----------+       +---------+
    +-->| principal |------>| machine |
    |   +-----------+       +---------+
    |      |     |
    |      |     +--------------+
    |      |                    |
    |      V                    V
    |   +----------+       +---------+
    |   | relation |------>| service |
    |   +----------+       +---------+
    |      A                    A
    |      |                    |
    |      |     +--------------+
    |      |     |
    |   +-------------+
    +---| subordinate |
        +-------------+

...but it is important to remember that it's only one view of the relationships
involved, and that the user-centric view is quite different; from a user's
perspective the influences appear to travel in the opposite direction:

* (destroying a machine "would" destroy its principals but that's disallowed)
* destroying a principal destroys all its subordinates
* (destroying a subordinate directly is impossible)
* destroying a service destroys all its units and relations
* destroying a container relation destroys all subordinates in the relation
* (destroying a global relation destroys nothing else)

...and it takes a combination of these viewpoints to understand the detailed
interactions laid out above.

Agents
------

It may also be instructive to consider the responsibilities of the unit and
machine agents. The unit agent is responsible for:

* detecting its service's Dying state, and setting its own Dying state.
* detecting its own Dying state, and:
  * leaving relation scopes
    * possibly removing those relations
      * possibly removing some of those relations' services
  * waiting for its subordinates' Dead states and
    * removing the subordinates
      * possibly removing their services
  * finally setting its own Dead state

The machine agent, meanwhile, is responsible for:

* detecting its principals' Dead states and:
  * removing the principals
    * possibly removing their services
* detecting its machine's Dying state, and setting it to Dead.

Finally, the provisioning agent is responsible for:

* detecting machines' Dead states, and removing the machines.

(Oh: it should be noted explicitly that the PA is responsible for provisioning
Alive machines, that an MA is responsible for deploying Alive principals, that
a principal UA is responsible for creating and deploying its Alive subordinates,
and that all UAs are responsible for entering relevant relation scopes; but that
side is easy, and the what-do-we-do-with-a-Dying-object problem seems to me to
have the same tradeoffs regardless of object: the problem is still, and always,
about how we should treat Dying units backed by persistent storage, and the
answer remains easy... for now.)

Implementation
--------------

We use the mgo/txn package for easy multi-document transactions against MongoDB.
Without this, it would be very difficult to enforce the various conditions
described above (consider, for example, the race between one client assigning a
new principal to a machine that another client is trying to destroy). However,
even with mgo/txn doing an awful lot of heavy lifting, actually landing a
consistent transaction occasionally demands a somewhat taxing level of attention
to detail.
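
To make the assign-versus-destroy race concrete, here is a toy model of the assert-then-update idea. It deliberately does not use the real mgo/txn API: the `op` type, the `run` function and the document fields are stand-ins invented for this sketch, and real assertions are expressed as queries against documents rather than Go closures. What it does show is that whichever client lands first, the other's assertion fails and its whole transaction is aborted.

```go
package main

import (
	"errors"
	"fmt"
)

// doc is a hypothetical machine document with only the fields needed here.
type doc struct {
	Life       string
	Principals []string
}

// op imitates the general shape of a transaction operation: a condition that
// must hold for a document, plus the change to apply if every condition in
// the transaction holds.
type op struct {
	ID     string
	Assert func(d *doc) bool
	Update func(d *doc)
}

var errAborted = errors.New("transaction aborted")

// run checks every assertion before applying any update, so a transaction
// either applies completely or not at all.
func run(store map[string]*doc, ops []op) error {
	for _, o := range ops {
		d, ok := store[o.ID]
		if !ok || (o.Assert != nil && !o.Assert(d)) {
			return errAborted
		}
	}
	for _, o := range ops {
		if o.Update != nil {
			o.Update(store[o.ID])
		}
	}
	return nil
}

// assignPrincipal requires the machine to be alive.
func assignPrincipal(machineID, unit string) []op {
	return []op{{
		ID:     machineID,
		Assert: func(d *doc) bool { return d.Life == "alive" },
		Update: func(d *doc) { d.Principals = append(d.Principals, unit) },
	}}
}

// destroyMachine requires the machine to be alive and to have no principals.
func destroyMachine(machineID string) []op {
	return []op{{
		ID:     machineID,
		Assert: func(d *doc) bool { return d.Life == "alive" && len(d.Principals) == 0 },
		Update: func(d *doc) { d.Life = "dying" },
	}}
}

func main() {
	store := map[string]*doc{"m0": &doc{Life: "alive"}}
	fmt.Println(run(store, destroyMachine("m0")))                 // lands first
	fmt.Println(run(store, assignPrincipal("m0", "wordpress/0"))) // aborted
}
```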

At the moment, the only type whose lifecycle has been implemented approximately
correctly is relation... and even that's incomplete, because RelationUnit's
LeaveScope method fails to take into account the possibility that it might be
responsible for destroying a service as well as just a relation. But, I guess,
we haven't started looking at service destruction yet, so this is maybe not so
bad.

So -- IMO, none of this has yet been implemented to an adequate standard, and
we're not short of detailed work to do. I'm currently focused on subordinates,
which will demand correct implementations of:

* relation destruction (done)
* service destruction (not started)
* unit destruction (not started)

...and if we're doing all those we really should do machine destruction too,
especially considering it'll be easiest, but I'm not even going to think about
that until I've dealt with the rest.

Implementation issues
---------------------

As mentioned, we save ourselves some hassle by restricting machines to become
Dying only when they have no units (rather than when none of their units are
Alive). Fixing this will probably be trivial (we have to solve almost exactly
the same problem for subordinate units wrt their principal), but since we don't
*have* to worry about it yet I'm ignoring it.

But there's a trickier one: as you will no doubt have observed, the unit is
mired deep in dependencies, and a non-functioning unit agent can potentially
paralyse whole chunks of the system [0]. So, we decided a while ago that we
should support forcible unit destruction: that it should be possible to set a
unit to Dead directly.

However, the number of interactions between units and other objects is a little
bit daunting. In the worst case, forcibly destroying a unit and keeping state
consistent may require:

* that the unit leave several relation scopes
* that the unit remove several relations it had been in scope of
* that the unit remove some services on the other end of removed relations

...*and*:

* that all the same things happen for all the unit's subordinates.

To take a pessimistic view of the matter, "just" setting a unit to Dead may
require hundreds of txn operations in a pathological case. However, none of
the alternatives fill me with great joy; I am starting to feel that a big
ugly bottleneck txn (specifically to deal with known-pathological cases) may
in fact be the most reliable and comprehensible way to express the desired
state change. More discussions on this topic will most assuredly come to pass.
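
For a feel of where "hundreds" comes from, here is a back-of-envelope count. It assumes, purely for illustration, one operation each for leaving a scope, removing that relation and removing the service at its far end, plus one to set each unit Dead; real transactions would not map onto operations this neatly, and `unitShape` and `worstCaseOps` are names invented here.

```go
package main

import "fmt"

// unitShape is a hypothetical summary of how entangled a unit is.
type unitShape struct {
	Relations    int         // relation scopes the unit has entered
	Subordinates []unitShape // the same summary for each subordinate
}

// worstCaseOps assumes every relation needs three operations (leave scope,
// remove relation, remove remote service) and every unit one more (set Dead).
func worstCaseOps(u unitShape) int {
	ops := 3*u.Relations + 1
	for _, s := range u.Subordinates {
		ops += worstCaseOps(s)
	}
	return ops
}

func main() {
	// A principal in 10 relations with 10 subordinates, each in 10 relations.
	subs := make([]unitShape, 10)
	for i := range subs {
		subs[i].Relations = 10
	}
	fmt.Println(worstCaseOps(unitShape{Relations: 10, Subordinates: subs}))
}
```

Under those assumptions each of the 11 units costs 31 operations, so the example comes to 341 in total.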

-----

[0] Heh. Frankly, so can a paralysed MA. `juju destroy-machine --force`..? If
unit force destruction seemed complex, the machine case is even worse because
machines can (in theory, if not currently in practice) have several principal
units. No, I haven't figured out how to deal with a paralysed PA either...