// Copyright 2015 Canonical Ltd.
// Licensed under the AGPLv3, see LICENCE file for details.

The dependency package exists to address a general problem with shared resources
and the management of their lifetimes. Many kinds of software handle these issues
with more or less felicity, but it's particularly important that juju (which is
a distributed system that needs to be very fault-tolerant) handle them clearly

A cursory examination of the various workers run in juju agents (as of 2015-04-20)
reveals a distressing range of approaches to the shared resource problem. A
sampling of techniques (and their various problems) follows:

  * enforce sharing in code structure, either directly via scoping or implicitly
    via nested runners (state/api conns; agent config)
      * code structure is inflexible, and it enforces strictly nested resource
        lifetimes, which are not always adequate.
  * just create N of them and hope it works out OK (environs)
      * creating N prevents us from, e.g., using a single connection to an environ
        and sanely rate-limiting ourselves.
  * use filesystem locking across processes (machine execution lock)
      * implementation sometimes flakes out, or is used improperly; and multiple
        agents *are* a problem anyway, but even if we're all in-process we'll need
        some shared machine lock...
  * wrap workers to start up only when some condition is met (post-upgrade
    stability -- itself also a shared resource)
      * lifetime-nesting comments apply here again; *and* it makes it harder to
  * implement a singleton (lease manager)
      * singletons make it *even harder* to figure out what's going on -- they're
        basically just fancy globals, and have all the associated problems with,
        e.g. deadlocking due to unexpected shutdown order.

...but, of course, they all have their various advantages:

  * Of the approaches, the first is the most reliable by far. Despite the
    inflexibility, there's a clear and comprehensible model in play that has yet
    to cause serious confusion: each worker is created with its resource(s)
    directly available in code scope, and trusts that it will be restarted by an
    independent watchdog if one of its dependencies fails. This characteristic is
    extremely beneficial and must be preserved; we just need it to be more

  * The create-N-Environs approach is valuable because it can be simply (if
    inelegantly) integrated with its dependent worker, and a changed Environ
    does not cause the whole dependent to fall over (unless the change is itself
    bad). The former characteristic is a subtle trap (we shouldn't be baking
    dependency-management complexity into the cores of our workers' select loops,
    even if it is "simple" to do so), but the latter is important: in particular,
    firewaller and provisioner are distressingly heavyweight workers and it would
    be unwise to take an approach that led to them being restarted when not

  * The filesystem locking just should not happen -- and we need to integrate the
    unit and machine agents to eliminate it (and for other reasons too) so we
    should give some thought to the fact that we'll be shuffling these dependencies
    around pretty hard in the future. If the approach can make that task easier,

  * The singleton is dangerous specifically because its dependency interactions are
    unclear. Absolute clarity of dependencies, as provided by the nesting approaches,
    is in fact critical; but the sheer convenience of the singleton is alluring, and
    reminds us that the approach we take must remain easy to use.

The various nesting approaches give easy access to directly-available resources,
which is great, but will fail as soon as you have a sufficiently sophisticated
dependent that can operate usefully without all its dependencies being satisfied
(we have a couple of requirements for this in the unit agent right now). Still,
direct resource access *is* tremendously convenient, and we need some way to
access one service from another.

However, all of these resources are very different: for a solution that encompasses
them all, you kinda have to represent them as interface{} at some point, and that's
very risky re: clarity.

The package is intended to implement the following developer stories:

  * As a developer trying to understand the codebase, I want to know what workers
    are running in an agent at any given time.
  * As a developer, I want to be prevented from introducing dependency cycles
  * As a developer, I want to provide a service provided by some worker to one or
  * As a developer, I want to write a service that consumes one or more other
  * As a developer, I want to choose how I respond to missing dependencies.
  * As a developer, I want to be able to inject test doubles for my dependencies.
  * As a developer, I want control over how my service is exposed to others.
  * As a developer, I don't want to have to typecast my dependencies from
  * As a developer, I want my service to be restarted if its dependencies change.

That last one might bear a little bit of explanation: but I contend that it's the
only reliable approach to writing resilient services that compose sanely into a
comprehensible system. Consider:

  * Juju agents' lifetimes must be assumed to exceed the MTBR of the systems
    they're deployed on; you might naively think that hard reboots are "rare"...
    but they're not. They really are just a feature of the terrain we have to
    traverse. Therefore every worker *always* has to be capable of picking itself
    back up from scratch and continuing sanely. That is, we're not imposing a new
    expectation: we're just working within the existing constraints.
  * While some workers are simple, some are decidedly not; when a worker has any
    more complexity than "none" it is a Bad Idea to mix dependency-management
    concerns into its core logic: it creates the sort of morass in which subtle

So, we take advantage of the expected bounce-resilience, and excise all dependency
management concerns from the existing workers... in favour of a system that bounces
workers slightly more often than before, and thus exercises those code paths more;
so, when there are bugs, we're more likely to shake them out in automated testing
before they hit users.

We'd maybe also like to implement this story:

  * As a developer, I want to add and remove groups of workers atomically, e.g.
    when starting the set of controller workers for a hosted environ; or when
    starting the set of workers used by a single unit. [NOT DONE]

...but there's no urgent use case yet, and it's not certain to be superior to an
engine-nesting approach.

Run a single dependency.Engine at the top level of each agent; express every
shared resource, and every worker that uses one, as a dependency.Manifold; and
install them all into the top-level engine.

When installed under some name, a dependency.Manifold represents the features of
a node in the engine's dependency graph. It lists:

  * The names of its dependencies (Inputs).
  * How to create the worker representing the resource (Start).
  * How (if at all) to expose the resource as a service to other resources that
    know it by name (Output).

...and allows the developers of each independent service a common mechanism for
declaring and accessing their dependencies, and the ability to assume that they
will be restarted whenever there is a material change to their accessible

When the weight of manifolds in a single engine becomes inconvenient, group them
and run them inside nested dependency.Engines; the Report() method on the top-
level engine will collect information from (directly-) contained engines, so at
least there's still some observability; but there may also be call to pass
actual dependencies down from one engine to another, and that'll demand careful

In each worker package, write a `manifold.go` containing the following:

    // ManifoldConfig holds the information necessary to configure the worker
    // controlled by a Manifold.
    type ManifoldConfig struct {

        // The names of the various dependencies, e.g.
        APICallerName   string
        MachineLockName string

        // Any other required top-level configuration, e.g.
        Period time.Duration // type assumed here; use whatever your worker needs
    }

    // Manifold returns a manifold that controls the operation of a worker
    // responsible for <things>, configured as supplied.
    func Manifold(config ManifoldConfig) dependency.Manifold {
        return dependency.Manifold{

            // * certainly include each of your configured dependency names,
            //   getResource will only expose them if you declare them here.
            Inputs: []string{config.APICallerName, config.MachineLockName},

            // * certainly include a start func, it will panic if you don't.
            Start: func(getResource dependency.GetResourceFunc) (worker.Worker, error) {
                // You presumably want to get your dependencies, and you almost
                // certainly want to be closed over `config`...
                var apicaller base.APICaller
                if err := getResource(config.APICallerName, &apicaller); err != nil {
                    return nil, err
                }
                return newSomethingWorker(apicaller, config.Period)
            },

            // * output func is not obligatory, and should be skipped if you
            //   don't know what you'll be exposing or to whom.
            // * see `worker/gate`, `worker/util`, and
            //   `worker/dependency/testing` for examples of output funcs.
            // * if you do supply an output func, be sure to document it on the
            //   Manifold func; for example:
            //
            //       // Manifold exposes Foo and Bar resources, which can be
            //       // accessed by passing a *Foo or a *Bar in the output
            //       // parameter of its dependencies' getResource calls.
        }
    }

...and take care to construct your manifolds *only* via that function; *all*
your dependencies *must* be declared in your ManifoldConfig, and *must* be
accessed via those names. Don't hardcode anything, please.

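As a rough illustration of what an output func might look like, here is a minimal self-contained sketch. It assumes an output func receives the running worker plus a pointer supplied by the dependent; `Worker`, `Foo`, `fooWorker` and `outputFunc` are hypothetical stand-ins defined locally so the example compiles on its own, and are not the package's real types. Note that it returns errors for unexpected types rather than panicking.

```go
package main

import "fmt"

// Worker is a local stand-in for worker.Worker (an assumption made so that
// this sketch builds without the juju packages).
type Worker interface {
	Kill()
	Wait() error
}

// Foo is a hypothetical resource that the worker exposes to its dependents.
type Foo interface {
	Frobnicate() string
}

// fooWorker is a hypothetical worker that also implements Foo.
type fooWorker struct{}

func (w *fooWorker) Kill()              {}
func (w *fooWorker) Wait() error        { return nil }
func (w *fooWorker) Frobnicate() string { return "frobnicated" }

// outputFunc copies the resource into the pointer supplied by the dependent,
// and reports (rather than panics on) any unexpected input or output type.
func outputFunc(in Worker, out interface{}) error {
	inWorker, ok := in.(*fooWorker)
	if !ok {
		return fmt.Errorf("expected *fooWorker input; got %T", in)
	}
	switch outPointer := out.(type) {
	case *Foo:
		*outPointer = inWorker
	default:
		return fmt.Errorf("expected *Foo output; got %T", out)
	}
	return nil
}

func main() {
	// A dependent asks for a Foo by passing a *Foo.
	var foo Foo
	if err := outputFunc(&fooWorker{}, &foo); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(foo.Frobnicate())

	// A dependent passing the wrong pointer type gets an error, not a panic.
	var wrong string
	fmt.Println(outputFunc(&fooWorker{}, &wrong))
}
```

The type switch keeps all the interface{} handling in one place, so the dependents themselves never need to typecast anything.
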
If you find yourself using the same manifold configuration in several places,
consider adding helpers to cmd/jujud/agent/engine, which includes mechanisms
for simple definition of manifolds that depend on an API caller; on an agent;

The `worker/dependency/testing` package, commonly imported as "dt", exposes a
`StubResource` that is helpful for testing `Start` funcs in decent isolation,
with mocked dependencies. Tests for `Inputs` and `Output` are generally pretty
specific to their precise context and don't seem to benefit much from

Special considerations
----------------------

Your *dependency* graph must be acyclic; this does not imply that
the *information flow* must be acyclic. Indeed, it is common for separate
components to need to synchronise their actions; but the implementation of
Engine makes it inconvenient for either one to depend on the other (and
impossible for both to do so).

When a set of manifolds need to encode a set of services whose information flow
is not acyclic, apparent A->B->A cycles can be broken by introducing a new
shared dependency C to mediate the information flow. That is, A and B can then
separately depend upon C; and C itself can start a degenerate worker that never
errors of its own accord.

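To make the shape of that fix concrete, the following sketch checks two hypothetical input graphs for cycles with a plain depth-first search. The `inputs` type and `hasCycle` func are illustrative names only, and this is not a claim about how Engine itself validates manifolds; it just shows that A and B depending on each other is cyclic, while A and B both depending on a mediator C is not.

```go
package main

import "fmt"

// inputs maps each hypothetical manifold name to the names it depends on.
type inputs map[string][]string

// hasCycle reports whether the dependency graph contains a cycle, using a
// depth-first search that tracks which nodes are currently being visited.
func hasCycle(graph inputs) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	var visit func(name string) bool
	visit = func(name string) bool {
		switch state[name] {
		case inProgress:
			// We came back to a node that is still on the search path.
			return true
		case done:
			return false
		}
		state[name] = inProgress
		for _, dep := range graph[name] {
			if visit(dep) {
				return true
			}
		}
		state[name] = done
		return false
	}
	for name := range graph {
		if visit(name) {
			return true
		}
	}
	return false
}

func main() {
	// A and B each want information from the other: a dependency cycle.
	direct := inputs{"A": {"B"}, "B": {"A"}}
	// The same information flow, mediated by a shared dependency C.
	mediated := inputs{"A": {"C"}, "B": {"C"}, "C": nil}

	fmt.Println(hasCycle(direct))   // prints "true"
	fmt.Println(hasCycle(mediated)) // prints "false"
}
```
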
For examples of this technique, search for `cmd/jujud/agent/engine.NewValueWorker`
(which is generally used inside other manifolds to pass snippets of agent config
down to workers that don't have a good reason to see, or write, the full agent
config); and `worker/gate.Manifold`, which is for one-way coordination between
workers which should not be started until some other worker has completed some

Please be careful when coordinating workers like this; the gate manifold in
particular is effectively just another lock, and it'd be trivial to construct
a set of gate-users that can deadlock one another. All the usual considerations
when working with locks still apply.

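For readers unfamiliar with the idea, a one-way gate can be sketched as a latch built on a channel that is closed exactly once. The `gate` type and its methods below are hypothetical and written only for illustration; they are not taken from `worker/gate`. The sketch also hints at the deadlock risk: if two workers each waited on a gate that only the other could unlock, neither would ever proceed.

```go
package main

import (
	"fmt"
	"sync"
)

// gate is a hypothetical one-way latch: once unlocked, it stays unlocked.
type gate struct {
	once sync.Once
	ch   chan struct{}
}

func newGate() *gate { return &gate{ch: make(chan struct{})} }

// Unlock opens the gate; repeated calls are harmless.
func (g *gate) Unlock() { g.once.Do(func() { close(g.ch) }) }

// Unlocked returns a channel that is closed once the gate has been opened.
func (g *gate) Unlocked() <-chan struct{} { return g.ch }

// IsUnlocked reports whether the gate has been opened, without blocking.
func (g *gate) IsUnlocked() bool {
	select {
	case <-g.ch:
		return true
	default:
		return false
	}
}

func main() {
	upgradeDone := newGate()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// The waiting worker blocks here until some other worker unlocks.
		<-upgradeDone.Unlocked()
		fmt.Println("waiter started")
	}()

	fmt.Println("unlocked before:", upgradeDone.IsUnlocked())
	upgradeDone.Unlock()
	wg.Wait()
	fmt.Println("unlocked after:", upgradeDone.IsUnlocked())
}
```
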
Concerns and mitigations thereof
--------------------------------

The dependency package will *not* provide the following features:

  * Deterministic worker startup. As above, this is a blessing in disguise: if
    your workers have a problem with this, they're using magical undeclared
    dependencies and we get to see the inevitable bugs sooner.
    TODO(fwereade): we should add fuzz to the bounce and restart durations to
    more vigorously shake out the bugs...
  * Hand-holding for developers writing Output funcs; the onus is on you to
    document what you expose; produce useful error messages when they are supplied
    with unexpected types via the interface{} param; and NOT to panic. The onus
    on your clients is only to read your docs and handle the errors you might