Inspiration
===========

Six rules for setting up continuous integration systems
https://rhonabwy.com/2016/01/31/six-rules-for-setting-up-continuous-integration-systems/
captures many of the motivations for what is implemented here. Below are a
few notes and ideas that fit in.


Rule #1: Keep all the logic under version control
=================================================

The only way to get a fully reproducible CI is to fully automate its
deployment from version-controlled sources.

Rule #2: Leave no test failure unnoticed and unfixed
====================================================

In addition, sleep() calls and timeouts should be closely monitored, kept to
the bare minimum and carefully tuned.
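
One way to keep sleeps to the bare minimum is to poll for a condition under
a bounded deadline instead of sleeping for a fixed duration. The sketch below
is only an illustration: ``wait_until`` and its parameters are hypothetical
names, not part of this project. It returns the time actually waited so that
the value can be recorded and the timeout tuned.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns true or `timeout` seconds elapse.

    Returns the number of seconds actually waited, so callers can log it
    and tune `timeout` from observed values rather than guesses.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met after %.2fs" % (time.monotonic() - start))
        # Short, bounded sleep: the total wait never exceeds the deadline
        # by more than one interval.
        time.sleep(interval)
```

A test would then call ``wait_until(lambda: service_is_up())`` (with
``service_is_up`` standing for whatever check the test needs) rather than
``time.sleep(10)``.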

Rule #3: Fast gates directly impact velocity and productivity
=============================================================

Making gates faster is *not* achieved by running fewer tests but by running
them concurrently.

The longest test is the ultimate barrier, not the sum of all test run times.

Only the slowest test needs to be optimized and its workflow
streamlined. Most of the time, it comes down to provisioning the right
testbed.
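
The arithmetic behind this can be sketched as follows. The durations are
made-up numbers and the function names are hypothetical. The scheduling is a
simple greedy longest-first assignment, chosen only for illustration.

```python
import heapq


def serial_time(durations):
    """Wall-clock time when tests run one after the other."""
    return sum(durations)


def concurrent_time(durations, workers):
    """Wall-clock time with `workers` parallel slots (greedy, longest first)."""
    loads = [0] * workers  # total time assigned to each worker (a valid heap)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical test durations, in seconds.
durations = [30, 5, 5, 10, 120]
total = serial_time(durations)
with_two_workers = concurrent_time(durations, workers=2)
```

With enough workers, the result can never drop below the longest single
test, which is why that test is the one worth optimizing.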

Rule #4: Everything in CI can be reproduced locally
===================================================

CI itself should never fail.

That is, a developer should never be blocked by a bug in CI, nor should the
deployment pipeline.

That means every failure that can't be reproduced locally (i.e. outside
of CI) is a bug in CI.

Therefore, the same rule should apply to CI itself: every part should be
testable locally.

A corollary is that no Jenkins-specific part should be needed to run tests
and builds.

Keeping it as simple as possible is not optional.

This was quite apparent in the old Ubuntu Engineering (and later Online
Services) CI, where both problems caused issues:

- devs were blocked by failures they couldn't reproduce locally,

- the CI infra couldn't be tested locally and needed constant care to stay
  up and running because of "emerging" bugs caused by several weaknesses in
  dependencies, external setup or complicated internal setup, which were
  never fixed (because they couldn't be reproduced and were therefore never
  diagnosed properly).

So: Jenkins jobs should produce results from commands running in
environments air-gapped from Jenkins itself and kept under strict version
control.
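
As a rough illustration of the idea (not how this project implements it), a
job can be reduced to a version-controlled command that runs the same way on
a developer machine and under Jenkins. The sketch below only strips
environment variables; real isolation would use a separate container or VM.
The names ``CI_PREFIXES``, ``scrubbed_env`` and ``run_job`` are hypothetical,
and the prefix list is an assumption about what a Jenkins agent injects.

```python
import os
import subprocess
import sys

# Assumption: prefixes of variables a Jenkins agent commonly injects
# (e.g. JENKINS_URL, BUILD_NUMBER, JOB_NAME). Adjust to the actual setup.
CI_PREFIXES = ("JENKINS_", "HUDSON_", "BUILD_", "JOB_")


def scrubbed_env(env):
    """Return a copy of `env` without CI-specific variables."""
    return {k: v for k, v in env.items() if not k.startswith(CI_PREFIXES)}


def run_job(command, env=None):
    """Run `command` so that it cannot depend on Jenkins-provided variables."""
    clean = scrubbed_env(dict(os.environ) if env is None else env)
    return subprocess.run(command, env=clean, capture_output=True, text=True)


if __name__ == "__main__":
    # The same invocation works locally and from a Jenkins job.
    result = run_job([sys.executable, "-c", "print('tests would run here')"])
    sys.stdout.write(result.stdout)
    sys.exit(result.returncode)
```

If a job only passes when the Jenkins variables are present, that dependency
is exactly the kind of failure that cannot be reproduced locally.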


Rule #5: Cascade fully automated builds and tests
=================================================

All manual interventions are technical debt.


Rule #6: Metrics
================

Many tests already capture meaningful metrics about basic operations, since
they exercise that code in precisely controlled environments.

Trends in the time taken to process merge proposals, build assets, deploy
servers and so on directly define the delay between a dev proposing a fix or
a feature and the time it's available to users.

Manual interventions elsewhere in the deployment pipeline can increase that
time further, though.
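
A minimal sketch of such a metric, assuming each pipeline stage is recorded
as a ``(name, duration, is_manual)`` tuple; this record format and the
function names are hypothetical and only serve to show the computation.

```python
from datetime import timedelta


def lead_time(proposed_at, released_at):
    """Time between a change being proposed and being available to users."""
    return released_at - proposed_at


def manual_share(stages):
    """Fraction of the total pipeline time spent in manual stages.

    `stages` is a list of (name, duration, is_manual) tuples, where
    `duration` is a datetime.timedelta.
    """
    total = sum((d for _, d, _ in stages), timedelta())
    manual = sum((d for _, d, is_manual in stages if is_manual), timedelta())
    if not total:
        return 0.0
    # Dividing two timedeltas yields a plain float.
    return manual / total
```

Tracking this share over time shows whether manual steps, rather than
builds and tests, dominate the delay.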