~ols-jenkaas-admins/ols-jenkaas/trunk : contents of MANIFESTO.rst at revision 688

Inspiration
===========

Six rules for setting up continuous integration systems
https://rhonabwy.com/2016/01/31/six-rules-for-setting-up-continuous-integration-systems/
captures many of the motivations for what is implemented here. Below are a
few notes and ideas that fit in.


Rule #1: Keep all the logic under version control
=================================================

The only way to get a fully reproducible CI is to make its deployment fully
automated from version-controlled sources.

Rule #2: Leave no test failure unnoticed and unfixed
====================================================

Also, sleep() calls and timeouts should be eagerly monitored, limited to the
bare minimum and carefully tuned.
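
As an illustration of keeping waits minimal and monitored, here is a minimal
sketch of a polling helper that could replace a fixed ``sleep()``. The name
``wait_until`` and its parameters are hypothetical and not part of this
project. It returns as soon as the condition holds, fails loudly at an
explicit deadline, and reports the elapsed time so that waits can be tracked
and tuned.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns true or `timeout` seconds elapse.

    Returns the elapsed time in seconds so callers can log and monitor it.
    Raises TimeoutError if the condition never became true.
    (Hypothetical helper, for illustration only.)
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met after %.2f seconds" % timeout)
        # Short, bounded pause between checks instead of one long sleep.
        time.sleep(interval)
```

A test would then call something like ``wait_until(server_is_up, timeout=10)``
(``server_is_up`` being whatever check the test needs) rather than sleeping
for a fixed, worst-case duration.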

Rule #3: Fast gates directly impact velocity and productivity
=============================================================

Making gates faster is *not* achieved by running fewer tests but by running
them concurrently.

The longest test is the ultimate barrier, not the sum of all test run times.

Only the slowest test needs to be optimized and its workflow
streamlined. Most of the time, it comes down to provisioning the right
testbed.
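
The arithmetic behind this can be sketched as follows. The durations are
made-up numbers and the function names are hypothetical. The scheduler is a
simple greedy one (longest test first, onto the least loaded worker), used
here only to show that the slowest test bounds the gate time once tests run
concurrently.

```python
import heapq


def sequential_gate_time(durations):
    """Gate time when tests run one after the other: the sum."""
    return sum(durations)


def concurrent_gate_time(durations, workers=None):
    """Gate time when tests are spread over `workers` parallel workers.

    With one worker per test (the default), this is simply the duration
    of the slowest test.
    """
    if not durations:
        return 0
    if workers is None:
        workers = len(durations)
    loads = [0] * workers  # total time already assigned to each worker
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical test durations, in seconds.
DURATIONS = [30, 120, 45, 600, 15]
```

With these numbers the sequential gate takes 810 seconds, while the
concurrent gate takes 600 seconds, even with only two workers: nothing
short of speeding up the 600-second test will make the gate faster.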

Rule #4: Everything in CI can be reproduced locally
===================================================

CI itself should never fail.

That is, a developer should never be blocked by a bug in CI, nor should the
deployment pipeline.

That means every failure that can't be reproduced locally (i.e. outside
of CI) is a bug in CI.

Therefore, the same rule should apply to CI itself: every part should be
testable locally.

A consequence is that no Jenkins-specific part should be needed to run
tests and builds.

Keeping it as simple as possible is not optional.

This was quite apparent in the old UE CI, where both problems caused
issues:

- devs were blocked by failures they couldn't reproduce locally,

- the CI infra couldn't be tested locally and needed constant care to stay
  up and running because of "emerging" bugs caused by several weaknesses in
  dependencies, external setup or complicated internal setup which were
  never fixed (and weren't fixed because they couldn't be reproduced and
  therefore never diagnosed properly).

So: Jenkins jobs should produce results from commands happening in
environments air-gapped from Jenkins itself and kept under strict version
control.
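
One small facet of this separation can be sketched in a few lines: run the
job's command through a wrapper that hides Jenkins-provided environment
variables, so the same command behaves identically on a developer machine.
This is only an illustration under stated assumptions, not how this project
isolates its jobs. The function names are hypothetical, the prefix list is
illustrative rather than exhaustive, and a real air gap would rely on
separate machines or containers rather than environment filtering alone.

```python
import os
import subprocess
import sys

# Illustrative, non-exhaustive prefixes of variables a Jenkins job may set
# (for example JENKINS_URL, BUILD_NUMBER, JOB_NAME).
JENKINS_PREFIXES = ("JENKINS_", "BUILD_", "JOB_", "HUDSON_")


def scrubbed_env(env=None):
    """Return a copy of `env` (default: os.environ) without CI variables."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items()
            if not key.startswith(JENKINS_PREFIXES)}


def run_job(command, env=None):
    """Run `command` (a list of arguments) with the scrubbed environment.

    The command itself lives under version control and can be run the
    same way from a shell, without Jenkins.
    """
    return subprocess.run(command, env=scrubbed_env(env),
                          capture_output=True, text=True, check=False)
```

A Jenkins job would then be reduced to calling ``run_job`` on a
version-controlled script, and a developer can make the same call locally
to reproduce a failure.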


Rule #5: Cascade fully automated builds and tests
=================================================

All manual interventions are technical debt.


Rule #6: Metrics
================

Many tests already capture meaningful metrics about basic operations: that's
the code they are exercising in precisely controlled environments.

The trends in the time needed to process merge proposals, build assets,
deploy servers and so on directly define the time between a dev proposing a
fix or a feature and the time it's available to users.

Manual interventions elsewhere in the deployment pipeline can increase that
time, though.
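
As a rough sketch of how such a metric could be derived, the snippet below
computes the total lead time of a change, the time spent in automated
stages, and the remainder, which is where manual waits show up. The stage
names, timestamps and function name are all hypothetical, and stages are
assumed not to overlap.

```python
from datetime import datetime, timedelta


def pipeline_times(stages):
    """Return (lead_time, automated_time, waiting_time) as timedeltas.

    `stages` is a list of (name, start, end) tuples with datetime values,
    assumed not to overlap.
    """
    first_start = min(start for _, start, _ in stages)
    last_end = max(end for _, _, end in stages)
    lead_time = last_end - first_start
    automated_time = sum((end - start for _, start, end in stages),
                         timedelta())
    return lead_time, automated_time, lead_time - automated_time


def at(hour, minute):
    """Build a timestamp on an arbitrary, made-up day."""
    return datetime(2020, 1, 1, hour, minute)


# Hypothetical stages for one change; the gap before "deploy" stands for
# a manual approval.
STAGES = [
    ("merge proposal", at(9, 0), at(9, 20)),
    ("build assets", at(9, 20), at(9, 35)),
    ("deploy servers", at(13, 0), at(13, 10)),
]
```

For these made-up values the automated stages account for 45 minutes of a
lead time of 4 hours 10 minutes; the remaining 3 hours 25 minutes is time
spent waiting between stages.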

75 by Vincent Ladeuil: One more pass on the doc.