Architecture
============

ols-jenkaas is composed of a jenkins master and slaves providing lxd
containers.

All test jobs run inside an lxd container; only the push to launchpad
occurs on the slaves themselves, so that the needed credentials are never
exposed to test jobs. The only exception is when jobs use docker, in which
case the tests run inside the docker containers but docker itself runs on
the host.

Another container is used to create and update the jobs from a version
controlled branch.


Conventions
===========

The ``{jenkaas}.name`` notation used in this file and the project
documentation refers to the configuration options in ``ols-vms.conf``.

There are two namespaces, ``testing`` and ``production`` to define the
corresponding jenkaas setups, so, for example, ``{jenkaas}.slaves`` refers
to the slaves and will resolve to ``production.slaves`` or
``testing.slaves`` depending on the context.
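The resolution can be sketched as follows (illustrative only; the
``resolve`` helper is hypothetical and not part of ols-vms):

```python
# Illustrative sketch of the {jenkaas} namespace convention; the helper
# name and mechanics are hypothetical, not actual ols-vms code.
def resolve(option, namespace):
    """Expand the {jenkaas} placeholder to the active namespace."""
    return option.replace("{jenkaas}", namespace)

print(resolve("{jenkaas}.slaves", "production"))  # production.slaves
print(resolve("{jenkaas}.slaves", "testing"))     # testing.slaves
```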


Production
==========

ols-jenkaas is deployed in production at:
https://jenkins.ols.canonical.com/online-services/

The internal access (requires VPN) to create/update jobs is at:
http://online-services-jenkins-be.internal:8080/online-services/

The people administering ols-jenkaas are:
https://launchpad.net/~ubuntuone-hackers

The bot used to create/update the jobs is: https://launchpad.net/~ols-jb-bot

The bot owning the trunk and the release branches is:
https://launchpad.net/~otto-copilot

Depending on your participation in the various teams, you may or may not
have the needed credentials to administer the jenkaas instances, create or
update jobs, run them, approve branches and land them.

But everybody can set up a local jenkaas for testing purposes (see `Test
setup`_).

More administrative details are available in ``doc/admin.rst``.


Jobs
====

Sooo, jenkins is about running jobs.

The jobs for jenkaas are defined here and created/updated in jenkins with
jenkins-job-builder from the openstack project
(http://docs.openstack.org/infra/jenkins-job-builder/).

``ols-job-builder`` in ``ols-vms.conf`` is the container to use to create,
update and delete the jobs.
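For reference, a minimal jenkins-job-builder job definition looks like this
(a generic sketch, not one of the actual jenkaas jobs):

```yaml
- job:
    name: example-job
    description: 'Minimal example; not an actual jenkaas job.'
    builders:
      - shell: 'echo "hello from jenkins-job-builder"'
```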

Setting up ``ols-job-builder`` requires the credentials for
``{{jenkaas}.bot.user}``. They need to be retrieved from the jenkins master.

Authenticating is a bit of a challenge, as ols-jb-bot can't get past
apache-openid. First, log into login.ubuntu.com as yourself, then visit
https://jenkins.ols.canonical.com/ to get a cookie for apache-openid. Once
there, log out of login.ubuntu.com, grab ols-jb-bot's credentials from enigma,
then hit Jenkins' "log in" button to log in as ols-jb-bot. From there, you can
go to the configure page for this user, reveal and copy the API token and
finally::

  $ ols-vms config ols-job-builder production.bot.password=<API Token>

You're now ready to create your job builder::

  $ ols-vms setup ols-job-builder

You're good to go: edit job descriptions under ``jobs/`` then run::

  ubuntu@ols-job-builder:~/jenkaas$ jenkins-jobs update --delete-old jobs/

Note: this is a WIP and only applies to members of
https://launchpad.net/~ols-jenkaas-admins who have the needed credentials;
see `Test setup`_ for how to set up a test environment.


Test setup
==========

Pre-requisites::

  $ sudo add-apt-repository --update ppa:ubuntuone/ols-tests
  $ sudo apt-get install wget ssh lxd lxd-client python3-ols-vms bzr-olsvms

If you've never used lxd before::

  $ sudo adduser $USER lxd
  $ newgrp lxd
  $ sudo lxd init --auto
  $ sudo dpkg-reconfigure -p medium lxd

and follow the instructions to set up a bridge, NAT'ed for IPv4 (IPv6 is
not needed); zfs is optional but great.

Since the slaves interact with launchpad, they are configured via
``{{jenkaas}.landing.user}`` for the launchpad login (and for commits from
``{{jenkaas}.landing.fullname}`` and ``{{jenkaas}.landing.email}``).

The slaves also need an OAuth token and an ssh key pair (public and
private), the public key being uploaded to launchpad (TBD).

For tests, setting ``{{jenkaas}.landing.user}``,
``{{jenkaas}.landing.fullname}`` and ``{{jenkaas}.landing.email}`` to yourself and
being a jenkins master admin is the most convenient. The job builder uses
the jenkins 'admin' user since the API token is already available.

To allow the slaves to access launchpad branches and your current jenkaas
branch, you need to generate a password-less key just above your branch::

  $ ssh-keygen -f ../jenkins@jenkins-slave-ols-testing -N '' -t rsa -b 4096 -C jenkins@jenkins-slave-ols-testing

Upload the public part to your launchpad account
https://launchpad.net/~/+editsshkeys.

Add the public part to your ~/.ssh/authorized_keys file.

Check that the pre-requisites are ok (report bugs if needed)::

  $ ./testing/pre-requisites

Run the script::

  $ ./testing/setup-jenkaas

This will end by displaying the jenkins url, something like::

  Jenkins master is at http://192.168.0.xxx:8080

Note that the ``setup-jenkaas`` script uses ``ols-vms config`` commands to
update the ~/.config/ols-vms/ols-vms.conf file, *not* the ols-vms.conf file
in the current directory which is under version control.

This makes more sense when you look at the config output::

  $ ols-vms config ols-job-builder-testing

See `Jobs`_ for setting up the job builder before continuing.

``ols-job-builder`` should be configured for production while
``ols-job-builder-testing`` should be configured for local testing.

See ``doc/secrets`` if you need to deal with landings and jobs requiring
secrets.


Pending issues
==============

jenkins
=======

IRC bot reporting
-----------------

There is currently no failure reporting to appropriate irc channels.

jenkins UI
----------

- Developers should be able to see the workspace
- API access should be granted for managing views


jenkins slave
=============

upstart
-------

In at least one case, the slave failed to start with 'Error: Invalid or
corrupt jarfile /var/run/jenkins/slave.jar' in
/var/log/upstart/jenkins-slave.log

Indeed the file was empty.

This may be related to downloading it with 'wget -O' (which clobbers the
output file *before* attempting a download) and attempting to do so when the
master is not up yet (and not retrying enough?).

In any case, this needs to be fixed in the jenkins-slave package and tested
to confirm the above.

It looks like this happened on several jenkaas instances in production and
may be worth reporting if the diagnosis is correct and the download can be
made more reliable.

ssh
---

Since the jenkins slave package is not available for xenial and has the
above failure mode, maybe switching to ssh instead of JNLP to connect the
slaves would take care of both issues.

xenial vs trusty
----------------

We need xenial for:

- zfs (speeds up worker creation, down to seconds)
- snaps in lxd (unless we can get yakkety slaves?)


views
-----

All views are managed manually through the jenkins UI. We need API access to
be able to create/update views specific to each project (including one for
jenkaas itself).

webhooks
--------

Receiving webhooks on jenkins requires writing some java code. There is an
existing plugin for github that could be used as a starting point (the
comments on the issues are not encouraging though :-/).

It would probably be simpler, cleaner and more reliable to just have a
python app to receive the webhooks and trigger the jobs. See `brain`_.

job triggering
==============

Most jobs are triggered when some condition is verified. The 'trigger-X'
jobs run every 5 minutes.

There are two main event families we want to react to:

branch
------

If a branch is created or updated, it means a dev may want to run some
tests.

Only changes to known branches are handled for now.

review
------

If a review is created or updated, it means one or several devs agreed that
the associated branch should pass some tests (this may happen during the
review discussion or after a specific stage (top approved)).


In both cases, launchpad (or github) webhooks provide such events. We used
to poll for those events for ubuntuone, partly from tarmac, partly from
specific code. This caused races and created noise *by design*; time to move
on ;)

<cough> until we get webhooks, approved proposals are checked every 5 minutes.
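The polling check can be sketched as pure logic (the field names below are
hypothetical; the real check goes through the launchpad API):

```python
# Hedged sketch of the trigger condition; field names are hypothetical,
# the real implementation queries the launchpad API.
def needs_test_run(proposal, last_tested_revision):
    """An approved proposal whose branch moved since the last run needs testing."""
    return (proposal["status"] == "Approved"
            and proposal["revision"] != last_tested_revision)

proposals = [
    {"id": 1, "status": "Approved", "revision": "r12"},
    {"id": 2, "status": "Needs review", "revision": "r7"},
]
# Run every 5 minutes; trigger the matching jenkins job for each hit.
to_trigger = [p["id"] for p in proposals if needs_test_run(p, "r11")]
print(to_trigger)  # [1]
```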


brain
=====

Jenkins needs to stay as dumb as possible. The less it does, the better the
chance it'll do it well and reliably.

That's the #1 reason to not use more plugins than strictly necessary.

jenkins runs jobs and keeps test results (rotated as needed).

The scheduling is: one job on one slave at a time; scaling slaves
horizontally enhances the ci bandwidth.

This makes the jobs simpler to write: "I have all the resources" is
simpler than "I should share with many foreigners doing unknown things".

This also makes the scheduling simpler: one executor per slave. Done.

So the "brain" should be elsewhere.

All jobs can still be run manually so the brain can be down without blocking
the service.

It can receive webhooks from launchpad and github, and there are plenty of
wsgi and flask webhook handlers on github, for github:

- https://github.com/carlos-jenkins/python-github-webhooks
- https://github.com/razius/github-webhook-handler
- https://github.com/bloomberg/python-github-webhook/blob/master/github_webhook/webhook.py

There may be others...

http://eli.thegreenplace.net/2014/07/09/payload-server-in-python-3-for-github-webhooks
seems to capture the smallest implementation.
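Along those lines, a minimal receiver can be sketched with only the
standard library. The signature header follows github's documented
HMAC-SHA256 scheme; the secret, port and dispatch logic are assumptions,
not the actual brain:

```python
import hashlib
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SECRET = b"change-me"  # assumption: shared secret configured on the webhook

def signature_ok(secret, payload, header_value):
    """Validate github's X-Hub-Signature-256 header (HMAC-SHA256 of the body)."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if not signature_ok(SECRET, body, self.headers.get("X-Hub-Signature-256", "")):
            self.send_response(403)
            self.end_headers()
            return
        event = json.loads(body)
        # Here the brain would map the event to a jenkins job and trigger it,
        # e.g. by POSTing to the job's remote build URL (left out of this sketch).
        self.send_response(200)
        self.end_headers()

# To run it: HTTPServer(("", 8000), WebhookHandler).serve_forever()
```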

remote jobs
===========

Some tests happen on different CI sites; we may need to import them, trigger
them or react to their successes or failures.