====================================================
Tricks used while debugging juju/juju-deployer stuff
====================================================

Set up a public IP for the juju state server
============================================

Used when trying to reach juju status without setting up ~/.ssh/config to
proxy via chinstrap.canonistack.com.
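
For reference, the ~/.ssh/config setup being avoided looks roughly like this
(the bastion host name, user and subnet are assumptions, adjust to your
setup):

    Host 10.55.*
        # Proxy all canonistack traffic through the bastion host.
        ProxyCommand ssh chinstrap.canonistack.com nc %h %p
        User ubuntu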

Declare a public IP address where we can connect to the juju state server on
canonistack via ssh.

$ OS_REGION_NAME=lcy02 nova list
+--------------------------------------+----------------------+--------+------------+-------------+--------------------------+
| ID                                   | Name                 | Status | Task State | Power State | Networks                 |
+--------------------------------------+----------------------+--------+------------+-------------+--------------------------+
| a5f350da-b576-4e4e-be7e-ee639cbb9683 | juju-lcy02-machine-0 | ACTIVE | None       | Running     | canonistack=10.55.32.186 |
+--------------------------------------+----------------------+--------+------------+-------------+--------------------------+

The ID above (a5f350da-b576-4e4e-be7e-ee639cbb9683) is the juju state server.

We need a public IP to reach it:

$ OS_REGION_NAME=lcy02 nova floating-ip-create
+---------------+-------------+----------+-------+
| Ip            | Instance Id | Fixed Ip | Pool  |
+---------------+-------------+----------+-------+
| 162.213.35.63 | None        | None     | lcy02 |
+---------------+-------------+----------+-------+

Assign that IP address to the juju state server:

$ OS_REGION_NAME=lcy02 nova add-floating-ip a5f350da-b576-4e4e-be7e-ee639cbb9683 162.213.35.63

This was used to try to get juju-deployer working, but it didn't work in the
end: canonistack has a known issue where public IPs are not reachable from
inside the region.

This was diagnosed by connecting to one of the juju instances:

$ ssh 10.55.32.111

ubuntu@juju-lcy02-machine-1:~$ cd /var/log/juju
ubuntu@juju-lcy02-machine-1:/var/log/juju$ tail machine-1.log

2013-12-17 17:52:07 INFO juju runner.go:245 worker: restarting "api" in 3s
2013-12-17 17:52:10 INFO juju runner.go:253 worker: start "api"
2013-12-17 17:52:10 INFO juju apiclient.go:111 state/api: dialing "wss://162.213.35.63:17070/"
2013-12-17 17:53:13 ERROR juju apiclient.go:116 state/api: websocket.Dial wss://162.213.35.63:17070/: dial tcp 162.213.35.63:17070: connection timed out

This shows that canonistack can't reach a public IP from inside the cloud itself :-/

Make sure juju tears down stuff properly
========================================

After 'juju destroy-environment', make sure 'nova list' also comes back
empty:
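
$ juju destroy-environment
$ nova list  # should show no instances once teardown completes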


amulet thinks it has charms in the current directory
====================================================

We need:
vila:~/ci/ubuntu-ci-services-itself/trunk :) $ ln -s precise/python-django django

OMG ! I can't connect with ssh to my fresh nova instances !!!
=============================================================

The symptom: the default security group has lost all its rules.

$ nova secgroup-list-rules default

The remedy:

$ nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
$ nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

$ nova secgroup-list-rules default
+-------------+-----------+---------+-----------+--------------+
| IP Protocol | From Port | To Port | IP Range  | Source Group |
+-------------+-----------+---------+-----------+--------------+
| tcp         | 22        | 22      | 0.0.0.0/0 |              |
| icmp        | -1        | -1      | 0.0.0.0/0 |              |
+-------------+-----------+---------+-----------+--------------+

This has happened to psivaa and vila. When it was mentioned to elmo, he had
never heard of a security group getting cleaned up that way, but pointed a
(gentle) finger at juju, which *may* have cleaned the wrong security group.
If you experience this, please share so we can hopefully diagnose it properly.


tracking juju state server changes
==================================

Assuming credentials and a bootstrapped/sshuttle'd juju.

$ watch -d -n 3 juju status


the pretty juju gui
===================

Assuming credentials and a bootstrapped/sshuttle'd juju.

$ firefox https://<juju-0-IP>

Enter your ~/.juju/environments.yaml admin-secret as the password.

Accessing services from public addresses
========================================

Sometimes you want to share access to a component with another engineer for
debugging. The following steps will give that component a public IP address and
expose it to the wider Internet. Please be careful to clean up unused public
IPs. They are a scarce resource.

$ IP=$(euca-allocate-address)
$ juju status # note the IP of the service you're after, save it to SERVICE_IP.
$ euca-describe-instances | grep $SERVICE_IP # note the instance ID (i-xxxxxx).
$ euca-associate-address -i $INSTANCE_ID $IP
$ juju expose $SERVICE_NAME # Tell juju to allow access.

Getting a Python stacktrace from a running service
==================================================

Nearly all the Python-based services in this project have a SIGQUIT handler to
dump a Python stack trace. To trigger this, run:

$ kill -SIGQUIT $PROCESS_ID

Where $PROCESS_ID is the PID of the process you want a stack trace for. The
trace will be written to stderr. If this is an upstart-spawned service, this
can be found in /var/log/upstart/$service.log.
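
For reference, such a handler can be as small as this (an illustrative
sketch, not necessarily the services' actual code):

    import signal
    import sys
    import traceback

    def dump_stack(signum, frame):
        # Write the interrupted frame's stack to stderr; for an
        # upstart-spawned service, stderr ends up in
        # /var/log/upstart/$service.log.
        traceback.print_stack(frame, file=sys.stderr)

    signal.signal(signal.SIGQUIT, dump_stack)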

Iterating on a charm in a deployed environment
==============================================

If the charm doesn't exist under charms/precise, you'll need to branch it
there. Then:

$ JUJU_REPOSITORY=charms juju upgrade-charm ts-django

This will cause the upgrade-charm hook to fire. It will not cause the install
hook, config-changed hook, or any relation hooks to fire. The charm should call
these under upgrade-charm as needed.

To re-fire the relation hooks, upgrade the charm, then remove and add back
the relation, as shown below.
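
For example (the service pair here is hypothetical, bounce the relation you
actually care about):

$ juju destroy-relation ts-django ppa-postgres
$ juju add-relation ts-django ppa-postgres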

If the upgrade fails because of a transient problem, you can try it again.

$ juju resolved --retry ts-django/0

If the upgrade fails because of a bug in your code, you can iterate in this
loop.

# Tell juju to upgrade the charm, even if it's in a failed state.
$ JUJU_REPOSITORY=charms JUJU_ENV=lcy2 juju upgrade-charm ts-gunicorn --force

# Retry the failed upgrade.
$ JUJU_REPOSITORY=charms JUJU_ENV=lcy2 juju resolved --retry ts-gunicorn/0

Deploy and test single component
================================

If you want to test a single component without having to deploy the entire
Airline:

export CI_PPA=ppa:canonical-ci-engineering/ci-airline-phase-0
export CI_BRANCH=lp:ubuntu-ci-services-itself
export CI_CODE_SOURCE=branch

cheetah fill --env juju-deployer/configs/unit_config.yaml.tmpl -p > juju-deployer/configs/unit_config.yaml
cheetah fill --env juju-deployer/ticket-system.yaml.tmpl -p > juju-deployer/ticket-system.yaml

# Build charmhelpers where applicable.
for x in charms/precise/*; do (cd $x; [ -e Makefile ] && make); done

JUJU_REPOSITORY=$(readlink -e charms) juju-deployer -v -W -c juju-deployer/ticket-system.yaml ci-airline-staging

Cleaning up security groups after using juju-deployer -T
========================================================

When you terminate a machine in Juju, it does not reuse the machine identifier.
This ends up creating a lot of security groups unless you have set
'firewall-mode: global' in your environments.yaml. You can safely clean these
up at any time as nova will not delete security groups that are in use.

for x in $(nova secgroup-list | tail -n +4 |awk '{ print $2 }' | grep -v '^$'); do nova secgroup-delete $x; done

Working on multiple regions/clouds at the same time
===================================================

 - Use tmux, screen, or just multiple terminal emulators to have a window open
   for each region/cloud. It's also advisable to split the window with one
   side running a watch of the created instances.

   $ watch 'nova list --fields networks,status 2>/dev/null'

 - In each window, source the novarc or ~/.hpcloud-rc for that region/cloud.

 - In each window, export JUJU_ENV=$ENV, where $ENV is an environment created
   in environments.yaml for that region. This is needed for Canonistack because
   Swift is currently shared across regions, so you need a different control
   bucket name.

 - In each window, but only for the Canonistack regions, set up an sshuttle
   for each subnet in that region.

   For lcy01:

   $ sshuttle -D -r ubuntu@10.55.60.77 10.55.60.0/24 -e "ssh -o UserKnownHostsFile=/dev/null"
   $ sshuttle -D -r ubuntu@10.55.60.77 10.55.61.0/24 -e "ssh -o UserKnownHostsFile=/dev/null"

   For lcy02:

   $ sshuttle -D -r ubuntu@10.55.32.219 10.55.32.0/24 -e "ssh -o UserKnownHostsFile=/dev/null"

Cleaning up public IPs on HP Cloud
==================================

Juju does not automatically delete public IP addresses when they're no longer
in use. These need to be cleaned up before running deploy.py or you may run
out, which will cause Juju to fail mid-deployment.

# This only deletes addresses that are not associated with an instance.
$ for x in $(nova floating-ip-list | awk '$4 == "None" { print $2 }'); do nova floating-ip-delete $x; done

Making juju status useful again
===============================

Our deployment is now too big for 'watch -n 5 juju status' to be readable,
but 'juju status' can filter on service names:

$ juju status bsb-worker
environment: hp
machines:
  "21":
    agent-state: started
    agent-version: 1.17.4
    dns-name: 15.125.100.252
    instance-id: 305ddc5b-aa77-4d1f-9087-780f8335f6ba
    instance-state: ACTIVE
    series: precise
    hardware: arch=amd64 cpu-cores=1 mem=1024M root-disk=10240M
services:
  bsb-worker:
    charm: local:precise/rabbitmq-worker-0
    exposed: false
    relations:
      amqp:
      - rabbit
    units:
      bsb-worker/3:
        agent-state: started
        agent-version: 1.17.4
        machine: "21"
        public-address: 15.125.100.252

So 'watch -n 5 juju status <regexp>' can be used to monitor a subset of the
deployment and is a useful learning tool for seeing how machines, units and
services come up.
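
For example:

$ watch -n 5 juju status bsb-worker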

Recovering from cloud provider failures in Juju
===============================================

Juju can encounter errors in some providers that it cannot recover from:

"2":
    agent-state-info: '(error: cannot allocate a public IP as needed: failed to list
      floating ips

      caused by: Maximum number of attempts (3) reached sending request to https://region-a.geo-1.compute.hpcloudsvc.com/v2/11597203075020/os-floating-ips\)'
    instance-id: pending
    series: precise

To avoid having to redeploy, you can remove the offending machine and add a
new unit in its place:

$ juju terminate-machine 2 --force
$ juju add-unit bsb-worker

This works because relations are independent of individual instances. A
relation can exist before any units do.

Testing hot fixes on a specific instance
========================================

If you want to test a specific fix for our code base (NOT a charm fix), you
can set up a branch on the instance and merge public branches into it.

Using the test runner as an example, the process is to:
- turn the payload into a branch,
- acquire additional branches,
- merge branches into the payload,
- restart the service.


Turning the payload into a branch
---------------------------------

Go to the instance and become root:

$ juju ssh tr-rabbit-worker/0 
$ sudo -H bash

Set up a bzr shared repo so we don't need to download the whole history for
each branch we'll use.

# bzr whoami "bugfixer <bugfixer@example.com>"
# cd /srv
# bzr init-repo .
# bzr branch lp:ubuntu-ci-services-itself trunk

Really turn the payload into a branch:

# cp -r trunk/.bzr tr_rabbit_worker/
# cd tr_rabbit_worker
# bzr st
# bzr commit -m "Uploaded payload as of $(date)"

From there you have a real branch and can use the usual commands to check
its state: 'bzr status', 'bzr diff'.

The expected output of 'bzr st' at that point is:

# bzr st
unknown:
  amqp_config.py
  unit_config

'amqp_config.py' has been generated by the rabbit charm; 'unit_config' comes
from the worker charm.

Acquire additional branches
---------------------------

# cd /srv
# bzr branch lp:~vila/ubuntu-ci-services-itself/hotfix

Merge branches into the payload
-------------------------------

# cd /srv/tr_rabbit_worker
# bzr merge ../hotfix

Restart the service
-------------------

# initctl restart tr_rabbit_worker

And voila, you're now running the worker in your deployment with the hot fix
applied.


Dump the database for the Ticket System or PPA Assigner
=======================================================

Sometimes you want to poke at the brains of Django. Fortunately, our Postgres
database is small. You can dump the entire contents with the following commands:

$ juju ssh ppa-postgres/0
$ sudo pg_dumpall -U postgres -a

The -a flag to pg_dumpall tells it to only dump the data, not the schema.

Pretty-printing JSON from the shell
===================================

JSON as a single line is difficult to read; however, it's pretty easy to
reformat it with the Python json module:

$ curl http://10.0.0.243:8080/api/v1/ppa/ | python -mjson.tool
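
The same thing can be done from inside a Python script; this sketch reuses
the endpoint from the curl example above and assumes the Python 2 found on
precise:

    import json
    import urllib2

    # Fetch the JSON and re-emit it indented for humans.
    response = urllib2.urlopen('http://10.0.0.243:8080/api/v1/ppa/')
    print json.dumps(json.load(response), indent=4, sort_keys=True)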