1
Note that much more extensive documentation is available in :ref:`query-ensembl`.
8
`Ensembl <http://www.ensembl.org>`_ provides access to their MySQL databases directly or users can download and run those databases on a local machine. To use the Ensembl's UK servers for running queries, nothing special needs to be done as this is the default setting for PyCogent's ``ensembl`` module. To use a different Ensembl installation, you create an account instance:
12
>>> from cogent.db.ensembl import HostAccount
13
>>> account = HostAccount('fastcomputer.topuni.edu', 'username',
16
To specify a specific port to connect to MySQL on:
20
>>> from cogent.db.ensembl import HostAccount
21
>>> account = HostAccount('fastcomputer.topuni.edu', 'dude',
22
... 'ucanthackthis', port=3306)
24
.. we create valid account now to work on my local machines here at ANU
30
>>> uname, passwd = os.environ['ENSEMBL_ACCOUNT'].split()
31
>>> account = HostAccount('cg.anu.edu.au', uname, passwd)
36
To see what existing species are available
40
>>> from cogent.db.ensembl import Species
42
================================================================================
43
Common Name Species Name Ensembl Db Prefix
44
--------------------------------------------------------------------------------
45
A.aegypti Aedes aegypti aedes_aegypti
46
Alpaca Vicugna pacos vicugna_pacos...
48
If Ensembl has added a new species which is not yet included in ``Species``, you can add it yourself.
52
>>> Species.amendSpecies('A latinname', 'a common name')
54
You can get the common name for a species
58
>>> Species.getCommonName('Procavia capensis')
61
and the Ensembl database name prefix which will be used for all databases for this species.
65
>>> Species.getEnsemblDbPrefix('Procavia capensis')
71
Find a gene by gene symbol
72
^^^^^^^^^^^^^^^^^^^^^^^^^^
74
We query for the *BRCA2* gene for humans.
78
>>> from cogent.db.ensembl import Genome
79
>>> human = Genome('human', Release=58, account=account)
81
Genome(Species='Homo sapiens'; Release='58')
82
>>> genes = human.getGenesMatching(Symbol='BRCA2')
83
>>> for gene in genes:
84
... if gene.Symbol == 'BRCA2':
87
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
89
Find a gene by Ensembl Stable ID
90
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
92
We use the stable ID for *BRCA2*.
96
>>> from cogent.db.ensembl import Genome
97
>>> human = Genome('human', Release=58, account=account)
98
>>> gene = human.getGeneByStableId(StableId='ENSG00000139618')
100
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer 2,...'; StableId='ENSG00000139618'; Status='KNOWN'; Symbol='BRCA2')
102
Find genes matching a description
103
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
105
We look for breast cancer related genes that are estrogen induced.
109
>>> from cogent.db.ensembl import Genome
110
>>> human = Genome('human', Release=58, account=account)
111
>>> genes = human.getGenesMatching(Description='breast cancer estrogen')
112
>>> for gene in genes:
114
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='breast cancer estrogen-induced...'; StableId='ENSG00000181097'; Status='KNOWN'; Symbol='AC105219.1')
116
Get canonical transcript for a gene
117
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
119
We get the canonical transcripts for *BRCA2*.
123
>>> from cogent.db.ensembl import Genome
124
>>> human = Genome('human', Release=58, account=account)
125
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
126
>>> transcript = brca2.CanonicalTranscript
128
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
130
Get the CDS for a transcript
131
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135
>>> from cogent.db.ensembl import Genome
136
>>> human = Genome('human', Release=58, account=account)
137
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
138
>>> transcript = brca2.CanonicalTranscript
139
>>> cds = transcript.Cds
141
<class 'cogent.core.sequence.DnaSequence'>
143
ATGCCTATTGGATCCAAAGAGAGGCCA...
145
Look at all transcripts for a gene
146
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
150
>>> from cogent.db.ensembl import Genome
151
>>> human = Genome('human', Release=58, account=account)
152
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
153
>>> for transcript in brca2.Transcripts:
155
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32889610; End=32973347; length=83737; Strand='+')
156
Transcript(Species='Homo sapiens'; CoordName='13'; Start=32953976; End=32972409; length=18433; Strand='+')
158
Get the first exon for a transcript
159
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
161
We show just for the canonical transcript.
165
>>> from cogent.db.ensembl import Genome
166
>>> human = Genome('human', Release=58, account=account)
167
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
168
>>> print brca2.CanonicalTranscript.Exons[0]
169
Exon(StableId=ENSE00001184784, Rank=1)
171
Get the introns for a transcript
172
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
174
We show just for the canonical transcript.
178
>>> from cogent.db.ensembl import Genome
179
>>> human = Genome('human', Release=58, account=account)
180
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
181
>>> for intron in brca2.CanonicalTranscript.Introns:
183
Intron(TranscriptId=ENST00000380152, Rank=1)
184
Intron(TranscriptId=ENST00000380152, Rank=2)
185
Intron(TranscriptId=ENST00000380152, Rank=3)...
188
Inspect the genomic coordinate for a feature
189
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
193
>>> from cogent.db.ensembl import Genome
194
>>> human = Genome('human', Release=58, account=account)
195
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
196
>>> print brca2.Location.CoordName
198
>>> print brca2.Location.Start
200
>>> print brca2.Location.Strand
203
Get repeat elements in a genomic interval
204
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
206
We query the genome for repeats within a specific coordinate range on chromosome 13.
210
>>> from cogent.db.ensembl import Genome
211
>>> human = Genome('human', Release=58, account=account)
212
>>> repeats = human.getFeatures(CoordName='13', Start=32879610, End=32889610, feature_types='repeat')
213
>>> for repeat in repeats:
214
... print repeat.RepeatClass
218
Repeat(CoordName='13'; Start=32879362; End=32879662; length=300; Strand='-', Score=2479.0)
220
Get CpG island elements in a genomic interval
221
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
223
We query the genome for CpG islands within a specific coordinate range on chromosome 11.
227
>>> from cogent.db.ensembl import Genome
228
>>> human = Genome('human', Release=58, account=account)
229
>>> islands = human.getFeatures(CoordName='11', Start=2150341, End=2170833, feature_types='cpg')
230
>>> for island in islands:
233
CpGisland(CoordName='11'; Start=2158951; End=2162484; length=3533; Strand='-', Score=3254.0)
241
We find the genetic variants for the canonical transcript of *BRCA2*.
243
.. note:: The output is significantly truncated!
247
>>> from cogent.db.ensembl import Genome
248
>>> human = Genome('human', Release=58, account=account)
249
>>> brca2 = human.getGeneByStableId(StableId='ENSG00000139618')
250
>>> transcript = brca2.CanonicalTranscript
251
>>> print transcript.Variants
252
(<cogent.db.ensembl.region.Variation object at ...
253
>>> for variant in transcript.Variants:
256
Variation(Symbol='rs55880202'; Effect='5PRIME_UTR'; Alleles='C/T')...
261
We get a single SNP and print it's allele frequencies.
265
>>> snp = list(human.getVariation(Symbol='rs34213141'))[0]
266
>>> print snp.AlleleFreqs
267
=============================
268
allele freq sample_id
269
-----------------------------
272
-----------------------------
274
What alignment types available
275
------------------------------
277
We create a ``Compara`` instance for human, chimpanzee and macaque.
281
>>> from cogent.db.ensembl import Compara
282
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
284
>>> print compara.method_species_links
286
===================================================================================================================
287
method_link_species_set_id method_link_id species_set_id align_method align_clade
288
-------------------------------------------------------------------------------------------------------------------
289
469 10 33006 PECAN 16 amniota vertebrates Pecan
290
467 13 32905 EPO 12 eutherian mammals EPO...
292
Get genomic alignment for a gene region
293
---------------------------------------
295
We first get the syntenic region corresponding to human gene *BRCA2*.
299
>>> from cogent.db.ensembl import Compara
300
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
302
>>> human_brca2 = compara.Human.getGeneByStableId(StableId='ENSG00000139618')
303
>>> regions = compara.getSyntenicRegions(region=human_brca2, align_method='EPO', align_clade='primates')
304
>>> for region in regions:
307
Coordinate(Human,chro...,13,32889610-32962969,1)
308
Coordinate(Chimp,chro...,13,32082473-32155304,1)
309
Coordinate(Macaque,chro...,17,11686607-11760932,1)...
311
We then get a cogent ``Alignment`` object, requesting that sequences be annotated for gene spans.
315
>>> aln = region.getAlignment(feature_types='gene')
317
3 x 11471 dna alignment: Homo sapiens:chromosome:13:3296...
319
Getting related genes
320
---------------------
322
What gene relationships are available
323
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
327
>>> from cogent.db.ensembl import Compara
328
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
330
>>> print compara.getDistinct('relationship')
331
['ortholog_one2one', 'within_species_paralog', 'ortholog_one2many', ...
333
Get one-to-one orthologs
334
^^^^^^^^^^^^^^^^^^^^^^^^
336
We get the one-to-one orthologs for *BRCA2*.
340
>>> from cogent.db.ensembl import Compara
341
>>> compara = Compara(['human', 'chimp', 'macaque'], Release=58,
343
>>> orthologs = compara.getRelatedGenes(StableId='ENSG00000139618',
344
... Relationship='ortholog_one2one')
347
Relationships=ortholog_one2one
348
Gene(Species='Pan troglodytes'; BioType='protein_coding'; Description='Breast cancer 2...'; Location=Coordinate(Chimp,chro...,13,32082479-32166147,1); StableId='ENSPTRG00000005766'; Status='KNOWN'; Symbol='Q8HZQ1_PANTR')...
350
We iterate over the related members.
354
>>> for ortholog in orthologs.Members:
356
Gene(Species='Pan troglodytes'; BioType='protein_coding'; Description='Breast...
358
We get statistics on the ortholog CDS lengths.
362
>>> print orthologs.getMaxCdsLengths()
363
[10242, 10008, 10257]
365
We get the sequences as a sequence collection, with annotations for gene.
369
>>> seqs = orthologs.getSeqCollection(feature_types='gene')
371
Get CDS for all one-to-one orthologs
372
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
374
We sample all one-to-one orthologs for a group of species, generating a FASTA formatted string that can be written to file. We check all species have an ortholog and that all are translatable.
378
>>> from cogent.core.alphabet import AlphabetError
379
>>> common_names = ["mouse", "rat", "human", "opossum"]
380
>>> latin_names = set([Species.getSpeciesName(n) for n in common_names])
381
>>> latin_to_common = dict(zip(latin_names, common_names))
382
>>> compara = Compara(common_names, Release=58, account=account)
383
>>> for gene in compara.Human.getGenesMatching(BioType='protein_coding'):
384
... orthologs = compara.getRelatedGenes(gene,
385
... Relationship='ortholog_one2one')
386
... # make sure all species represented
387
... if orthologs is None or orthologs.getSpeciesSet() != latin_names:
390
... for m in orthologs.Members:
391
... try: # if sequence can't be translated, we ignore it
392
... # get the CDS without the ending stop
393
... seq = m.CanonicalTranscript.Cds.withoutTerminalStopCodon()
394
... # make the sequence name
395
... seq.Name = '%s:%s:%s' % \
396
... (latin_to_common[m.genome.Species], m.StableId, m.Location)
397
... aa = seq.getTranslation()
399
... except (AlphabetError, AssertionError):
400
... seqs = [] # exclude this gene
402
... if len(seqs) == len(common_names):
403
... fasta = '\n'.join(s.toFasta() for s in seqs)
406
Get within species paralogs
407
^^^^^^^^^^^^^^^^^^^^^^^^^^^
411
>>> paralogs = compara.getRelatedGenes(StableId='ENSG00000164032',
412
... Relationship='within_species_paralog')
415
Relationships=within_species_paralog
416
Gene(Species='Homo sapiens'; BioType='protein_coding'; Description='H2A histone...