Biopython Tutorial and Cookbook
*******************************

Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck,
===========================================================
Michiel de Hoon, Peter Cock, Tiago Antao
========================================
Last Update -- 16 August 2009 (Biopython 1.51)
==============================================
- Chapter 1 Introduction
  - 1.1 What is Biopython?
  - 1.2 What can I find in the Biopython package
  - 1.3 Installing Biopython
- Chapter 2 Quick Start -- What can you do with Biopython?
  - 2.1 General overview of what Biopython provides
  - 2.2 Working with sequences
  - 2.4 Parsing sequence file formats
    - 2.4.1 Simple FASTA parsing example
    - 2.4.2 Simple GenBank parsing example
    - 2.4.3 I love parsing -- please don't stop talking about it!
  - 2.5 Connecting with biological databases
- Chapter 3 Sequence objects
  - 3.1 Sequences and Alphabets
  - 3.2 Sequences act like strings
  - 3.3 Slicing a sequence
  - 3.4 Turning Seq objects into strings
  - 3.5 Concatenating or adding sequences
  - 3.6 Nucleotide sequences and (reverse) complements
  - 3.9 Translation Tables
  - 3.10 Comparing Seq objects
  - 3.11 MutableSeq objects
  - 3.12 UnknownSeq objects
  - 3.13 Working with strings directly
- Chapter 4 Sequence Record objects
  - 4.1 The SeqRecord object
  - 4.2 Creating a SeqRecord
    - 4.2.1 SeqRecord objects from scratch
    - 4.2.2 SeqRecord objects from FASTA files
    - 4.2.3 SeqRecord objects from GenBank files
  - 4.3 SeqFeature objects
    - 4.3.1 SeqFeatures themselves
  - 4.5 The format method
  - 4.6 Slicing a SeqRecord
- Chapter 5 Sequence Input/Output
  - 5.1 Parsing or Reading Sequences
    - 5.1.1 Reading Sequence Files
    - 5.1.2 Iterating over the records in a sequence file
    - 5.1.3 Getting a list of the records in a sequence file
    - 5.1.4 Extracting data
  - 5.2 Parsing sequences from the net
    - 5.2.1 Parsing GenBank records from the net
    - 5.2.2 Parsing SwissProt sequences from the net
  - 5.3 Sequence files as Dictionaries
    - 5.3.1 Specifying the dictionary keys
    - 5.3.2 Indexing a dictionary using the SEGUID checksum
  - 5.4 Writing Sequence Files
    - 5.4.1 Converting between sequence file formats
    - 5.4.2 Converting a file of sequences to their reverse complements
    - 5.4.3 Getting your SeqRecord objects as formatted strings
- Chapter 6 Sequence Alignment Input/Output, and Alignment Tools
  - 6.1 Parsing or Reading Sequence Alignments
    - 6.1.1 Single Alignments
    - 6.1.2 Multiple Alignments
    - 6.1.3 Ambiguous Alignments
  - 6.2 Writing Alignments
    - 6.2.1 Converting between sequence alignment file formats
    - 6.2.2 Getting your Alignment objects as formatted strings
  - 6.3 Alignment Tools
    - 6.3.3 MUSCLE using stdout
    - 6.3.4 MUSCLE using stdin and stdout
    - 6.3.5 EMBOSS needle and water
- Chapter 7 BLAST
  - 7.1 Running BLAST locally
  - 7.2 Running BLAST over the Internet
  - 7.3 Saving BLAST output
  - 7.4 Parsing BLAST output
  - 7.5 The BLAST record class
  - 7.6 Deprecated BLAST parsers
    - 7.6.1 Parsing plain-text BLAST output
    - 7.6.2 Parsing a plain-text BLAST file full of BLAST runs
    - 7.6.3 Finding a bad record somewhere in a huge plain-text BLAST file
  - 7.7 Dealing with PSI-BLAST
  - 7.8 Dealing with RPS-BLAST
- Chapter 8 Accessing NCBI's Entrez databases
  - 8.1 Entrez Guidelines
  - 8.2 EInfo: Obtaining information about the Entrez databases
  - 8.3 ESearch: Searching the Entrez databases
  - 8.4 EPost: Uploading a list of identifiers
  - 8.5 ESummary: Retrieving summaries from primary IDs
  - 8.6 EFetch: Downloading full records from Entrez
  - 8.7 ELink: Searching for related items in NCBI Entrez
  - 8.8 EGQuery: Obtaining counts for search terms
  - 8.9 ESpell: Obtaining spelling suggestions
  - 8.10 Specialized parsers
    - 8.10.1 Parsing Medline records
    - 8.10.2 Parsing GEO records
    - 8.10.3 Parsing UniGene records
    - 8.12.1 PubMed and Medline
    - 8.12.2 Searching, downloading, and parsing Entrez Nucleotide records
    - 8.12.3 Searching, downloading, and parsing GenBank records
    - 8.12.4 Finding the lineage of an organism
  - 8.13 Using the history and WebEnv
    - 8.13.1 Searching for and downloading sequences using the history
    - 8.13.2 Searching for and downloading abstracts using the history
- Chapter 9 Swiss-Prot and ExPASy
  - 9.1 Parsing Swiss-Prot files
    - 9.1.1 Parsing Swiss-Prot records
    - 9.1.2 Parsing the Swiss-Prot keyword and category list
  - 9.2 Parsing Prosite records
  - 9.3 Parsing Prosite documentation records
  - 9.4 Parsing Enzyme records
  - 9.5 Accessing the ExPASy server
    - 9.5.1 Retrieving a Swiss-Prot record
    - 9.5.2 Searching Swiss-Prot
    - 9.5.3 Retrieving Prosite and Prosite documentation records
  - 9.6 Scanning the Prosite database
- Chapter 10 Going 3D: The PDB module
  - 10.1 Structure representation
    - 10.2.1 General approach
    - 10.2.2 Disordered atoms
    - 10.2.3 Disordered residues
  - 10.3 Hetero residues
    - 10.3.1 Associated problems
    - 10.3.2 Water residues
    - 10.3.3 Other hetero residues
  - 10.4 Some random usage examples
  - 10.5 Common problems in PDB files
    - 10.5.2 Automatic correction
    - 10.5.3 Fatal errors
  - 10.6 Other features
- Chapter 11 Bio.PopGen: Population genetics
  - 11.2 Coalescent simulation
    - 11.2.1 Creating scenarios
    - 11.2.2 Running SIMCOAL2
  - 11.3 Other applications
    - 11.3.1 FDist: Detecting selection and molecular adaptation
  - 11.4 Future Developments
- Chapter 12 Supervised learning methods
  - 12.1 The Logistic Regression Model
    - 12.1.1 Background and Purpose
    - 12.1.2 Training the logistic regression model
    - 12.1.3 Using the logistic regression model for classification
    - 12.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
  - 12.2 k-Nearest Neighbors
    - 12.2.1 Background and purpose
    - 12.2.2 Initializing a k-nearest neighbors model
    - 12.2.3 Using a k-nearest neighbors model for classification
  - 12.4 Maximum Entropy
- Chapter 13 Graphics including GenomeDiagram
    - 13.1.1 Introduction
    - 13.1.2 Diagrams, tracks, feature-sets and features
    - 13.1.3 A top down example
    - 13.1.4 A bottom up example
    - 13.1.5 Features without a SeqFeature
    - 13.1.6 Feature captions
    - 13.1.7 Feature sigils
    - 13.1.8 A nice example
    - 13.1.9 Further options
    - 13.1.10 Converting old code
- Chapter 14 Cookbook -- Cool things to do with it
  - 14.1 Working with sequence files
    - 14.1.1 Producing randomised genomes
    - 14.1.2 Translating a FASTA file of CDS entries
    - 14.1.3 Simple quality filtering for FASTQ files
    - 14.1.4 Trimming off primer sequences
    - 14.1.5 Trimming off adaptor sequences
    - 14.1.6 Converting FASTQ files
    - 14.1.7 Identifying open reading frames
  - 14.2 Sequence parsing plus simple plots
    - 14.2.1 Histogram of sequence lengths
    - 14.2.2 Plot of sequence GC%
    - 14.2.3 Nucleotide dot plots
    - 14.2.4 Plotting the quality scores of sequencing read data
  - 14.3 Dealing with alignments
    - 14.3.1 Calculating summary information
    - 14.3.2 Calculating a quick consensus sequence
    - 14.3.3 Position Specific Score Matrices
    - 14.3.4 Information Content
  - 14.4 Substitution Matrices
    - 14.4.1 Using common substitution matrices
    - 14.4.2 Creating your own substitution matrix from an alignment
  - 14.5 BioSQL -- storing sequences in a relational database
- Chapter 15 The Biopython testing framework
  - 15.1 Running the tests
    - 15.2.1 Writing a print-and-compare test
    - 15.2.2 Writing a unittest-based test
  - 15.3 Writing doctests
- Chapter 16 Advanced
  - 16.2 Substitution Matrices
- Chapter 17 Where to go from here -- contributing to Biopython
  - 17.1 Bug Reports + Feature Requests
  - 17.2 Mailing lists and helping newcomers
  - 17.3 Contributing Documentation
  - 17.4 Contributing cookbook examples
  - 17.5 Maintaining a distribution for a platform
  - 17.6 Contributing Unit Tests
  - 17.7 Contributing Code
- Chapter 18 Appendix: Useful stuff about Python
  - 18.1 What the heck is a handle?
    - 18.1.1 Creating a handle from a string
Chapter 1 Introduction
*************************

1.1 What is Biopython?
*=*=*=*=*=*=*=*=*=*=*=*

The Biopython Project is an international association of developers of
freely available Python (http://www.python.org) tools for computational
molecular biology. The web site http://www.biopython.org provides an
online resource for modules, scripts, and web links for developers of
Python-based software for life science research.

Basically, we just like to program in Python and want to make it as
easy as possible to use Python for bioinformatics by creating
high-quality, reusable modules and scripts.
1.2 What can I find in the Biopython package
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

The main Biopython releases have lots of functionality, including:

- The ability to parse bioinformatics files into Python utilizable
  data structures, including support for the following formats:

  - Blast output -- both from standalone and WWW Blast
  - ExPASy files, like Enzyme and Prosite
  - SCOP, including `dom' and `lin' files

- Files in the supported formats can be iterated over record by
  record or indexed and accessed via a Dictionary interface.
- Code to deal with popular on-line bioinformatics destinations such
  as:

  - NCBI -- Blast, Entrez and PubMed services
  - ExPASy -- Swiss-Prot and Prosite entries, as well as Prosite
    searches

- Interfaces to common bioinformatics programs such as:

  - Standalone Blast from NCBI
  - Clustalw alignment program
  - EMBOSS command line tools

- A standard sequence class that deals with sequences, ids on
  sequences, and sequence features.
- Tools for performing common operations on sequences, such as
  translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors,
  Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to
  create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate
  processes.
- GUI-based programs to do basic sequence manipulations,
  translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including
  this file, on-line wiki documentation, the web site, and the mailing
  lists.
- Integration with BioSQL, a sequence database schema also supported
  by the BioPerl and BioJava projects.

We hope this gives you plenty of reasons to download and start using
Biopython!
1.3 Installing Biopython
*=*=*=*=*=*=*=*=*=*=*=*=*

All of the installation information for Biopython was separated from
this document to make it easier to keep updated.

The short version is go to our downloads page
(http://biopython.org/wiki/Download), download and install the listed
dependencies, then download and install Biopython. For Windows we
provide pre-compiled click-and-run installers, while for Unix and other
operating systems you must install from source as described in the
included README file. This is usually as simple as the standard
commands:

<<python setup.py build
python setup.py test
sudo python setup.py install
>>

(You can in fact skip the build and test, and go straight to the
install -- but it's better to make sure everything seems to be working.)

The longer version of our installation instructions covers
installation of Python, Biopython dependencies and Biopython itself. It
is available in PDF
(http://biopython.org/DIST/docs/install/Installation.pdf) and HTML
formats (http://biopython.org/DIST/docs/install/Installation.html).
1.4 Frequently Asked Questions (FAQ)
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

1. How do I cite Biopython in a scientific publication?
   Please cite our application note, Cock et al. 2009,
   doi:10.1093/bioinformatics/btp163 (1), and/or one of the publications
   listed on our website describing specific modules within Biopython.

2. How should I capitalize "Biopython"? Is "BioPython" OK?
   The correct capitalization is "Biopython", not "BioPython" (even
   though that would have matched BioPerl, BioJava and BioRuby).

3. Where is the latest version of this document?
   If you download a Biopython source code archive, it will include the
   relevant version in both HTML and PDF formats. The latest published
   version of this document is online at:

   - http://biopython.org/DIST/docs/tutorial/Tutorial.html
   - http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

4. Which "Numerical Python" do I need?
   For Biopython 1.48 or earlier, you need the old Numeric module. For
   Biopython 1.49 onwards, you need the newer NumPy instead. Both
   Numeric and NumPy can be installed on the same machine fine. See
   also: http://numpy.scipy.org/

5. Why is the 'Seq' object missing the (back) transcription &
   translation methods described in this Tutorial?
   You need Biopython 1.49 or later. Alternatively, use the 'Bio.Seq'
   module functions described in Section 3.13.

6. Why doesn't the 'Seq' object translation method support the 'cds'
   option described in this Tutorial?
   You need Biopython 1.51 or later.

7. Why doesn't 'Bio.SeqIO' work? It imports fine but there is no
   parse function!
   You need Biopython 1.43 or later. Older versions did contain some
   related code under the 'Bio.SeqIO' name which has since been removed
   - and this is why the import "works".

8. Why doesn't 'Bio.SeqIO.read()' work? The module imports fine but
   there is no read function!
   You need Biopython 1.45 or later. Or, use Bio.SeqIO.parse(...).next()
   instead.

9. Why isn't 'Bio.AlignIO' present? The module import fails!
   You need Biopython 1.46 or later.

10. What file formats do 'Bio.SeqIO' and 'Bio.AlignIO' read and
    write?
    Check the built in docstrings (from Bio import SeqIO, then
    help(SeqIO)), or see http://biopython.org/wiki/SeqIO and
    http://biopython.org/wiki/AlignIO on the wiki for the latest listing.

11. Why don't the 'Bio.SeqIO' and 'Bio.AlignIO' input functions let
    me provide a sequence alphabet?
    You need Biopython 1.49 or later.

12. Why doesn't 'str(...)' give me the full sequence of a 'Seq'
    object?
    You need Biopython 1.45 or later. Alternatively, rather than
    'str(my_seq)', use 'my_seq.tostring()' (which will also work on
    recent versions of Biopython).

13. Why doesn't 'Bio.Blast' work with the latest plain text NCBI
    blast output?
    The NCBI keep tweaking the plain text output from the BLAST tools,
    and keeping our parser up to date was an ongoing struggle. We
    recommend you use the XML output instead, which is designed to be
    read by a computer program.

14. Why doesn't 'Bio.Entrez.read()' work? The module imports fine but
    there is no read function!
    You need Biopython 1.46 or later.

15. Why doesn't 'Bio.PDB.MMCIFParser' work? I see an import error!
    Since Biopython 1.42, the underlying 'Bio.PDB.mmCIF.MMCIFlex' module
    has not been installed by default. It requires a third party tool
    called flex (fast lexical analyzer generator). At the time of
    writing, you'll have to install flex, then tweak your Biopython
    'setup.py' file and reinstall from source.

16. Why doesn't 'Bio.Blast.NCBIXML.read()' work? The module imports
    fine but there is no read function!
    You need Biopython 1.50 or later. Or, use
    Bio.Blast.NCBIXML.parse(...).next() instead.

17. Why doesn't my 'SeqRecord' object have a 'letter_annotations'
    attribute?
    Per-letter-annotation support was added in Biopython 1.50.

18. Why can't I slice my 'SeqRecord' to get a sub-record?
    You need Biopython 1.50 or later.

19. I looked in a directory for code, but I couldn't find the code
    that does something. Where's it hidden?
    One thing to know is that we put code in '__init__.py' files. If you
    are not used to looking for code in this file this can be confusing.
    The reason we do this is to make the imports easier for users. For
    instance, instead of having to do a "repetitive" import like 'from
    Bio.GenBank import GenBank', you can just use 'from Bio import
    GenBank'.

-----------------------------------

(1) http://dx.doi.org/10.1093/bioinformatics/btp163
Chapter 2 Quick Start -- What can you do with Biopython?
***********************************************************

This section is designed to get you started quickly with Biopython,
and to give a general overview of what is available and how to use it.
All of the examples in this section assume that you have some general
working knowledge of Python, and that you have successfully installed
Biopython on your system. If you think you need to brush up on your
Python, the main Python web site provides quite a bit of free
documentation to get started with (http://www.python.org/doc/).

Since much biological work on the computer involves connecting with
databases on the internet, some of the examples will also require a
working internet connection in order to run.

Now that that is all out of the way, let's get into what we can do
with Biopython.
2.1 General overview of what Biopython provides
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

As mentioned in the introduction, Biopython is a set of libraries to
provide the ability to deal with "things" of interest to biologists
working on the computer. In general this means that you will need to
have at least some programming experience (in Python, of course!) or at
least an interest in learning to program. Biopython's job is to make
your job easier as a programmer by supplying reusable libraries so that
you can focus on answering your specific question of interest, instead
of focusing on the internals of parsing a particular file format (of
course, if you want to help by writing a parser that doesn't exist and
contributing it to Biopython, please go ahead!). So Biopython's job is
to make you happy!

One thing to note about Biopython is that it often provides multiple
ways of "doing the same thing." Things have improved in recent releases,
but this can still be frustrating as in Python there should ideally be
one right way to do something. However, this can also be a real benefit
because it gives you lots of flexibility and control over the libraries.
The tutorial helps to show you the common or easy ways to do things so
that you can just make things work. To learn more about the alternative
possibilities, look in the Cookbook (Chapter 14, this has some cool
tricks and tips), the Advanced section (Chapter 16), the built in
"docstrings" (via the Python help command, or the API documentation (1))
or ultimately the code itself.
2.2 Working with sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*

Disputably (of course!), the central object in bioinformatics is the
sequence. Thus, we'll start with a quick introduction to the Biopython
mechanisms for dealing with sequences, the 'Seq' object, which we'll
discuss in more detail in Chapter 3.

Most of the time when we think about sequences we have in mind a
string of letters like 'AGTACACTGGT'. You can create such a 'Seq'
object with this sequence as follows - the ">>>" represents the Python
prompt followed by what you would type in:

<<>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>

What we have here is a sequence object with a generic alphabet -
reflecting the fact we have not specified if this is a DNA or protein
sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines
and Threonines!). We'll talk more about alphabets in Chapter 3.

In addition to having an alphabet, the 'Seq' object differs from the
Python string in the methods it supports. You can't do this with a plain
string:

<<>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
>>
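If you are curious what complement() and reverse_complement() are doing, the idea can be sketched in a few lines of plain Python (a toy illustration only - Biopython's real methods also handle alphabets and IUPAC ambiguity codes):

```python
# Toy sketch of complementing a DNA string -- not Biopython's code.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    # Swap each base for its Watson-Crick partner.
    return "".join(PAIRS[base] for base in seq)

def reverse_complement(seq):
    # Complement each base, then read the result backwards.
    return complement(seq)[::-1]

print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

Note the results match the Seq example above; in real code you would of course just call the Seq object's methods.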
The next most important class is the 'SeqRecord' or Sequence Record.
This holds a sequence (as a 'Seq' object) with additional annotation
including an identifier, name and description. The 'Bio.SeqIO' module
for reading and writing sequence file formats works with 'SeqRecord'
objects, which will be introduced below and covered in more detail by
Chapter 5.

This covers the basic features and uses of the Biopython sequence
class. Now that you've got some idea of what it is like to interact with
the Biopython libraries, it's time to delve into the fun, fun world of
dealing with biological file formats!
2.3 A usage example
*=*=*=*=*=*=*=*=*=*=

Before we jump right into parsers and everything else to do with
Biopython, let's set up an example to motivate everything we do and make
life more interesting. After all, if there wasn't any biology in this
tutorial, why would you want to read it?

Since I love plants, I think we're just going to have to have a plant
based example (sorry to all the fans of other organisms out there!).
Having just completed a recent trip to our local greenhouse, we've
suddenly developed an incredible obsession with Lady Slipper Orchids (if
you wonder why, have a look at some Lady Slipper Orchids photos on
Flickr (2), or try a Google Image Search (3)).

Of course, orchids are not only beautiful to look at, they are also
extremely interesting for people studying evolution and systematics. So
let's suppose we're thinking about writing a funding proposal to do a
molecular study of Lady Slipper evolution, and would like to see what
kind of research has already been done and how we can add to that.

After a little bit of reading up we discover that the Lady Slipper
Orchids are in the Orchidaceae family and the Cypripedioideae sub-family
and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium,
Selenipedium and Mexipedium.

That gives us enough to get started delving for more information. So,
let's look at how the Biopython tools can help us. We'll start with
sequence parsing in Section 2.4, but the orchids will be back later on
as well - for example we'll search PubMed for papers about orchids and
extract sequence data from GenBank in Chapter 8, extract data from
Swiss-Prot from certain orchid proteins in Chapter 9, and work with
ClustalW multiple sequence alignments of orchid proteins in Chapter 6.
2.4 Parsing sequence file formats
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

A large part of bioinformatics work involves dealing with the many
types of file formats designed to hold biological data. These files
are loaded with interesting biological data, and a special challenge is
parsing these files into a format so that you can manipulate them with
some kind of programming language. However the task of parsing these
files can be frustrated by the fact that the formats can change quite
regularly, and that formats may contain small subtleties which can break
even the most well designed parsers.

We are now going to briefly introduce the 'Bio.SeqIO' module -- you
can find out more in Chapter 5. We'll start with an online search for
our friends, the lady slipper orchids. To keep this introduction simple,
we're just using the NCBI website by hand. Let's just take a look
through the nucleotide databases at NCBI, using an Entrez online search
(http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for
everything mentioning the text Cypripedioideae (this is the subfamily of
lady slipper orchids).

When this tutorial was originally written, this search gave us only 94
hits, which we saved as a FASTA formatted text file and as a GenBank
formatted text file (files ls_orchid.fasta (4) and ls_orchid.gbk (5),
also included with the Biopython source code under
docs/tutorial/examples/).

If you run the search today, you'll get hundreds of results! When
following the tutorial, if you want to see the same list of genes, just
download the two files above or copy them from 'docs/examples/' in the
Biopython source code. In Section 2.5 we will look at how to do a search
like this from within Python.
2.4.1 Simple FASTA parsing example
===================================

If you open the lady slipper orchids FASTA file ls_orchid.fasta (6) in
your favourite text editor, you'll see that the file starts like this:

<<>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
>>

It contains 94 records, each has a line starting with ">"
(greater-than symbol) followed by the sequence on one or more lines. Now
try this in Python:

<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()
>>

You should get something like this on your screen:

<<gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
SingleLetterAlphabet())
592
>>

Notice that the FASTA format does not specify the alphabet, so
'Bio.SeqIO' has defaulted to the rather generic 'SingleLetterAlphabet()'
rather than something DNA specific.
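Under the hood, a FASTA parser is doing something like the following plain Python sketch (greatly simplified - Bio.SeqIO also builds SeqRecord objects, handles other formats, and copes with edge cases this ignores):

```python
import io  # used only for the in-memory demonstration below

def simple_fasta_parse(handle):
    """Yield (title, sequence) tuples from a FASTA file handle.
    Bare-bones sketch: no alphabets, no error checking."""
    title, seq_lines = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(seq_lines)
            # Start a new record; drop the ">" from the title line.
            title, seq_lines = line[1:], []
        elif line:
            seq_lines.append(line)
    if title is not None:
        yield title, "".join(seq_lines)

# Demonstration on a tiny two record FASTA held in memory:
demo = io.StringIO(">alpha example\nACGTACGT\nACGT\n>beta\nTTTT\n")
for title, seq in simple_fasta_parse(demo):
    print(title, len(seq))
```

In real code you would of course use Bio.SeqIO as shown above, rather than rolling your own parser.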
2.4.2 Simple GenBank parsing example
=====================================

Now let's load the GenBank file ls_orchid.gbk (7) instead - notice
that the code to do this is almost identical to the snippet used above
for the FASTA file - the only difference is we change the filename and
the format string:

<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()
>>

This should give:

<<Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
IUPACAmbiguousDNA())
592
>>

This time 'Bio.SeqIO' has been able to choose a sensible alphabet,
IUPAC Ambiguous DNA. You'll also notice that a shorter string has been
used as the 'seq_record.id' in this case.
2.4.3 I love parsing -- please don't stop talking about it!
============================================================

Biopython has a lot of parsers, and each has its own little special
niches based on the sequence format it is parsing and all of that.
Chapter 5 covers 'Bio.SeqIO' in more detail, while Chapter 6 introduces
'Bio.AlignIO' for sequence alignments.

While the most popular file formats have parsers integrated into
'Bio.SeqIO' and/or 'Bio.AlignIO', for some of the rarer and unloved file
formats there is either no parser at all, or an old parser which has not
been linked in yet. Please also check the wiki pages
http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO
for the latest information, or ask on the mailing list. The wiki pages
should include an up to date list of supported file types, and some
additional examples.

The next place to look for information about specific parsers and how
to do cool things with them is in the Cookbook (Chapter 14 of this
Tutorial). If you don't find the information you are looking for, please
consider helping out your poor overworked documentors and submitting a
cookbook entry about it! (once you figure out how to do it, that is!)
2.5 Connecting with biological databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

One of the very common things that you need to do in bioinformatics is
extract information from biological databases. It can be quite tedious
to access these databases manually, especially if you have a lot of
repetitive work to do. Biopython attempts to save you time and energy by
making some on-line databases available from Python scripts. Currently,
Biopython has code to extract information from the following databases:

- Entrez (8) (and PubMed (9)) from the NCBI -- See Chapter 8.
- ExPASy (10) -- See Chapter 9.
- SCOP (11) -- See the 'Bio.SCOP.search()' function.

The code in these modules basically makes it easy to write Python code
that interacts with the CGI scripts on these pages, so that you can get
results in an easy to deal with format. In some cases, the results can
be tightly integrated with the Biopython parsers to make it even easier
to extract information.
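To give a flavour of what these modules wrap, here is a hand-rolled sketch that just builds (without fetching) an NCBI Entrez "efetch" query URL. This is for illustration only: in practice you should use Bio.Entrez (Chapter 8), which also handles error checking and the NCBI usage guidelines.

```python
# Sketch of the kind of CGI request the database modules construct.
# Uses the modern urllib.parse module; 2009-era code used urllib.
import urllib.parse

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(db, identifier, rettype):
    """Build (but do not fetch) an Entrez efetch query URL."""
    params = {"db": db, "id": identifier,
              "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urllib.parse.urlencode(params)

# e.g. fetch one of our orchid sequences in FASTA format:
print(efetch_url("nucleotide", "Z78533.1", "fasta"))
```

Opening that URL in a browser (or with urllib) returns the record as plain text, which you could then feed straight into Bio.SeqIO.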
2.6 What to do next
*=*=*=*=*=*=*=*=*=*=

Now that you've made it this far, you hopefully have a good
understanding of the basics of Biopython and are ready to start using it
for doing useful work. The best thing to do now is finish reading this
tutorial, and then if you want start snooping around in the source code,
and looking at the automatically generated documentation.

Once you get a picture of what you want to do, and what libraries in
Biopython will do it, you should take a peek at the Cookbook
(Chapter 14), which may have example code to do something similar to
what you want to do.

If you know what you want to do, but can't figure out how to do it,
please feel free to post questions to the main Biopython list (see
http://biopython.org/wiki/Mailing_lists). This will not only help us
answer your question, it will also allow us to improve the documentation
so it can help the next person do what you want to do.
-----------------------------------

(1) http://biopython.org/DIST/docs/api/
(2) http://www.flickr.com/search/?q=lady+slipper+orchid&s=int&z=t
(3) http://images.google.com/images?q=lady%20slipper%20orchid
(4) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(8) http://www.ncbi.nlm.nih.gov/Entrez/
(9) http://www.ncbi.nlm.nih.gov/PubMed/
(10) http://www.expasy.org/
(11) http://scop.mrc-lmb.cam.ac.uk/scop/
Chapter 3 Sequence objects
*****************************

Biological sequences are arguably the central object in
bioinformatics, and in this chapter we'll introduce the Biopython
mechanism for dealing with sequences, the 'Seq' object. Chapter 4 will
introduce the related 'SeqRecord' object, which combines the sequence
information with any annotation, used again in Chapter 5 for Sequence
Input/Output.

Sequences are essentially strings of letters like 'AGTACACTGGT', which
seems very natural since this is the most common way that sequences are
seen in biological file formats.

There are two important differences between 'Seq' objects and standard
Python strings. First of all, they have different methods. Although the
'Seq' object supports many of the same methods as a plain string, its
'translate()' method differs by doing biological translation, and there
are also additional biologically relevant methods like
'reverse_complement()'. Secondly, the 'Seq' object has an important
attribute, 'alphabet', which is an object describing what the individual
characters making up the sequence string "mean", and how they should be
interpreted. For example, is 'AGTACACTGGT' a DNA sequence, or just a
protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
3.1 Sequences and Alphabets
*=*=*=*=*=*=*=*=*=*=*=*=*=*=

The alphabet object is perhaps the most important thing that makes the
'Seq' object more than just a string. The currently available alphabets
for Biopython are defined in the 'Bio.Alphabet' module. We'll use the
IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with
some of our favorite objects: DNA, RNA and Proteins.

'Bio.Alphabet.IUPAC' provides basic definitions for proteins, DNA and
RNA, but additionally provides the ability to extend and customize the
basic definitions. For instance, for proteins, there is a basic
IUPACProtein class, but there is an additional ExtendedIUPACProtein
class providing for the additional elements "U" (or "Sec" for
selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous
symbols "B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx"
for glutamine or glutamic acid), "J" (or "Xle" for leucine or
isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've
got choices of IUPACUnambiguousDNA, which provides for just the basic
letters, IUPACAmbiguousDNA (which provides for ambiguity letters for
every possible situation) and ExtendedIUPACDNA, which allows letters for
modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA
or IUPACUnambiguousRNA.

The advantages of having an alphabet class are two fold. First, this
gives an idea of the type of information the Seq object contains.
Secondly, this provides a means of constraining the information, as a
means of type checking.
958
Now that we know what we are dealing with, let's look at how to
959
utilize this class to do interesting work. You can create an ambiguous
960
sequence with the default generic alphabet like this:
961
<<>>> from Bio.Seq import Seq
962
>>> my_seq = Seq("AGTACACTGGT")
964
Seq('AGTACACTGGT', Alphabet())
969
However, where possible you should specify the alphabet explicitly
970
when creating your sequence objects - in this case an unambiguous DNA
972
<<>>> from Bio.Seq import Seq
973
>>> from Bio.Alphabet import IUPAC
974
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
976
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
978
IUPACUnambiguousDNA()
981
Unless of course, this really is an amino acid sequence:
982
<<>>> from Bio.Seq import Seq
983
>>> from Bio.Alphabet import IUPAC
984
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
986
Seq('AGTACACTGGT', IUPACProtein())
3.2 Sequences act like strings
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
In many ways, we can deal with Seq objects as if they were normal
Python strings, for example getting the length, or iterating over the
elements:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
for index, letter in enumerate(my_seq):
    print index, letter
print len(my_seq)

You can access elements of the sequence in the same way as for strings
(but remember, Python counts from zero!):

>>> print my_seq[0]  # first letter
G
>>> print my_seq[2]  # third letter
T
>>> print my_seq[-1]  # last letter
C

The 'Seq' object has a '.count()' method, just like a string. Note
that this means that, like a Python string, this gives a
non-overlapping count:

>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

For some biological uses, you may actually want an overlapping count
(i.e. 3 in this trivial example). When searching for single letters,
this makes no difference:

>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
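Returning to the overlapping count mentioned above, here is a
plain-Python sketch (the helper name overlapping_count is ours, for
illustration only; it is not a 'Seq' method):

```python
def overlapping_count(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps."""
    count = 0
    start = 0
    while True:
        # find() returns -1 once no further match exists
        start = seq.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count

print(overlapping_count("AAAA", "AA"))  # 3, versus "AAAA".count("AA") == 2
```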
While you could use the above snippet of code to calculate a GC%, note
that the 'Bio.SeqUtils' module has several GC functions already built:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

Note that using the 'Bio.SeqUtils.GC()' function should automatically
cope with mixed case sequences and the ambiguous nucleotide S (which
means G or C).
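For illustration, a plain-Python helper with the same behavior for
mixed case and the ambiguous letter S (gc_percent is our own sketch,
not the actual Bio.SeqUtils implementation):

```python
def gc_percent(sequence):
    """GC percentage, counting the ambiguous nucleotide S (G or C)
    and ignoring case."""
    s = sequence.upper()
    gc = sum(s.count(letter) for letter in "GCS")
    return 100.0 * gc / len(s)

print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))  # 46.875
```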
Also note that just like a normal Python string, the 'Seq' object is
in some ways "read-only". If you need to edit your sequence, for
example simulating a point mutation, look at Section 3.11 below, which
talks about the 'MutableSeq' object.
3.3 Slicing a sequence
*=*=*=*=*=*=*=*=*=*=*=
For a more complicated example, let's get a slice of the sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())

Two things are interesting to note. First, this follows the normal
conventions for Python strings. So the first element of the sequence
is 0 (which is normal for computer science, but not so normal for
biology). When you do a slice the first item is included (i.e. 4 in
this case) and the last is excluded (12 in this case), which is the
way things work in Python, but of course not necessarily the way
everyone in the world would expect. The main goal is to stay
consistent with what Python does.

The second thing to notice is that the slice is performed on the
sequence data string, but the new object produced is another 'Seq'
object which retains the alphabet information from the original 'Seq'
object.

Also like a Python string, you can do slices with a start, stop and
stride (the step size, which defaults to one). For example, we can get
the first, second and third codon positions of this DNA sequence:

>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Another stride trick you might have seen with a Python string is the
use of a -1 stride to reverse the string. You can do this with a 'Seq'
object too:

>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
3.4 Turning Seq objects into strings
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
If you really do just need a plain string, for example to write to a
file, or insert into a database, then this is very easy to get:

>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'

Since calling 'str()' on a 'Seq' object returns the full sequence as a
string, you often don't actually have to do this conversion
explicitly. Python does this automatically with a print statement:

>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC

You can also use the 'Seq' object directly with a '%s' placeholder
when using the Python string formatting or interpolation operator
('%'):

>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print fasta_format_string
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC

This line of code constructs a simple FASTA format record (without
worrying about line wrapping). Section 4.5 describes a neat way to get
a FASTA formatted string from a 'SeqRecord' object, while the more
general topic of reading and writing FASTA format sequence files is
covered in Chapter 5.
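If you do care about line wrapping, a simple formatter is easy to
sketch in plain Python ('Bio.SeqIO' handles this properly, see
Chapter 5; fasta_format here is our own illustrative helper):

```python
def fasta_format(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at width columns."""
    seq = str(sequence)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))

print(fasta_format("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10))
```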
NOTE: If you are using Biopython 1.44 or older, using 'str(my_seq)'
will give just a truncated representation. Instead use
'my_seq.tostring()' (which is still available in the current Biopython
releases for backwards compatibility):

>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
3.5 Concatenating or adding sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Naturally, you can in principle add any two Seq objects together -
just like you can with Python strings to concatenate them. However,
you can't add sequences with incompatible alphabets, such as a protein
sequence and a DNA sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: ('incompatable alphabets', 'IUPACProtein()',
'IUPACUnambiguousDNA()')

If you really wanted to do this, you'd have to first give both
sequences generic alphabets:

>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
Here is an example of adding a generic nucleotide sequence to an
unambiguous IUPAC DNA sequence, resulting in a generic nucleotide
sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
3.6 Nucleotide sequences and (reverse) complements
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
For nucleotide sequences, you can easily obtain the complement or
reverse complement of a 'Seq' object using its built-in methods:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

In all of these operations, the alphabet property is maintained. This
is very useful in case you accidentally end up trying to do something
weird like take the (reverse) complement of a protein sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!
The example in Section 5.4.2 combines the 'Seq' object's reverse
complement method with 'Bio.SeqIO' for sequence input/output.
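For unambiguous DNA, the same operation can be sketched with plain
Python string tools (a simplification: the real 'Seq' method also
understands IUPAC ambiguity codes):

```python
# Reverse complement for plain strings of unambiguous DNA.
# str.maketrans is Python 3; on Python 2 use string.maketrans instead.
def reverse_complement(dna):
    complement = str.maketrans("ACGT", "TGCA")
    return dna.translate(complement)[::-1]

print(reverse_complement("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
# GCGATTTTCGATCCTATATAGGCCCATCGATC -- matching the Seq example above
```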
3.7 Transcription
*=*=*=*=*=*=*=*=*

Before talking about transcription, I want to try and clarify the
strand issue. Consider the following (made up) stretch of double
stranded DNA which encodes a short peptide:

   DNA coding strand (aka Crick strand, strand +1)
5' ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3'
   |||||||||||||||||||||||||||||||||||||||
3' TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5'
   DNA template strand (aka Watson strand, strand -1)

   (transcription)

5' AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3'
   Single stranded messenger RNA
The actual biological transcription process works from the template
strand, doing a reverse complement (TCAG -> CUGA) to give the mRNA.
However, in Biopython and bioinformatics in general, we typically work
directly with the coding strand because this means we can get the mRNA
sequence just by switching T -> U.
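As a plain-string illustration of this shortcut (the 'Seq' object's
transcribe method, shown below, additionally switches the alphabet to
RNA):

```python
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
# Transcription from the coding strand is just the T -> U substitution
messenger_rna = coding_dna.replace("T", "U")
print(messenger_rna)  # AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
# ...and back-transcription is the reverse substitution
assert messenger_rna.replace("U", "T") == coding_dna
```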
Now let's actually get down to doing a transcription in Biopython.
First, let's create 'Seq' objects for the coding and template DNA
strands:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
These should match the figure above - remember by convention
nucleotide sequences are normally read from the 5' to 3' direction,
while in the figure the template strand is shown reversed.

Now let's transcribe the coding strand into the corresponding mRNA,
using the 'Seq' object's built-in 'transcribe' method:

>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T -> U, and adjust the
alphabet.

If you do want to do a true biological transcription starting with the
template strand, then this becomes a two-step process:

>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
The 'Seq' object also includes a back-transcription method for going
from the mRNA to the coding strand of the DNA. Again, this is a simple
U -> T substitution and associated change of alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

Note: The 'Seq' object's 'transcribe' and 'back_transcribe' methods
are new in Biopython 1.49. For older releases you would have to use
the 'Bio.Seq' module's functions instead, see Section 3.13.
3.8 Translation
*=*=*=*=*=*=*=*

Sticking with the same example discussed in the transcription section
above, now let's translate this mRNA into the corresponding protein
sequence - again taking advantage of one of the 'Seq' object's
biological methods:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also translate directly from the coding strand DNA sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You should notice in the above protein sequences that in addition to
the end stop character, there is an internal stop as well. This was a
deliberate choice of example, as it gives an excuse to talk about some
optional arguments, including different translation tables (Genetic
Codes).

The translation tables available in Biopython are based on those from
the NCBI (1) (see the next section of this tutorial). By default,
translation will use the standard genetic code (NCBI table id 1).
Suppose we are dealing with a mitochondrial sequence. We need to tell
the translation function to use the relevant genetic code instead:

>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also specify the table using the NCBI table number, which is
shorter, and often included in the feature annotation of GenBank
files:

>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
Now, you may want to translate the nucleotides up to the first
in-frame stop codon, and then stop (as happens in nature):

>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())

Notice that when you use the 'to_stop' argument, the stop codon itself
is not translated - and the stop symbol is not included at the end of
your protein sequence.
You can even specify the stop symbol if you don't like the default
asterisk:

>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
Now, suppose you have a complete coding sequence CDS, which is to say
a nucleotide sequence (e.g. mRNA -- after any splicing) which is a
whole number of codons (i.e. the length is a multiple of three),
commences with a start codon, ends with a stop codon, and has no
internal in-frame stop codons. In general, given a complete CDS, the
default translate method will do what you want (perhaps with the
'to_stop' option). However, what if your sequence uses a non-standard
start codon? This happens a lot in bacteria -- for example the gene
yaaX in E. coli K12:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
...            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...            generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In the bacterial genetic code GTG is a valid start codon, and while it
does normally encode valine, if used as a start codon it should be
translated as methionine. This happens if you tell Biopython your
sequence is a complete CDS:

>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In addition to telling Biopython to translate an alternative start
codon as methionine, using this option also makes sure your sequence
really is a valid CDS (you'll get an exception if not).

The example in Section 14.1.2 combines the 'Seq' object's translate
method with 'Bio.SeqIO' for sequence input/output.

Note: The 'Seq' object's 'translate' method is new in Biopython 1.49.
For older releases you would have to use the 'Bio.Seq' module's
'translate' function instead, see Section 3.13. The cds option is new
in Biopython 1.51, and there is no simple way to do this with older
versions of Biopython.
3.9 Translation Tables
*=*=*=*=*=*=*=*=*=*=*=
In the previous sections we talked about the 'Seq' object translation
method (and mentioned the equivalent function in the 'Bio.Seq' module
-- see Section 3.13). Internally these use codon table objects derived
from the NCBI information at
ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more
readable layout.

As before, let's just focus on two choices: the Standard translation
table, and the translation table for Vertebrate Mitochondrial DNA.

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

Alternatively, these tables are labeled with ID numbers 1 and 2,
respectively:

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You can compare the actual tables visually by printing them:

>>> print standard_table
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
>>> print mito_table
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
You may find the following properties useful -- for example if you
are trying to do your own gene finding:

>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
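As a sketch of how these lists might feed into gene finding, here is a
toy complete-CDS check in plain Python (looks_like_orf is our own
illustrative helper; the codon lists are copied from the output
above):

```python
# Start/stop codons for the vertebrate mitochondrial code, as printed above.
start_codons = ['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
stop_codons = ['TAA', 'TAG', 'AGA', 'AGG']

def looks_like_orf(dna):
    """True if dna is a whole number of codons, begins with a start
    codon, ends with a stop codon, and has no internal in-frame stop."""
    if len(dna) % 3 != 0 or len(dna) < 6:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return (codons[0] in start_codons
            and codons[-1] in stop_codons
            and not any(c in stop_codons for c in codons[1:-1]))

print(looks_like_orf("ATGACAACATAA"))  # True
```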
3.10 Comparing Seq objects
*=*=*=*=*=*=*=*=*=*=*=*=*=
Sequence comparison is actually a very complicated topic, and there is
no easy way to decide if two sequences are equal. The basic problem is
that the meaning of the letters in a sequence is context dependent -
the letter "A" could be part of a DNA, RNA or protein sequence.
Biopython uses alphabet objects as part of each 'Seq' object to try
and capture this information - so comparing two 'Seq' objects means
considering both the sequence strings and the alphabets.

For example, you might argue that the two DNA 'Seq' objects
Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna)
should be equal, even though they do have different alphabets.
Depending on the context this could be important.

This gets worse -- suppose you think Seq("ACGT",
IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic
alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein)
and Seq("ACGT") should also be equal. Now, in logic, if A=B and B=C,
by transitivity we expect A=C. So for logical consistency we'd require
Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to
be equal -- which most people would agree is just not right. This
transitivity problem would also have implications for using 'Seq'
objects as Python dictionary keys.

So, what does Biopython do? Well, the equality test is the default for
Python objects -- it tests to see if they are the same object in
memory. This is a very strict test:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq1 == seq2
False
>>> seq1 == seq1
True

If you actually want to do this, you can be more explicit by using the
Python 'id' function:

>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
True
Now, in everyday use, your sequences will probably all have the same
alphabet, or at least all be the same type of sequence (all DNA, all
RNA, or all protein). What you probably want is to just compare the
sequences as strings -- so do this explicitly:

>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True

As an extension to this, while you can use a Python dictionary with
'Seq' objects as keys, it is generally more useful to use the sequence
as a string for the key. See also Section 3.4.
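For instance, tallying duplicate sequences with their string form as
the dictionary key (a sketch; the literal strings below stand in for
str(my_seq) values):

```python
# Tally sequences using their string form as dictionary keys.
counts = {}
for seq_string in ["ACGT", "ACGT", "TTTT"]:  # stand-ins for str(my_seq)
    counts[seq_string] = counts.get(seq_string, 0) + 1
print(counts["ACGT"])  # 2
```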
3.11 MutableSeq objects
*=*=*=*=*=*=*=*=*=*=*=*
Just like the normal Python string, the 'Seq' object is "read only",
or in Python terminology, immutable. Apart from wanting the 'Seq'
object to act like a string, this is also a useful default since in
many biological applications you want to ensure you are not changing
your sequence data:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Observe what happens if you try to edit the sequence:

>>> my_seq[5] = "G"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'Seq' instance has no attribute '__setitem__'

However, you can convert it into a mutable sequence (a 'MutableSeq'
object) and do pretty much anything you want with it:

>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Alternatively, you can create a 'MutableSeq' object directly from a
string:

>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Either way will give you a sequence object which can be changed:

>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "T"
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCATGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())
Do note that unlike the 'Seq' object, the 'MutableSeq' object's
methods like 'reverse_complement()' and 'reverse()' act in place!

An important technical difference between mutable and immutable
objects in Python means that you can't use a 'MutableSeq' object as a
dictionary key, but you can use a Python string or a 'Seq' object in
this way.
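The Seq/MutableSeq split is loosely analogous to Python's own str
versus bytearray (an analogy for illustration, not how 'MutableSeq' is
actually implemented):

```python
# bytearray is mutable and unhashable, like MutableSeq; str is
# immutable and usable as a dictionary key, like Seq.
seq = bytearray(b"GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
seq[6] = ord("A")  # simulate a point mutation G -> A, edited in place
print(seq.decode())  # GCCATTATAATGGGCCGCTGAAAGGGTGCCCGA
try:
    {seq: "value"}
except TypeError:
    print("bytearray (like MutableSeq) cannot be a dictionary key")
```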
Once you have finished editing your 'MutableSeq' object, it's easy to
get back to a read-only 'Seq' object should you need to:

>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())

You can also get a string from a 'MutableSeq' object just like from a
'Seq' object (Section 3.4).
3.12 UnknownSeq objects
*=*=*=*=*=*=*=*=*=*=*=*
Biopython 1.50 introduced another basic sequence object, the
'UnknownSeq' object. This is a subclass of the basic 'Seq' object and
its purpose is to represent a sequence where we know the length, but
not the actual letters making it up. You could of course use a normal
'Seq' object in this situation, but it wastes rather a lot of memory
to hold a string of a million "N" characters when you could just store
a single letter "N" and the desired length as an integer.

>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print unk
????????????????????
>>> len(unk)
20
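The memory argument can be illustrated with plain Python (rough
numbers; exact sizes vary by Python version and build):

```python
import sys

# A million explicit "N" characters cost about a megabyte of memory,
# while "a length plus a character" is just two small objects.
explicit = "N" * 1000000
compact = (1000000, "N")
print(sys.getsizeof(explicit))  # roughly a million bytes
print(sys.getsizeof(compact))   # a few dozen bytes
```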
You can of course specify an alphabet, meaning for nucleotide
sequences the letter defaults to "N" and for proteins "X", rather
than just "?":

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print unk_dna
NNNNNNNNNNNNNNNNNNNN

You can use all the usual 'Seq' object methods too, note these give
back memory saving 'UnknownSeq' objects where appropriate as you might
expect:

>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print unk_protein
XXXXXX
>>> len(unk_protein)
6
You may be able to find a use for the 'UnknownSeq' object in your own
code, but it is more likely that you will first come across them in a
'SeqRecord' object created by 'Bio.SeqIO' (see Chapter 5). Some
sequence file formats don't always include the actual sequence, for
example GenBank and EMBL files may include a list of features but for
the sequence just present the contig information. Alternatively, the
QUAL files used in sequencing work hold quality scores but they never
contain a sequence -- instead there is a partner FASTA file which does
have the sequence.
3.13 Working with strings directly
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
To close this chapter, for those of you who really don't want to use
the sequence objects (or who prefer a functional programming style to
an object orientated one), there are module level functions in
'Bio.Seq' which will accept plain Python strings, 'Seq' objects
(including 'UnknownSeq' objects) or 'MutableSeq' objects:

>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'

You are, however, encouraged to work with 'Seq' objects by default.
-----------------------------------
(1) http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Chapter 4 Sequence Record objects
*********************************
Chapter 3 introduced the sequence classes. Immediately "above" the
'Seq' class is the Sequence Record or 'SeqRecord' class, defined in
the 'Bio.SeqRecord' module. This class allows higher level features
such as identifiers and features to be associated with the sequence,
and is used throughout the sequence input/output interface 'Bio.SeqIO'
described fully in Chapter 5.

If you are only going to be working with simple data like FASTA files,
you can probably skip this chapter for now. If on the other hand you
are going to be using richly annotated sequence data, say from GenBank
or EMBL files, this information is quite important.

While this chapter should cover most things to do with the 'SeqRecord'
object, you may also want to read the 'SeqRecord' wiki page
(http://biopython.org/wiki/SeqRecord), and the built in documentation
(also online (1)):

>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...
4.1 The SeqRecord object
*=*=*=*=*=*=*=*=*=*=*=*=
The 'SeqRecord' (Sequence Record) class is defined in the
'Bio.SeqRecord' module. This class allows higher level features such
as identifiers and features to be associated with a sequence (see
Chapter 3), and is the basic data type for the 'Bio.SeqIO' sequence
input/output interface (see Chapter 5).

The 'SeqRecord' class itself is quite simple, and offers the following
information as attributes:

seq -- The sequence itself, typically a 'Seq' object.

id -- The primary ID used to identify the sequence -- a string. In
most cases this is something like an accession number.

name -- A "common" name/id for the sequence -- a string. In some cases
this will be the same as the accession number, but it could also be a
clone name. I think of this as being analogous to the LOCUS id in a
GenBank record.

description -- A human readable description or expressive name for the
sequence -- a string.

letter_annotations -- Holds per-letter-annotations using a
(restricted) dictionary of additional information about the letters
in the sequence. The keys are the name of the information, and the
information is contained in the value as a Python sequence (i.e. a
list, tuple or string) with the same length as the sequence itself.
This is often used for quality scores (e.g. Section 14.1.3) or
secondary structure information (e.g. from Stockholm/PFAM alignment
files).

annotations -- A dictionary of additional information about the
sequence. The keys are the name of the information, and the
information is contained in the value. This allows the addition of
more "unstructured" information to the sequence.

features -- A list of 'SeqFeature' objects with more structured
information about the features on a sequence (e.g. position of genes
on a genome, or domains on a protein sequence). The structure of
sequence features is described below in Section 4.3.

dbxrefs -- A list of database cross-references as strings.
1790
4.2 Creating a SeqRecord
*=*=*=*=*=*=*=*=*=*=*=*=*

Using a 'SeqRecord' object is not very complicated, since all of the
information is presented as attributes of the class. Usually you won't
create a 'SeqRecord' "by hand", but instead use 'Bio.SeqIO' to read in a
sequence file for you (see Chapter 5 and the examples below). However,
creating a 'SeqRecord' can be quite simple.

4.2.1 SeqRecord objects from scratch
=====================================

To create a 'SeqRecord' at a minimum you just need a 'Seq' object:
<<>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
1811
Additionally, you can also pass the id, name and description to the
initialization function, but if not they will be set as strings
indicating they are unknown, and can be modified subsequently:
<<>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print simple_seq_r.description
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
1825
Including an identifier is very important if you want to output your
'SeqRecord' to a file. You would normally include this when creating the
object:
<<>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
1834
As mentioned above, the 'SeqRecord' has a dictionary attribute
'annotations'. This is used for any miscellaneous annotations that
don't fit under one of the other more specific attributes. Adding
annotations is easy, and just involves dealing directly with the
annotation dictionary:
<<>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
>>> print simple_seq_r.annotations["evidence"]
None. I just made it up.
1846
Working with per-letter-annotations is similar; 'letter_annotations'
is a dictionary like attribute which will let you assign any Python
sequence (i.e. a string, list or tuple) which has the same length as the
sequence itself:
<<>>> simple_seq_r.letter_annotations["phred_quality"] = [40,40,38,30]
>>> print simple_seq_r.letter_annotations
{'phred_quality': [40, 40, 38, 30]}
>>> print simple_seq_r.letter_annotations["phred_quality"]
[40, 40, 38, 30]

The 'dbxrefs' and 'features' attributes are just Python lists, and
should be used to store strings and 'SeqFeature' objects (discussed
later in this chapter) respectively.
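The length restriction enforced by 'letter_annotations' can be sketched
in plain Python. This is just an illustration of the idea (the helper
name is made up); it is not Biopython's actual implementation:

```python
def set_letter_annotation(sequence, letter_annotations, key, value):
    # A per-letter annotation must be a Python sequence (list, tuple
    # or string) of exactly the same length as the sequence itself.
    if len(value) != len(sequence):
        raise ValueError("Annotation length %i does not match sequence "
                         "length %i" % (len(value), len(sequence)))
    letter_annotations[key] = value

annotations = {}
set_letter_annotation("GATC", annotations, "phred_quality", [40, 40, 38, 30])
```

Trying to assign a value of the wrong length (say a three-entry quality
list for a four-letter sequence) raises a 'ValueError', which is
essentially the behaviour the restricted dictionary gives you.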
1862
4.2.2 SeqRecord objects from FASTA files
=========================================

This example uses a fairly large FASTA file containing the whole
sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
originally downloaded from the NCBI. This file is included with the
Biopython unit tests under the GenBank folder, or online as
NC_005816.fna (2) from our website.
The file starts like this -- and you can check there is only one record
present (i.e. only one line starting with a greater than symbol):
<<>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ...
pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
1878
Back in Chapter 2 you will have seen the function
'Bio.SeqIO.parse(...)' used to loop over all the records in a file as
'SeqRecord' objects. The 'Bio.SeqIO' module has a sister function for
use on files which contain just one record which we'll use here (see
Chapter 5 for details):
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.fna"), "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|',
name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar
Microtus ... sequence',
dbxrefs=[])
1895
Now, let's have a look at the key attributes of this 'SeqRecord'
individually -- starting with the 'seq' attribute which gives you a
'Seq' object:
<<>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet())

Here 'Bio.SeqIO' has defaulted to a generic alphabet, rather than
guessing that this is DNA. If you know in advance what kind of sequence
your FASTA file contains, you can tell 'Bio.SeqIO' which alphabet to use
(see Chapter 5).
1907
Next, the identifiers and description:
<<>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ...
pPCP1, complete sequence'

As you can see above, the first word of the FASTA record's title line
(after removing the greater than symbol) is used for both the 'id' and
'name' attributes. The whole title line (after removing the greater than
symbol) is used for the record description. This is deliberate, partly
for backwards compatibility reasons, but it also makes sense if you have
a FASTA file like this:
<<>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
1928
Note that none of the other annotation attributes get populated when
reading a FASTA file:
<<>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]

In this case our example FASTA file was from the NCBI, and they have a
fairly well defined set of conventions for formatting their FASTA lines.
This means it would be possible to parse this information and extract
the GI number and accession for example. However, FASTA files from other
sources vary, so this isn't possible in general.
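For instance, the pipe-delimited identifier shown above follows the NCBI
convention, and could be picked apart with plain string handling. This
is just a sketch of that convention (the helper name is made up, and
Biopython does not do this parsing for you):

```python
def parse_ncbi_fasta_id(first_word):
    # NCBI FASTA title lines start with pipe-separated fields, e.g.
    # "gi|45478711|ref|NC_005816.1|" holds a GI number and an
    # accession.version identifier.
    fields = first_word.strip("|").split("|")
    if fields[0] != "gi":
        raise ValueError("Not a GI-style NCBI identifier")
    gi_number = fields[1]
    accession = fields[3]
    return gi_number, accession
```

As the text says, this only works for sources that follow the NCBI
convention; a FASTA file from elsewhere may use any format at all for
its title lines.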
1947
4.2.3 SeqRecord objects from GenBank files
===========================================

As in the previous example, we're going to look at the whole sequence
for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally
downloaded from the NCBI, but this time as a GenBank file. Again, this
file is included with the Biopython unit tests under the GenBank folder,
or online as NC_005816.gb (3) from our website.
This file contains a single record (i.e. only one LOCUS line) and
starts:
<<LOCUS       NC_005816     9609 bp    DNA     circular BCT
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
            complete sequence.
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...
1968
Again, we'll use 'Bio.SeqIO' to read this file in, and the code is
almost identical to that used above for the FASTA file (see
Chapter 5 for details):
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"), "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=['Project:10638'])
1982
You should be able to spot some differences already! But taking the
attributes individually, the sequence string is the same as before, but
this time 'Bio.SeqIO' has been able to automatically assign a more
specific alphabet (see Chapter 5 for details):
<<>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA())
1991
The 'name' comes from the LOCUS line, while the 'id' includes the
version suffix. The description comes from the DEFINITION line:
<<>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.'
2002
GenBank files don't have any per-letter annotations:
<<>>> record.letter_annotations
{}

Most of the annotations information gets recorded in the 'annotations'
dictionary, for example:
<<>>> len(record.annotations)
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

The 'dbxrefs' list gets populated from any PROJECT or DBLINK lines:
<<>>> record.dbxrefs
['Project:10638']

Finally, and perhaps most interestingly, all the entries in the
features table (e.g. the genes or CDS features) get recorded as
'SeqFeature' objects in the 'features' list.
<<>>> len(record.features)

We'll talk about 'SeqFeature' objects next, in Section 4.3.
2030
4.3 SeqFeature objects
*=*=*=*=*=*=*=*=*=*=*=*

Sequence features are an essential part of describing a sequence. Once
you get beyond the sequence itself, you need some way to organize and
easily get at the more "abstract" information that is known about the
sequence. While it is probably impossible to develop a general sequence
feature class that will cover everything, the Biopython 'SeqFeature'
class attempts to encapsulate as much of the information about the
sequence as possible. The design is heavily based on the GenBank/EMBL
feature tables, so if you understand how they look, you'll probably have
an easier time grasping the structure of the Biopython classes.

4.3.1 SeqFeatures themselves
=============================

The first level of dealing with sequence features is the 'SeqFeature'
class itself. This class has a number of attributes, so first we'll list
them and their general features, and then work through an example to
show how this applies to a real life example, a GenBank feature table.
The attributes of a 'SeqFeature' are:
2052
The attributes of a SeqFeature are:
2055
location -- The location of the 'SeqFeature' on the sequence that you
2056
are dealing with. The locations end-points may be fuzzy --
2057
section 4.3.2 has a lot more description on how to deal with
2060
type -- This is a textual description of the type of feature (for
2061
instance, this will be something like `CDS' or `gene').
2063
ref -- A reference to a different sequence. Some times features may be
2064
"on" a particular sequence, but may need to refer to a different
2065
sequence, and this provides the reference (normally an accession
2066
number). A good example of this is a genomic sequence that has most
2067
of a coding sequence, but one of the exons is on a different
2068
accession. In this case, the feature would need to refer to this
2069
different accession for this missing exon. You are most likely to see
2070
this in contig GenBank files.
2072
ref_db -- This works along with 'ref' to provide a cross sequence
2073
reference. If there is a reference, 'ref_db' will be set as None if
2074
the reference is in the same database, and will be set to the name of
2075
the database otherwise.
2077
strand -- The strand on the sequence that the feature is located on.
2078
This may either be 1 for the top strand, -1 for the bottom strand, or
2079
0 or None for both strands (or if it doesn't matter). Keep in mind
2080
that this only really makes sense for double stranded DNA, and not
2081
for proteins or RNA.
2083
qualifiers -- This is a Python dictionary of additional information
2084
about the feature. The key is some kind of terse one-word description
2085
of what the information contained in the value is about, and the
2086
value is the actual information. For example, a common key for a
2087
qualifier might be "evidence" and the value might be "computational
2088
(non-experimental)." This is just a way to let the person who is
2089
looking at the feature know that it has not be experimentally
2090
(i. e. in a wet lab) confirmed. Note that other the value will be a
2091
list of strings (even when there is only one string). This is a
2092
reflection of the feature tables in GenBank/EMBL files.
2094
sub_features -- A very important feature of a feature is that it can
2095
have additional 'sub_features' underneath it. This allows nesting of
2096
features, and helps us to deal with things such as the GenBank/EMBL
2097
feature lines in a (we hope) intuitive way.
2099
To show an example of SeqFeatures in action, let's take a look at the
following feature from a GenBank feature table:
<<     mRNA            complement(join(<49223..49300,49780..>50208))
                     /gene="F28B23.12"

To look at the easiest attributes of the 'SeqFeature' first, if you
got a 'SeqFeature' object for this it would have a 'type' of 'mRNA', a
'strand' of -1 (due to the `complement'), and would have None for the
'ref' and 'ref_db' since there are no references to external databases.
The 'qualifiers' for this SeqFeature would be a Python dictionary that
looked like '{'gene' : ['F28B23.12']}'.
2111
Now let's look at the more tricky part, how the `join' in the location
line is handled. First, the location for the top level 'SeqFeature' (the
one we are dealing with right now) is set as going from `<49223' to
`>50208' (see section 4.3.2 for the nitty gritty on how fuzzy locations
like this are handled). So the location of the top level object is the
entire span of the feature. So, how do you get at the information in the
`join'? Well, that's where the 'sub_features' come in.
The 'sub_features' attribute will have a list with two 'SeqFeature'
objects in it, and these contain the information in the join. Let's look
at 'top_level_feature.sub_features[0]' (the first 'sub_feature'). This
object is a 'SeqFeature' object with a 'type' of 'mRNA', a 'strand' of
-1 (inherited from the parent 'SeqFeature') and a location going from
`<49223' to `49300'.
So, the 'sub_features' allow you to get at the internal information if
you want it (i.e. if you were trying to get only the exons out of a
genomic sequence), or just to deal with the broad picture (i.e. you
just want to know that the coding sequence for a gene lies in a region).
Hopefully this structuring makes it easy and intuitive to get at the
sometimes complex information that can be contained in a 'SeqFeature'.

4.3.2 Positions and locations
==============================
In the section on SeqFeatures above, we skipped over one of the more
difficult parts of features, dealing with the locations. The reason this
can be difficult is because of fuzziness of the positions in locations.
Before we get into all of this, let's just define the vocabulary we'll
use to talk about this. Basically there are two terms we'll use:

position -- This refers to a single position on a sequence, which may
be fuzzy or not. For instance, 5, 20, '<100' and '3^5' are all
positions.

location -- A location is two positions that defines a region of a
sequence. For instance 5..20 (i.e. 5 to 20) is a location.

I just mention this because sometimes I get confused between the two.
The complication in dealing with locations comes in the positions
themselves. In biology many times things aren't entirely certain (as
much as us wet lab biologists try to make them certain!). For instance,
you might do a dinucleotide priming experiment and discover that the
start of mRNA transcript starts at one of two sites. This is very useful
information, but the complication comes in how to represent this as a
position. To help us deal with this, we have the concept of fuzzy
positions. Basically there are five types of fuzzy positions, so we have
five classes to deal with them:
2161
ExactPosition -- As its name suggests, this class represents a
position which is specified as exact along the sequence. This is
represented as just a number, and you can get the position by
looking at the 'position' attribute of the object.

BeforePosition -- This class represents a fuzzy position that occurs
prior to some specified site. In GenBank/EMBL notation, this is
represented as something like `<13', signifying that the real
position is located somewhere less than 13. To get the specified
upper boundary, look at the 'position' attribute of the object.

AfterPosition -- Contrary to 'BeforePosition', this class represents a
position that occurs after some specified site. This is represented
in GenBank as `>13', and like 'BeforePosition', you get the
boundary number by looking at the 'position' attribute of the object.

WithinPosition -- This class models a position which occurs somewhere
between two specified nucleotides. In GenBank/EMBL notation, this
would be represented as `(1.5)', to represent that the position is
somewhere within the range 1 to 5. To get the information in this
class you have to look at two attributes. The 'position' attribute
specifies the lower boundary of the range we are looking at, so in
our example case this would be one. The 'extension' attribute
specifies the range to the higher boundary, so in this case it would
be 4. So 'object.position' is the lower boundary and 'object.position
+ object.extension' is the upper boundary.

BetweenPosition -- This class deals with a position that occurs
between two coordinates. For instance, you might have a protein
binding site that occurs between two nucleotides on a sequence. This
is represented as `2^3', which indicates that the real position
happens between position 2 and 3. Getting this information from the
object is very similar to 'WithinPosition': the 'position' attribute
specifies the lower boundary (2, in this case) and the 'extension'
indicates the range to the higher boundary (1 in this case).
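The position/extension arithmetic shared by 'WithinPosition' and
'BetweenPosition' can be sketched in plain Python. This toy class is
purely illustrative; it is not how Biopython defines these classes:

```python
class FuzzyRange(object):
    """Toy stand-in for WithinPosition/BetweenPosition.

    Stores the lower boundary ('position') plus the offset up to the
    higher boundary ('extension'), as described in the text above.
    """

    def __init__(self, position, extension):
        self.position = position    # lower boundary
        self.extension = extension  # range up to the higher boundary

    @property
    def upper(self):
        # object.position + object.extension is the upper boundary
        return self.position + self.extension

# (1.5) in GenBank notation: somewhere within the range 1 to 5
within = FuzzyRange(1, 4)
# 2^3 in GenBank notation: between positions 2 and 3
between = FuzzyRange(2, 1)
```

Here 'within.upper' is 5 and 'between.upper' is 3, matching the
boundaries given in the two descriptions above.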
2197
Now that we've got all of the types of fuzzy positions we can have
taken care of, we are ready to actually specify a location on a
sequence. This is handled by the 'FeatureLocation' class. An object of
this type basically just holds the potentially fuzzy start and end
positions of a feature. You can create a 'FeatureLocation' object by
creating the positions and passing them in:
<<>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(8, 1)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
2209
If you print out a 'FeatureLocation' object, you can get a nice
representation of the information:
<<>>> print my_location
[>5:(8^9)]

We can access the fuzzy start and end positions using the start and
end attributes of the location:
<<>>> my_location.start
Bio.SeqFeature.AfterPosition(5)
>>> print my_location.start
>5
>>> my_location.end
Bio.SeqFeature.BetweenPosition(8,1)
>>> print my_location.end
(8^9)

If you don't want to deal with fuzzy positions and just want numbers,
you just need to ask for the 'nofuzzy_start' and 'nofuzzy_end'
attributes of the location:
<<>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
8
2236
Notice that this just gives you back the position attributes of the
fuzzy positions.
Similarly, to make it easy to create a position without worrying about
fuzzy positions, you can just pass in numbers to the 'FeatureLocation'
constructor, and you'll get back out 'ExactPosition' objects:
<<>>> exact_location = SeqFeature.FeatureLocation(5, 8)
>>> print exact_location
[5:8]
>>> exact_location.start
Bio.SeqFeature.ExactPosition(5)
>>> exact_location.nofuzzy_start
5

That is all of the nitty gritty about dealing with fuzzy positions in
Biopython. It has been designed so that dealing with fuzziness is not
that much more complicated than dealing with exact positions, and
hopefully you find that true!

4.4 References
*=*=*=*=*=*=*=*
Another common annotation related to a sequence is a reference to a
journal or other published work dealing with the sequence. We have a
fairly simple way of representing a Reference in Biopython -- we have a
'Bio.SeqFeature.Reference' class that stores the relevant information
about a reference as attributes of an object.
The attributes include things that you would expect to see in a
reference like 'journal', 'title' and 'authors'. Additionally, it also
can hold the 'medline_id' and 'pubmed_id' and a 'comment' about the
reference. These are all accessed simply as attributes of the object.
A reference also has a 'location' object so that it can specify a
particular location on the sequence that the reference refers to. For
instance, you might have a journal article dealing with a particular
gene located on a BAC, and want to specify that it only refers to this
position exactly. The 'location' is a potentially fuzzy location, as
described in section 4.3.2.
Any reference objects are stored as a list in the 'SeqRecord' object's
'annotations' dictionary under the key "references". That's all there is
to it. References are meant to be easy to deal with, and hopefully
general enough to cover lots of usage cases.
2281
4.5 The format method
*=*=*=*=*=*=*=*=*=*=*=

Biopython 1.48 added a new 'format()' method to the 'SeqRecord' class
which gives a string containing your record formatted using one of the
output file formats supported by 'Bio.SeqIO', such as FASTA:
<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                      +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                      +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")

print record.format("fasta")

This should give:
<<>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC

This 'format' method takes a single mandatory argument, a lower case
string which is supported by 'Bio.SeqIO' as an output format (see
Chapter 5). However, some of the file formats 'Bio.SeqIO' can write to
require more than one record (typically the case for multiple sequence
alignment formats), and thus won't work via this 'format()' method.
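The FASTA output above can be approximated in plain Python. This is just
a sketch of the line-wrapping convention (the function name is made up),
not how 'Bio.SeqIO' actually implements its FASTA writer:

```python
def to_fasta(identifier, description, sequence, width=60):
    # Title line: ">" + id + space + description, then the sequence
    # wrapped at a fixed line width (60 is the conventional choice,
    # matching the output shown above).
    lines = [">%s %s" % (identifier, description)]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"
```

For example, 'to_fasta("AC12345", "made up sequence", "GATC" * 20)'
gives a title line followed by a 60-letter line and a 20-letter line.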
2322
4.6 Slicing a SeqRecord
*=*=*=*=*=*=*=*=*=*=*=*=

One of the new features in Biopython 1.50 was the ability to slice a
'SeqRecord', to give you a new 'SeqRecord' covering just part of the
sequence. What is important here is that any per-letter annotations are
also sliced, and any features which fall completely within the new
sequence are preserved (with their locations adjusted).
For example, taking the same GenBank file used earlier:
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"), "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=['Project:10638'])
>>> len(record)
9609
>>> len(record.features)

For this example we're going to focus in on the 'pim' gene,
'YP_pPCP05'. If you have a look at the GenBank file directly you'll find
this gene/CDS has location string 4343..4780, or in Python counting
4342:4780. From looking at the file you can work out that these are the
twelfth and thirteenth entries in the file, so in Python zero-based
counting they are entries 11 and 12 in the features list:
2353
<<>>> print record.features[11]
type: gene
location: [4342:4780]
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']

>>> print record.features[12]
type: CDS
location: [4342:4780]
strand: 1
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value:
    ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
2382
Let's slice this parent record from 4300 to 4800 (enough to include
the 'pim' gene/CDS), and see how many features we get:
<<>>> sub_record = record[4300:4800]
>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGG
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=[])
>>> len(sub_record)
500
>>> len(sub_record.features)
2
2398
Our sub-record just has two features, the gene and CDS entries for
'YP_pPCP05':
<<>>> print sub_record.features[0]
type: gene
location: [42:480]
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']

>>> print sub_record.features[1]
type: CDS
location: [42:480]
strand: 1
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value:
    ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
2429
Notice that their locations have been adjusted to reflect the new
parent sequence!
While Biopython has done something sensible and hopefully intuitive
with the features (and any per-letter annotation), for the other
annotation it is impossible to know if this still applies to the
sub-sequence or not. To avoid guessing, the annotations and dbxrefs are
omitted from the sub-record, and it is up to you to transfer any
relevant information as appropriate.
<<>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]

The same point could be made about the record id, name and
description, but for practicality these are preserved:
<<>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.'
2454
This illustrates the problem nicely though: our new sub-record is not
the complete sequence of the plasmid, so the description is wrong! Let's
fix this and then view the sub-record as a reduced GenBank file using
the format method described above in Section 4.5:
<<>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print sub_record.format("genbank")
...

See Sections 14.1.4 and 14.1.5 for some FASTQ examples where the
per-letter annotations (the read quality scores) are also sliced.
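The feature-slicing rule described above (keep only features that fall
completely within the slice, shifting their coordinates) can be sketched
in plain Python. The tuples here are toy stand-ins for 'SeqFeature'
objects, and the function name is made up; this is not Biopython's
implementation:

```python
def slice_features(features, start, end):
    # Keep only features that fall completely inside [start:end),
    # shifting their coordinates onto the new parent sequence.
    # Each feature is just a (feat_start, feat_end, feat_type) tuple.
    kept = []
    for feat_start, feat_end, feat_type in features:
        if start <= feat_start and feat_end <= end:
            kept.append((feat_start - start, feat_end - start, feat_type))
    return kept

features = [(100, 400, "gene"), (4342, 4780, "gene"), (4342, 4780, "CDS")]
sub_features = slice_features(features, 4300, 4800)
```

Slicing from 4300 to 4800 keeps just the two 'pim' entries, moved to
[42:480], matching the adjusted locations shown in the output above.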
2466
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.ht
(2) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna
(3) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb
2477
Chapter 5 Sequence Input/Output
*******************************

In this chapter we'll discuss in more detail the 'Bio.SeqIO' module,
which was briefly introduced in Chapter 2 and also used in Chapter 4.
This aims to provide a simple interface for working with assorted
sequence file formats in a uniform way. See also the 'Bio.SeqIO' wiki
page (http://biopython.org/wiki/SeqIO), and the built in documentation:
<<>>> from Bio import SeqIO
>>> help(SeqIO)
...

The "catch" is that you have to work with 'SeqRecord' objects (see
Chapter 4), which contain a 'Seq' object (see Chapter 3) plus annotation
like an identifier and description.
2496
5.1 Parsing or Reading Sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

The workhorse function 'Bio.SeqIO.parse()' is used to read in sequence
data as 'SeqRecord' objects. This function expects two arguments:

1. The first argument is a handle to read the data from. A handle is
typically a file opened for reading, but could be the output from a
command line program, or data downloaded from the internet (see
Section 5.2). See Section 18.1 for more about handles.

2. The second argument is a lower case string specifying sequence
format -- we don't try and guess the file format for you! See
http://biopython.org/wiki/SeqIO for a full listing of supported
formats.

As of Biopython 1.49 there is an optional argument 'alphabet' to
specify the alphabet to be used. This is useful for file formats like
FASTA where otherwise 'Bio.SeqIO' will default to a generic alphabet.
The 'Bio.SeqIO.parse()' function returns an iterator which gives
'SeqRecord' objects. Iterators are typically used in a for loop as shown
below.
Sometimes you'll find yourself dealing with files which contain only a
single record. For this situation Biopython 1.45 introduced the function
'Bio.SeqIO.read()' which takes the same arguments. Provided there is one
and only one record in the file, this is returned as a 'SeqRecord'
object. Otherwise an exception is raised.
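The contract of 'Bio.SeqIO.read()' (exactly one record, otherwise an
error) can be sketched in plain Python over any iterator. This is
illustrative only (the function name is made up); the real function
also handles the file parsing itself:

```python
def read_one(records):
    # Return the single record from an iterator of records, raising
    # an error if there are zero records or more than one -- the same
    # contract the text describes for Bio.SeqIO.read().
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(iterator)
    except StopIteration:
        return first  # exactly one record, as required
    raise ValueError("More than one record found in handle")
```

This is why 'read()' suits single-record files like our NC_005816
examples, while multi-record files such as ls_orchid.gbk need 'parse()'.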
2526
5.1.1 Reading Sequence Files
=============================

In general 'Bio.SeqIO.parse()' is used to read in sequence files as
'SeqRecord' objects, and is typically used with a for loop like this:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

The above example is repeated from the introduction in Section 2.4,
and will load the orchid DNA sequences in the FASTA format file
ls_orchid.fasta (2). If instead you wanted to load a GenBank format file
like ls_orchid.gbk (3) then all you need to do is change the filename
and the format string:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
    print seq_record.id
    print seq_record.seq
    print len(seq_record)
handle.close()
2554
Similarly, if you wanted to read in a file in another file format,
2555
then assuming 'Bio.SeqIO.parse()' supports it you would just need to
2556
change the format string as appropriate, for example "swiss" for
2557
SwissProt files or "embl" for EMBL text files. There is a full listing
2558
on the wiki page (http://biopython.org/wiki/SeqIO) and in the built in
2559
documentation (also online (4)).
2560
Another very common way to use a Python iterator is within a list
comprehension (or a generator expression). For example, if all you
wanted to extract from the file was a list of the record identifiers we
can easily do this with the following list comprehension:
<<>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse(handle, "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1',
'Z78527.1', ..., 'Z78439.1']
There are more examples using 'SeqIO.parse()' in a list comprehension
like this in Section 14.2 (e.g. for plotting sequence lengths or GC%).
5.1.2 Iterating over the records in a sequence file
====================================================
In the above examples, we have usually used a for loop to iterate over
all the records one by one. You can use the for loop with all sorts of
Python objects (including lists, tuples and strings) which support the
iteration interface.
The object returned by 'Bio.SeqIO' is actually an iterator which returns
'SeqRecord' objects. You get to see each record in turn, but once and
only once. The plus point is that an iterator can save you memory when
dealing with large files.
Instead of using a for loop, you can also use the '.next()' method of an
iterator to step through the entries, like this:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")

first_record = record_iterator.next()
print first_record.id
print first_record.description

second_record = record_iterator.next()
print second_record.id
print second_record.description
Note that if you try and use '.next()' and there are no more results,
you'll either get back the special Python object 'None' or a
'StopIteration' exception.
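This behaviour is just standard Python iteration, which you can see
without Biopython at all. A minimal sketch (the list of strings here is
a stand-in for 'SeqRecord' objects; the built in 'next()' function used
below is the modern equivalent of calling the iterator's '.next()'
method):

```python
# Plain Python illustration of the iterator behaviour described above.
record_ids = ["Z78533.1", "Z78532.1"]  # stand-ins for SeqRecord objects
record_iterator = iter(record_ids)

first_record = next(record_iterator)   # "Z78533.1"
second_record = next(record_iterator)  # "Z78532.1"

# Stepping past the last entry raises StopIteration:
try:
    next(record_iterator)
except StopIteration:
    print("No more records")
```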
One special case to consider is when your sequence files have multiple
records, but you only want the first one. In this situation the
following code is very concise:
<<from Bio import SeqIO
first_record = SeqIO.parse(open("ls_orchid.gbk"), "genbank").next()

A word of warning here -- using the '.next()' method like this will
silently ignore any additional records in the file. If your files have
one and only one record, like some of the online examples later in this
chapter, or a GenBank file for a single chromosome, then use the new
'Bio.SeqIO.read()' function instead. This will check there are no extra
unexpected records present.
5.1.3 Getting a list of the records in a sequence file
=======================================================
In the previous section we talked about the fact that
'Bio.SeqIO.parse()' gives you a 'SeqRecord' iterator, and that you get
the records one by one. Very often you need to be able to access the
records in any order. The Python 'list' data type is perfect for this,
and we can turn the record iterator into a list of 'SeqRecord' objects
using the built-in Python function 'list()' like so:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()

print "Found %i records" % len(records)

print "The last record"
last_record = records[-1] #using Python's list tricks
print last_record.id
print repr(last_record.seq)
print len(last_record)

print "The first record"
first_record = records[0] #remember, Python counts from zero
print first_record.id
print repr(first_record.seq)
print len(first_record)
<<Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
IUPACAmbiguousDNA())
...
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
You can of course still use a for loop with a list of 'SeqRecord'
objects. Using a list is much more flexible than an iterator (for
example, you can determine the number of records from the length of the
list), but does need more memory because it will hold all the records in
memory at once.
5.1.4 Extracting data
======================
The 'SeqRecord' object and its annotation structures are described more
fully in Chapter 4. As an example of how annotations are stored, we'll
look at the output from parsing the first record in the GenBank file
ls_orchid.gbk (5).
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
first_record = record_iterator.next()
print first_record
That should give something like this:
<<Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ...]
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', ...]
/accessions=['Z78533']
/data_file_division=PLN
/organism=Cypripedium irapeanum
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
This gives a human readable summary of most of the annotation data for
the 'SeqRecord'. For this example we're going to use the '.annotations'
attribute which is just a Python dictionary. The contents of this
annotations dictionary were shown when we printed the record above. You
can also print them out directly:
<<print first_record.annotations

Like any Python dictionary, you can easily get a list of the keys:
<<print first_record.annotations.keys()

And the values:
<<print first_record.annotations.values()
In general, the annotation values are strings, or lists of strings. One
special case is any references in the file get stored as reference
objects.
Suppose you wanted to extract a list of the species from the
ls_orchid.gbk (6) GenBank file. The information we want, Cypripedium
irapeanum, is held in the annotations dictionary under `source' and
`organism', which we can access like this:
<<>>> print first_record.annotations["source"]
Cypripedium irapeanum

<<>>> print first_record.annotations["organism"]
Cypripedium irapeanum

In general, `organism' is used for the scientific name (in Latin, e.g.
Arabidopsis thaliana), while `source' will often be the common name
(e.g. thale cress). In this example, as is often the case, the two
fields are identical.
Now let's go through all the records, building up a list of the species
each orchid sequence is from:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
all_species = []
for seq_record in SeqIO.parse(handle, "genbank") :
    all_species.append(seq_record.annotations["organism"])
print all_species
Another way of writing this code is to use a list comprehension:
<<from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in \
               SeqIO.parse(open("ls_orchid.gbk"), "genbank")]
print all_species

In either case, the result is:
<<['Cypripedium irapeanum', 'Cypripedium californicum', ...,
'Paphiopedilum barbatum']
Great. That was pretty easy because GenBank files are annotated in a
standardised way.
Now, let's suppose you wanted to extract a list of the species from a
FASTA file, rather than the GenBank file. The bad news is you will have
to write some code to extract the data you want from the record's
description line - if the information is in the file in the first place!
Our example FASTA format file ls_orchid.fasta (7) starts like this:
<<>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
You can check by hand, but for every record the species name is in the
description line as the second word. This means if we break up each
record's '.description' at the spaces, then the species is there as
field number one (field zero is the record identifier). That means we
can do the following:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
all_species = []
for seq_record in SeqIO.parse(handle, "fasta") :
    all_species.append(seq_record.description.split()[1])
print all_species
<<['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ...]
The concise alternative using list comprehensions would be:
<<from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in \
               SeqIO.parse(open("ls_orchid.fasta"), "fasta")]
In general, extracting information from the FASTA description line is
not very nice. If you can get your sequences in a well annotated file
format like GenBank or EMBL, then this sort of annotation information is
much easier to deal with.
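The string handling involved is plain Python. A minimal sketch of the
same idea without Biopython, using the FASTA title line shown above
(grab the second whitespace-separated word of each line starting with
">"):

```python
# Extract species names from FASTA title lines with plain Python.
fasta_text = """\
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
"""
all_species = [line.split()[1]
               for line in fasta_text.splitlines()
               if line.startswith(">")]
print(all_species)  # ['C.irapeanum']
```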
5.2 Parsing sequences from the net
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
In the previous section, we looked at parsing sequence data from a file
handle. We hinted that handles are not always from files, and in this
section we'll use handles to internet connections to download sequence
data.
Note that just because you can download sequence data and parse it into
a 'SeqRecord' object in one go doesn't mean this is a good idea. In
general, you should probably download sequences once and save them to a
file for reuse.
5.2.1 Parsing GenBank records from the net
===========================================
Section 8.6 talks about the Entrez EFetch interface in more detail, but
for now let's just connect to the NCBI and get a few Opuntia
(prickly-pear) sequences from GenBank using their GI numbers.
First of all, let's fetch just one record. If you don't care about the
annotations and features downloading a FASTA file is a good choice as
these are compact. Now remember, when you expect the handle to contain
one and only one record, use the 'Bio.SeqIO.read()' function:
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="6273291")
seq_record = SeqIO.read(handle, "fasta")
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))

The expected output of this example is:
<<gi|6273291|gb|AF191665.1|AF191665 with 0 features
The NCBI will also let you ask for the file in other formats, in
particular as a GenBank file. Until Easter 2009, the Entrez EFetch API
let you use "genbank" as the return type, however the NCBI now insist on
using the official return types of "gb" (or "gp" for proteins) as
described on EFetch for Sequence and other Molecular Biology
Databases (8). As a result, in Biopython 1.50 onwards, we support "gb"
as an alias for "genbank" in 'Bio.SeqIO'.
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="6273291")
seq_record = SeqIO.read(handle, "gb") #using "gb" as an alias for "genbank"
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))
The expected output of this example is:
<<AF191665.1 with 3 features

Notice this time we have three features.
Now let's fetch several records. This time the handle contains multiple
records, so we must use the 'Bio.SeqIO.parse()' function:
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="gb", \
                       id="6273291,6273290,6273289")
for seq_record in SeqIO.parse(handle, "gb") :
    print seq_record.id, seq_record.description[:50] + "..."
    print "Sequence length %i," % len(seq_record),
    print "%i features," % len(seq_record.features),
    print "from: %s" % seq_record.annotations["source"]
handle.close()
That should give the following output:
<<AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Opuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Opuntia bradtiana
See Chapter 8 for more about the 'Bio.Entrez' module, and make sure to
read about the NCBI guidelines for using Entrez (Section 8.1).
5.2.2 Parsing SwissProt sequences from the net
===============================================
Now let's use a handle to download a SwissProt file from ExPASy,
something covered in more depth in Chapter 9. As mentioned above, when
you expect the handle to contain one and only one record, use the
'Bio.SeqIO.read()' function:
<<from Bio import ExPASy
from Bio import SeqIO
handle = ExPASy.get_sprot_raw("O23729")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.name
print seq_record.description
print repr(seq_record.seq)
print "Length %i" % len(seq_record)
print seq_record.annotations["keywords"]
2914
Assuming your network connection is OK, you should get back:
2917
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName:
2918
Full=Naringenin-chalcone synthase 3;
2919
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE',
2922
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
5.3 Sequence files as Dictionaries
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
The next thing that we'll do with our ubiquitous orchid files is to show
how to index them and access them like a database using the Python
'dictionary' data type (like a hash in Perl). This is very useful for
moderately large files where you only need to access certain elements of
the file, and makes for a nice quick 'n dirty database.
You can use the function 'Bio.SeqIO.to_dict()' to make a SeqRecord
dictionary (in memory). By default this will use each record's
identifier (i.e. the '.id' attribute) as the key. Let's try this using
our GenBank file:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
Since this variable 'orchid_dict' is an ordinary Python dictionary, we
can look at all of the keys we have available:
<<>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1',
'Z78453.1', ..., 'Z78471.1']
We can access a single 'SeqRecord' object via the keys and manipulate
the object as normal:
<<>>> seq_record = orchid_dict["Z78475.1"]
>>> print seq_record.description
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print repr(seq_record.seq)
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT',
IUPACAmbiguousDNA())
So, it is very easy to create an in memory "database" of our GenBank
records. Next we'll try this for the FASTA file instead.
Note that those of you with prior Python experience should all be able
to construct a dictionary like this "by hand". However, typical
dictionary construction methods will not deal with the case of repeated
keys very nicely. Using the 'Bio.SeqIO.to_dict()' will explicitly check
for duplicate keys, and raise an exception if any are found.
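The duplicate-key check is easy to sketch in plain Python. Here
'records_to_dict' and the little dictionaries standing in for
'SeqRecord' objects are hypothetical, just an illustration of the idea,
not the Biopython implementation:

```python
def records_to_dict(records, key_function):
    """Build a dict of records, raising ValueError on duplicate keys."""
    d = {}
    for rec in records:
        key = key_function(rec)
        if key in d:
            raise ValueError("Duplicate key %r" % key)
        d[key] = rec
    return d

records = [{"id": "Z78533.1"}, {"id": "Z78532.1"}]
d = records_to_dict(records, key_function=lambda rec: rec["id"])
print(sorted(d))  # ['Z78532.1', 'Z78533.1']
```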
5.3.1 Specifying the dictionary keys
=====================================
Using the same code as above, but for the FASTA file instead:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
print orchid_dict.keys()
This time the keys are:
<<['gi|2765596|emb|Z78471.1|PDZ78471',
'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488',
'gi|2765583|emb|Z78458.1|PHZ78458']
You should recognise these strings from when we parsed the FASTA file
earlier in Section 2.4.1. Suppose you would rather have something else
as the keys - like the accession numbers. This brings us nicely to
'SeqIO.to_dict()''s optional argument 'key_function', which lets you
define what to use as the dictionary key for your records.
First you must write your own function to return the key you want (as a
string) when given a 'SeqRecord' object. In general, the details of this
function will depend on the sort of input records you are dealing with.
But for our orchids, we can just split up the record's identifier using
the "pipe" character (the vertical line) and return the fourth entry
(field three):
<<def get_accession(record) :
    """Given a SeqRecord, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = record.id.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]
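You can check the splitting logic with plain Python, using the
identifier shown in the docstring:

```python
# Check the pipe-splitting logic on one of the identifiers shown above.
record_id = "gi|2765613|emb|Z78488.1|PTZ78488"
parts = record_id.split("|")
assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
print(parts[3])  # Z78488.1
```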
Then we can give this function to the 'SeqIO.to_dict()' function to use
in building the dictionary:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"),
                            key_function=get_accession)
handle.close()
print orchid_dict.keys()
Finally, as desired, the new dictionary keys:
<<>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1',
'Z78453.1', ..., 'Z78471.1']

Not too complicated, I hope!
5.3.2 Indexing a dictionary using the SEGUID checksum
======================================================
To give another example of working with dictionaries of 'SeqRecord'
objects, we'll use the SEGUID checksum function. This is a relatively
recent checksum, and collisions should be very rare (i.e. two different
sequences giving the same checksum), an improvement on the older CRC64
checksum. Once again, working with the orchids GenBank file:
<<from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse(open("ls_orchid.gbk"), "genbank") :
    print record.id, seguid(record.seq)
This should give:
<<Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo
Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY
...
Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ
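For reference, the SEGUID checksum is defined as the base64-encoded
SHA-1 digest of the upper-case sequence, with the trailing '=' padding
removed. A standard-library sketch of this definition (an illustration,
not the 'Bio.SeqUtils.CheckSum.seguid' implementation itself):

```python
import base64
import hashlib

def seguid_sketch(sequence):
    """Base64 of the SHA-1 digest of the upper-case sequence,
    with the trailing '=' padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")

# A 20 byte SHA-1 digest always gives a 27 character checksum,
# and the checksum is case insensitive:
checksum = seguid_sketch("cgtaacaaggtttccgtaggtgaacctgcgg")
print(len(checksum))  # 27
print(checksum == seguid_sketch("CGTAACAAGGTTTCCGTAGGTGAACCTGCGG"))  # True
```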
Now, recall the 'Bio.SeqIO.to_dict()' function's 'key_function' argument
expects a function which turns a 'SeqRecord' into a string. We can't use
the 'seguid()' function directly because it expects to be given a 'Seq'
object (or a string). However, we can use Python's 'lambda' feature to
create a "one off" function to give to 'Bio.SeqIO.to_dict()' instead:
<<from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
seguid_dict = SeqIO.to_dict(SeqIO.parse(open("ls_orchid.gbk"), "genbank"),
                            lambda rec : seguid(rec.seq))
record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
print record.id
print record.description

That should have retrieved the record Z78532.1, the second entry in the
file.
5.4 Writing Sequence Files
*=*=*=*=*=*=*=*=*=*=*=*=*=*
We've talked about using 'Bio.SeqIO.parse()' for sequence input (reading
files), and now we'll look at 'Bio.SeqIO.write()' which is for sequence
output (writing files). This is a function taking three arguments: some
'SeqRecord' objects, a handle to write to, and a sequence format string.
Here is an example, where we start by creating a few 'SeqRecord' objects
the hard way (by hand, rather than by loading them from a file):
<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                    +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                    +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                    +"SSAC", generic_protein),
                 id="gi|14150838|gb|AAK54648.1|AF376133_1",
                 description="chalcone synthase [Cucumis sativus]")

rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
                    +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
                 id="gi|13919613|gb|AAK33142.1|",
                 description="chalcone synthase [Fragaria vesca subsp. bracteata]")

rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
                    +"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
                    +"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
                    +"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
                    +"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
                    +"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
                    +"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
                 id="gi|13925890|gb|AAK49457.1|",
                 description="chalcone synthase [Nicotiana tabacum]")

my_records = [rec1, rec2, rec3]
Now we have a list of 'SeqRecord' objects, we'll write them to a FASTA
format file:
<<from Bio import SeqIO
handle = open("my_example.faa", "w")
SeqIO.write(my_records, handle, "fasta")
handle.close()
And if you open this file in your favourite text editor it should look
like this:
<<>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT
Suppose you wanted to know how many records the 'Bio.SeqIO.write()'
function wrote to the handle? If your records were in a list you could
just use 'len(my_records)', however you can't do that when your records
come from a generator/iterator. Therefore as of Biopython 1.49, the
'Bio.SeqIO.write()' function returns the number of 'SeqRecord' objects
written to the file.
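The FASTA layout shown above (a ">" title line, then the sequence
wrapped at 60 characters per line) is simple enough to sketch by hand.
'format_fasta' below is a hypothetical helper for illustration, not the
'Bio.SeqIO' writer:

```python
def format_fasta(title, sequence, width=60):
    """Format one record FASTA-style: '>' plus the title line,
    then the sequence wrapped at 'width' characters per line."""
    lines = [">" + title]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"

print(format_fasta("example_id example description", "ACGT" * 20))
```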
5.4.1 Converting between sequence file formats
===============================================
In the previous example we used a list of 'SeqRecord' objects as input
to the 'Bio.SeqIO.write()' function, but it will also accept a
'SeqRecord' iterator like we get from 'Bio.SeqIO.parse()' -- this lets
us do file conversion very succinctly. For this example we'll read in
the GenBank format file ls_orchid.gbk (9) and write it out in FASTA
format:
<<from Bio import SeqIO
in_handle = open("ls_orchid.gbk", "r")
out_handle = open("my_example.fasta", "w")
records = SeqIO.parse(in_handle, "genbank")
SeqIO.write(records, out_handle, "fasta")
in_handle.close()
out_handle.close()
In principle, just by changing the filenames and the format names, this
code could be used to convert between any file formats available in
Biopython. However, writing some formats requires information (e.g.
quality scores) which other file formats don't contain. For example,
while you can turn a FASTQ file into a FASTA file, you can't do the
reverse. See also Section 14.1.6 in the cookbook chapter which looks at
inter-converting between different FASTQ formats.
You can simplify this by being lazy about closing the input file
handles. This is arguably bad style, but it is more concise. Note that
you should always close your output file handles as if you don't, your
file may not get flushed to disk immediately.
Alternatively, Python 2.6 includes 'with' as a new keyword (which can
also be enabled on Python 2.5):
<<from __future__ import with_statement #Needed on Python 2.5
from Bio import SeqIO

with open("ls_orchid.gbk") as in_handle :
    with open("my_example.fasta", "w") as out_handle :
        SeqIO.write(SeqIO.parse(in_handle, "genbank"), out_handle, "fasta")

Behind the scenes this will automatically close the file handles
(because the file objects are aware of the 'with' statement).
5.4.2 Converting a file of sequences to their reverse complements
==================================================================
Suppose you had a file of nucleotide sequences, and you wanted to turn
it into a file containing their reverse complement sequences. This time
a little bit of work is required to transform the 'SeqRecord' objects we
get from our input file into something suitable for saving to our
output file.
To start with, we'll use 'Bio.SeqIO.parse()' to load some nucleotide
sequences from a file, then print out their reverse complements using
the 'Seq' object's built in '.reverse_complement()' method (see
Section 3.6):
<<from Bio import SeqIO
in_handle = open("ls_orchid.gbk")
for record in SeqIO.parse(in_handle, "genbank") :
    print record.id
    print record.seq.reverse_complement()
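For plain strings, reverse complementing can be sketched with the
standard library (an illustration of the operation, not the 'Seq'
object's method; 'str.maketrans' is the modern spelling, Python 2 used
'string.maketrans'):

```python
# Map each base to its complement, then reverse the whole string.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Reverse complement of a plain DNA string."""
    return sequence.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```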
Now, if we want to save these reverse complements to a file, we'll need
to make 'SeqRecord' objects. For this I think it's most elegant to write
our own function, where we can decide how to name our new records:
<<from Bio.SeqRecord import SeqRecord

def make_rc_record(record) :
    """Returns a new SeqRecord with the reverse complement sequence."""
    return SeqRecord(seq = record.seq.reverse_complement(), \
                     id = "rc_" + record.id, \
                     description = "reverse complement")
We can then use this to turn the input records into reverse complement
records ready for output. If you don't mind about having all the records
in memory at once, then the Python 'map()' function is a very intuitive
way to do this:
<<from Bio import SeqIO

in_handle = open("ls_orchid.fasta")
records = map(make_rc_record, SeqIO.parse(in_handle, "fasta"))
in_handle.close()

out_handle = open("rev_comp.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
This is an excellent place to demonstrate the power of list
comprehensions, which in their simplest form are a long-winded
equivalent to using 'map()', like this:
<<records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")]
Now list comprehensions have a nice trick up their sleeves, you can add
a conditional statement:
<<records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")
            if len(rec)<700]
That would create an in memory list of reverse complement records where
the sequence length was under 700 base pairs. However, we can do exactly
the same with a generator expression - but with the advantage that this
does not create a list of all the records in memory at once:
<<records = (make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")
            if len(rec)<700)
If you don't mind being lax about closing input file handles, we have:
<<from Bio import SeqIO

records = (make_rc_record(rec) for rec in \
           SeqIO.parse(open("ls_orchid.fasta"), "fasta") \
           if len(rec)<700)
out_handle = open("rev_comp.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()

There is a related example in Section 14.1.2, translating each record in
a FASTA file from nucleotides to amino acids.
5.4.3 Getting your SeqRecord objects as formatted strings
==========================================================
Suppose that you don't really want to write your records to a file or
handle -- instead you want a string containing the records in a
particular file format. The 'Bio.SeqIO' interface is based on handles,
but Python has a useful built in module which provides a string based
handle, 'StringIO'.
For an example of how you might use this, let's load in a bunch of
'SeqRecord' objects from our orchids GenBank file, and create a string
containing the records in FASTA format:
<<from Bio import SeqIO
from StringIO import StringIO

records = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
fasta_data = out_handle.getvalue()
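The string-based handle itself is plain Python. A minimal sketch, using
'io.StringIO' (the modern home of this class; on Python 2 it lived in
the 'StringIO' module as used above):

```python
from io import StringIO

# Anything written to the handle can be read back as one string.
out_handle = StringIO()
out_handle.write(">Z78533.1\n")
out_handle.write("CGTAACAAGG\n")
data = out_handle.getvalue()
print(data)
```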
This isn't entirely straightforward the first time you see it! On the
bright side, for the special case where you would like a string
containing a single record in a particular file format, Biopython 1.48
added a new 'format()' method to the 'SeqRecord' class (see
Section 4.5).
Note that although we don't encourage it, you can use the 'format()'
method to write to a file, like this:
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
for record in record_iterator :
    out_handle.write(record.format("tab"))
out_handle.close()
While this style of code will work for a simple sequential file format
like FASTA or the simple tab separated format used in this example, it
will not work for more complex or interlaced file formats. This is why
we still recommend using 'Bio.SeqIO.write()', as in the following
example:
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
SeqIO.write(record_iterator, out_handle, "tab")
out_handle.close()
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
(2) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(3) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(4) http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(8) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
(9) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
Chapter 6 Sequence Alignment Input/Output, and Alignment Tools
*****************************************************************
In this chapter we'll discuss the 'Bio.AlignIO' module, which is very
similar to the 'Bio.SeqIO' module from the previous chapter, but deals
with 'Alignment' objects rather than 'SeqRecord' objects. This aims to
provide a simple interface for working with assorted sequence alignment
file formats in a uniform way.
Note that both 'Bio.SeqIO' and 'Bio.AlignIO' can read and write sequence
alignment files. The appropriate choice will depend largely on what you
want to do with the data.
The final part of this chapter is about our command line wrappers for
common multiple sequence alignment tools like ClustalW and MUSCLE.
6.1 Parsing or Reading Sequence Alignments
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We have two functions for reading in sequence alignments,
'Bio.AlignIO.read()' and 'Bio.AlignIO.parse()' which, following the
convention introduced in 'Bio.SeqIO', are for files containing one or
multiple alignments respectively.
Using 'Bio.AlignIO.parse()' will return an iterator which gives
'Alignment' objects. Iterators are typically used in a for loop.
Examples of situations where you will have multiple different alignments
include resampled alignments from the PHYLIP tool 'seqboot', or multiple
pairwise alignments from the EMBOSS tools 'water' or 'needle', or Bill
Pearson's FASTA tools.
However, in many situations you will be dealing with files which contain
only a single alignment. In this case, you should use the
'Bio.AlignIO.read()' function which returns a single 'Alignment' object.
Both functions expect two mandatory arguments:
1. The first argument is a handle to read the data from, typically an
open file (see Section 18.1).
2. The second argument is a lower case string specifying the alignment
format. As in 'Bio.SeqIO' we don't try and guess the file format for
you! See http://biopython.org/wiki/AlignIO for a full listing of
supported formats.

There is also an optional 'seq_count' argument which is discussed in
Section 6.1.3 below for dealing with ambiguous file formats which may
contain more than one alignment.
Biopython 1.49 introduced a further optional 'alphabet' argument
allowing you to specify the expected alphabet. This can be useful as
many alignment file formats do not explicitly label the sequences as
RNA, DNA or protein -- which means 'Bio.AlignIO' will default to using a
generic alphabet.

6.1.1 Single Alignments
========================

As an example, consider the following annotation rich protein
alignment in the PFAM or Stockholm file format:
<<# STOCKHOLM 1.0
#=GS COATB_BPIKE/30-81  AC P03620.1
#=GS COATB_BPIKE/30-81  DR PDB; 1ifl ; 1-52;
#=GS Q9T0Q8_BPIKE/1-52  AC Q9T0Q8.1
#=GS COATB_BPI22/32-83  AC P15416.1
#=GS COATB_BPM13/24-72  AC P69541.1
#=GS COATB_BPM13/24-72  DR PDB; 2cpb ; 1-49;
#=GS COATB_BPM13/24-72  DR PDB; 2cps ; 1-49;
#=GS COATB_BPZJ2/1-49   AC P03618.1
#=GS Q9T0Q9_BPFD/1-49   AC Q9T0Q9.1
#=GS Q9T0Q9_BPFD/1-49   DR PDB; 1nh4 A; 1-49;
#=GS COATB_BPIF1/22-73  AC P03619.2
#=GS COATB_BPIF1/22-73  DR PDB; 1ifk ; 1-50;
COATB_BPIKE/30-81            AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
#=GR COATB_BPIKE/30-81  SS   -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH----
Q9T0Q8_BPIKE/1-52            AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
COATB_BPI22/32-83            DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
COATB_BPM13/24-72            AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR COATB_BPM13/24-72  SS   ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT--
COATB_BPZJ2/1-49             AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
Q9T0Q9_BPFD/1-49             AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR Q9T0Q9_BPFD/1-49   SS   ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--
COATB_BPIF1/22-73            FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#=GR COATB_BPIF1/22-73  SS   XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH---
#=GC SS_cons                 XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC--
#=GC seq_cons                AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA
//
This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM
entry, downloaded as a compressed archive from
http://pfam.sanger.ac.uk/family/alignment/download/gzipped?acc=PF05371&alnType=seed.
We can load this file as follows (assuming it has been saved to disk as
"PF05371_seed.sth" in the current working directory):
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print alignment

This code will print out a summary of the alignment:
<<SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
You'll notice in the above output the sequences have been truncated.
We could instead write our own code to format this as we please by
iterating over the rows as 'SeqRecord' objects:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print "Alignment length %i" % alignment.get_alignment_length()
for record in alignment :
    print "%s - %s" % (record.seq, record.id)

This will give the following output:
<<Alignment length 52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
You could also use the alignment object's 'format' method to show it
in a particular file format -- see Section 6.2.2 for details.
Did you notice in the raw file above that several of the sequences
include database cross-references to the PDB and the associated known
secondary structure? Try this:
<<for record in alignment :
    if record.dbxrefs :
        print record.id, record.dbxrefs

This gives:
<<COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

To have a look at all the sequence annotation, try this:
<<for record in alignment :
    print record.id
    print record.annotations

Sanger provide a nice web interface at
http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you
download this alignment in several other formats. This is what the file
looks like in the FASTA file format:
<<>COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
>Q9T0Q8_BPIKE/1-52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
>COATB_BPI22/32-83
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
>COATB_BPM13/24-72
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
>Q9T0Q9_BPFD/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPIF1/22-73
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA

Note the website should have an option for showing gaps as periods
(dots) or dashes; we've shown dashes above. Assuming you download and
save this as file "PF05371_seed.faa" then you can load it with almost
exactly the same code:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.faa"), "fasta")
print alignment
All that has changed in this code is the filename and the format
string. You'll get the same output as before; the sequences and record
identifiers are the same. However, as you should expect, if you check
each 'SeqRecord' there is no annotation nor database cross-references,
because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used
'Bio.AlignIO' to convert the original Stockholm format file into a FASTA
file yourself (see below).
With any supported file format, you can load an alignment in exactly
the same way just by changing the format string. For example, use
"phylip" for PHYLIP files, "nexus" for NEXUS files or "emboss" for the
alignments output by the EMBOSS tools. There is a full listing on the
wiki page (http://biopython.org/wiki/AlignIO) and in the built in
documentation (also online (1)):
<<>>> from Bio import AlignIO
>>> help(AlignIO)

6.1.2 Multiple Alignments
==========================

The previous section focused on reading files containing a single
alignment. In general however, files can contain more than one
alignment, and to read these files we must use the 'Bio.AlignIO.parse()'
function.
Suppose you have a small alignment in PHYLIP format.
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools,
one of the steps would be to create a set of many resampled alignments
using the tool 'seqboot'. This would give output which has been
abbreviated here for conciseness.
If you wanted to read this in using 'Bio.AlignIO' you could use:
<<from Bio import AlignIO
alignments = AlignIO.parse(open("resampled.phy"), "phylip")
for alignment in alignments :
    print alignment

This would give the following output, again abbreviated for display:
<<SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
As with the function 'Bio.SeqIO.parse()', using 'Bio.AlignIO.parse()'
returns an iterator. If you want to keep all the alignments in memory at
once, which will allow you to access them in any order, then turn the
iterator into a list:
<<from Bio import AlignIO
alignments = list(AlignIO.parse(open("resampled.phy"), "phylip"))
last_align = alignments[-1]
first_align = alignments[0]

6.1.3 Ambiguous Alignments
===========================

Many alignment file formats can explicitly store more than one
alignment, and the division between each alignment is clear. However,
when a general sequence file format has been used there is no such block
structure. The most common such situation is when alignments have been
saved in the FASTA file format. For example consider the following:
This could be a single alignment containing six sequences (with
repeated identifiers). Or, judging from the identifiers, this is
probably two different alignments each with three sequences, which
happen to all have the same length.
What about this next example?
Again, this could be a single alignment with six sequences. However,
this time based on the identifiers we might guess this is three pairwise
alignments which by chance have all got the same lengths.
This final example is similar:
<<>Alpha
ACTACGACTAGCTCAG--G
>XXX
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>YYY
ACTACGGCAAGCACAGG
>Alpha
--ACTACGAC--TAGCTCAGG
>ZZZ
GGACTACGACAATAGCTCAGG

In this third example, because of the differing lengths, this cannot
be treated as a single alignment containing all six records. However, it
could be three pairwise alignments.
Clearly trying to store more than one alignment in a FASTA file is not
ideal. However, if you are forced to deal with these as input files
'Bio.AlignIO' can cope with the most common situation where all the
alignments have the same number of records. One example of this is a
collection of pairwise alignments, which can be produced by the EMBOSS
tools 'needle' and 'water' -- although in this situation, 'Bio.AlignIO'
should be able to understand their native output using "emboss" as the
format string.
To interpret these FASTA examples as several separate alignments, we
can use 'Bio.AlignIO.parse()' with the optional 'seq_count' argument
which specifies how many sequences are expected in each alignment (in
these examples, 3, 2 and 2 respectively). For example, using the third
example as the input data:
<<for alignment in AlignIO.parse(handle, "fasta", seq_count=2) :
    print "Alignment length %i" % alignment.get_alignment_length()
    for record in alignment :
        print "%s - %s" % (record.seq, record.id)
This gives:
<<Alignment length 19
ACTACGACTAGCTCAG--G - Alpha
ACTACCGCTAGCTCAGAAG - XXX
Alignment length 17
ACTACGACTAGCTCAGG - Alpha
ACTACGGCAAGCACAGG - YYY
Alignment length 21
--ACTACGAC--TAGCTCAGG - Alpha
GGACTACGACAATAGCTCAGG - ZZZ

Using 'Bio.AlignIO.read()' or 'Bio.AlignIO.parse()' without the
'seq_count' argument would give a single alignment containing all six
records for the first two examples. For the third example, an exception
would be raised because the lengths differ, preventing them being turned
into a single alignment.
If the file format itself has a block structure allowing 'Bio.AlignIO'
to determine the number of sequences in each alignment directly, then
the 'seq_count' argument is not needed. If it is supplied, and doesn't
agree with the file contents, an error is raised.
Note that this optional 'seq_count' argument assumes each alignment in
the file has the same number of sequences. Hypothetically you may come
across stranger situations, for example a FASTA file containing several
alignments each with a different number of sequences -- although I would
love to hear of a real world example of this. Assuming you cannot get
the data in a nicer file format, there is no straightforward way to
deal with this using 'Bio.AlignIO'. In this case, you could consider
reading in the sequences themselves using 'Bio.SeqIO' and batching them
together to create the alignments as appropriate.
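One way to batch records read with 'Bio.SeqIO' into fixed-size groups
is a small helper based on 'itertools.islice'. This is a sketch under
the assumption that every alignment in the file has the same known
number of sequences (the batch size); each batch could then be turned
into an alignment object:

```python
from itertools import islice

def batch_iterator(records, batch_size):
    # Yield successive lists of up to batch_size items from any
    # iterator, e.g. the record iterator from Bio.SeqIO.parse().
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch
```

Note the final batch may be short if the total is not an exact
multiple of the batch size, which for this use case would indicate a
malformed input file.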
6.2 Writing Alignments
*=*=*=*=*=*=*=*=*=*=*=*

We've talked about using 'Bio.AlignIO.read()' and
'Bio.AlignIO.parse()' for alignment input (reading files), and now we'll
look at 'Bio.AlignIO.write()' which is for alignment output (writing
files). This is a function taking three arguments: some 'Alignment'
objects, a handle to write to, and a sequence format.
Here is an example, where we start by creating a few 'Alignment'
objects the hard way (by hand, rather than by loading them from a file):
<<from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped
alphabet = Gapped(IUPAC.unambiguous_dna)

align1 = Alignment(alphabet)
align1.add_sequence("Alpha", "ACTGCTAGCTAG")
align1.add_sequence("Beta",  "ACT-CTAGCTAG")
align1.add_sequence("Gamma", "ACTGCTAGDTAG")

align2 = Alignment(alphabet)
align2.add_sequence("Delta",   "GTCAGC-AG")
align2.add_sequence("Epsilon", "GACAGCTAG")
align2.add_sequence("Zeta",    "GTCAGCTAG")

align3 = Alignment(alphabet)
align3.add_sequence("Eta",   "ACTAGTACAGCTG")
align3.add_sequence("Theta", "ACTAGTACAGCT-")
align3.add_sequence("Iota",  "-CTACTACAGGTG")

my_alignments = [align1, align2, align3]
Now we have a list of 'Alignment' objects, we'll write them to a
PHYLIP format file:
<<from Bio import AlignIO
handle = open("my_example.phy", "w")
AlignIO.write(my_alignments, handle, "phylip")
handle.close()

And if you open this file in your favourite text editor it should look
something like this (abbreviated here):
<<Theta      ACTAGTACAG CT-

It's more common to want to load an existing alignment, and save that,
perhaps after some simple manipulation like removing certain rows or
columns.
Suppose you wanted to know how many alignments the
'Bio.AlignIO.write()' function wrote to the handle? If your alignments
were in a list like the example above, you could just use
'len(my_alignments)', however you can't do that when your records come
from a generator/iterator. Therefore as of Biopython 1.49, the
'Bio.AlignIO.write()' function returns the number of alignments written
to the file.
6.2.1 Converting between sequence alignment file formats
=========================================================

Converting between sequence alignment file formats with 'Bio.AlignIO'
works in the same way as converting between sequence file formats with
'Bio.SeqIO' -- we generally load the alignment(s) using
'Bio.AlignIO.parse()' and then save them using 'Bio.AlignIO.write()'.
For this example, we'll load the PFAM/Stockholm format file used
earlier and save it as a Clustal W format file:
<<from Bio import AlignIO
alignments = AlignIO.parse(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.aln", "w")
AlignIO.write(alignments, handle, "clustal")
handle.close()

The 'Bio.AlignIO.write()' function expects to be given multiple
alignment objects. In the example above we gave it the alignment
iterator returned by 'Bio.AlignIO.parse()'.
In this case, we know there is only one alignment in the file so we
could have used 'Bio.AlignIO.read()' instead, but notice we have to pass
this alignment to 'Bio.AlignIO.write()' as a single element list:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.aln", "w")
AlignIO.write([alignment], handle, "clustal")
handle.close()
Either way, you should end up with the same new Clustal W format file
"PF05371_seed.aln" with the following content:
<<CLUSTAL X (1.81) multiple sequence alignment


COATB_BPIKE/30-81             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/1-52             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/32-83             DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/24-72             AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/1-49              AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/1-49              AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/22-73             FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS

COATB_BPIKE/30-81             KA
Q9T0Q8_BPIKE/1-52             RA
COATB_BPI22/32-83             KA
COATB_BPM13/24-72             KA
COATB_BPZJ2/1-49              KA
Q9T0Q9_BPFD/1-49              KA
COATB_BPIF1/22-73             RA
Alternatively, you could make a PHYLIP format file which we'll name
"PF05371_seed.phy":
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.phy", "w")
AlignIO.write([alignment], handle, "phylip")
handle.close()

This time the output looks like this:
<< 7 52
COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
One of the big handicaps of the PHYLIP alignment file format is that
the sequence identifiers are strictly truncated at ten characters. In
this example, as you can see the resulting names are still unique -- but
they are not very readable. In this particular case, there is no clear
way to compress the identifiers, but for the sake of argument you may
want to assign your own names or numbering system. The following bit of
code manipulates the record identifiers before saving the output:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
name_mapping = {}
for i, record in enumerate(alignment) :
    name_mapping[i] = record.id
    record.id = "seq%i" % i
print name_mapping

handle = open("PF05371_seed.phy", "w")
AlignIO.write([alignment], handle, "phylip")
handle.close()
This code used a Python dictionary to record a simple mapping from the
new sequence names back to the original identifiers:
<<{0: 'COATB_BPIKE/30-81', 1: 'Q9T0Q8_BPIKE/1-52', 2:
'COATB_BPI22/32-83', ...}
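That mapping also lets you restore the original identifiers once the
file has been written. Here is a plain-Python sketch of the round trip,
using a hypothetical minimal Record class in place of 'SeqRecord':

```python
class Record:
    # Hypothetical stand-in for SeqRecord: all we need is an .id attribute.
    def __init__(self, id):
        self.id = id

records = [Record("COATB_BPIKE/30-81"), Record("Q9T0Q8_BPIKE/1-52")]

name_mapping = {}
for i, record in enumerate(records):
    name_mapping[i] = record.id   # remember the original name
    record.id = "seq%i" % i       # short, PHYLIP-safe name

# ...write the PHYLIP file here, while the short names are in place...

for i, record in enumerate(records):
    record.id = name_mapping[i]   # restore the original names
```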
Here is the new PHYLIP format output:
<< 7 52
seq0       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
seq1       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
seq2       DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
seq3       AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq4       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
seq5       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq6       FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS

In general, because of the identifier limitation, working with PHYLIP
file formats shouldn't be your first choice. Using the PFAM/Stockholm
format on the other hand allows you to record a lot of additional
annotation.
6.2.2 Getting your Alignment objects as formatted strings
==========================================================

The 'Bio.AlignIO' interface is based on handles, which means if you
want to get your alignment(s) into a string in a particular file format
you need to do a little bit more work (see below). However, you will
probably prefer to take advantage of the new 'format()' method added to
the 'Alignment' object in Biopython 1.48. This takes a single mandatory
argument, a lower case string which is supported by 'Bio.AlignIO' as an
output format. For example:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print alignment.format("clustal")
As described in Section 4.5, the 'SeqRecord' object has a similar
method using output formats supported by 'Bio.SeqIO'.
Internally the 'format()' method is using the 'StringIO' string based
handle and calling 'Bio.AlignIO.write()'. You can do this in your own
code if for example you are using an older version of Biopython:
<<from Bio import AlignIO
from StringIO import StringIO

alignments = AlignIO.parse(open("PF05371_seed.sth"), "stockholm")

out_handle = StringIO()
AlignIO.write(alignments, out_handle, "clustal")
clustal_data = out_handle.getvalue()
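The same in-memory handle pattern works with anything that writes to a
handle. Note that under Python 3 the 'StringIO' class lives in the 'io'
module instead; a minimal sketch of the idiom on its own:

```python
from io import StringIO  # Python 3 location; Python 2 used the StringIO module

out_handle = StringIO()
# Anything expecting a writable handle can write here instead of a file.
out_handle.write("CLUSTAL X (1.81) multiple sequence alignment\n")
clustal_data = out_handle.getvalue()  # the accumulated text as one string
```

The handle accumulates everything written to it, and 'getvalue()' hands
it back as a single string without any temporary file on disk.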
6.3 Alignment Tools
*=*=*=*=*=*=*=*=*=*=

There are lots of algorithms out there for aligning sequences, both
pairwise alignments and multiple sequence alignments. These calculations
are relatively slow, and you generally wouldn't want to write such an
algorithm in Python. Instead, you can use Biopython to invoke a command
line tool on your behalf. Normally you would:

1. Prepare an input file of your unaligned sequences, typically this
will be a FASTA file which you might create using 'Bio.SeqIO' (see
Chapter 5).
2. Call the command line tool to process this input file, typically
via one of Biopython's command line wrappers (which we'll discuss
here).
3. Read the output from the tool, i.e. your aligned sequences,
typically using 'Bio.AlignIO' (see earlier in this chapter).
All the command line wrappers we're going to talk about in this
chapter follow the same style. You create a command line object
specifying the options (e.g. the input filename and the output
filename), then invoke this command line via a Python operating system
call (e.g. using the subprocess module).
Most of these wrappers are defined in the 'Bio.Align.Applications'
module:
<<>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline',
'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline',
'TCoffeeCommandline' ...]

(Ignore the entries starting with an underscore -- these have special
meaning in Python.) The module 'Bio.Emboss.Applications' has wrappers
for some of the EMBOSS suite (2), including needle and water, which are
described below in Section 6.3.5. We won't explore all these alignment
tools here in this section, just a sample, but the same principles apply.
6.3.1 ClustalW
===============

ClustalW is a popular command line tool for multiple sequence
alignment (there is also a graphical interface called ClustalX).
Biopython's 'Bio.Align.Applications' module has a wrapper for this
alignment tool (and several others).
Before trying to use ClustalW from within Python, you should first try
running the ClustalW tool yourself by hand at the command line, to
familiarise yourself with the other options. You'll find the Biopython
wrapper is very faithful to the actual command line API:
<<>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)

For the most basic usage, all you need is to have a FASTA input file,
such as opuntia.fasta (3) (available online or in the Doc/examples
subdirectory of the Biopython source code). This is a small FASTA file
containing seven prickly-pear DNA sequences (from the cactus family
Opuntia).
By default ClustalW will generate an alignment and guide tree file
with names based on the input FASTA file, in this case opuntia.aln and
opuntia.dnd, but you can override this or make it explicit:
<<>>> from Bio.Align.Applications import ClustalwCommandline
>>> cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
>>> print cline
clustalw2 -infile=opuntia.fasta

Notice here we have given the executable name as clustalw2, indicating
we have version two installed, which has a different filename to version
one (clustalw, the default). Fortunately both versions support the same
set of arguments at the command line (and indeed, should be functionally
identical).
You may find that even though you have ClustalW installed, the above
command doesn't work -- you may get a message about "command not found"
(especially on Windows). This indicates that the ClustalW executable is
not on your PATH (an environment variable, a list of directories to be
searched). You can either update your PATH setting to include the
location of your copy of the ClustalW tools (how you do this will depend
on your OS), or simply type in the full path of the tool. You can also
tell Biopython the full path to the tool, for example:
<<>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
>>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
>>> cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")

Remember, in Python a default string '\n' and/or '\t' translates as a
new line and/or a tab -- which is why we've put a letter "r" at the
start for a raw string that isn't translated in this way. This is
generally good practice when specifying a Windows style file name.
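The difference is easy to check interactively; in a normal string '\n'
is a single newline character, while in a raw string the backslash
survives:

```python
# "\n" in a normal string is one character: a newline.
assert len("\n") == 1
# In a raw string the backslash is kept, giving two characters.
assert len(r"\n") == 2
# This is why r"C:\new clustal" keeps its backslash and "n" intact,
# rather than putting a newline in the middle of the path.
path = r"C:\new clustal"
assert "\n" not in path
```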
You can now run this command line, and in Python the recommended way
to run another program is to use the 'subprocess' module. This replaces
older options like the 'os.system()' and the 'os.popen*' functions.
Now, at this point it helps to know how command line tools "work".
When you run a tool at the command line, it will often print text output
directly to screen. This text can be captured or redirected, via two
"pipes", called standard output (the normal results) and standard error
(for error messages and debug messages). There is also standard input,
which is any text fed into the tool. These names get shortened to stdin,
stdout and stderr. When the tool finishes, it has a return code (an
integer), which by convention is zero for success.
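This pattern can be tried safely with any command; the sketch below
uses the Python interpreter itself as a stand-in child process (so it
runs anywhere, without ClustalW installed), capturing stdout, stderr
and the return code:

```python
import subprocess
import sys

# Launch a child process; here the "tool" is just Python printing a line.
child = subprocess.Popen([sys.executable, "-c", "print('hello')"],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
stdout_data, stderr_data = child.communicate()  # wait and collect the pipes
return_code = child.returncode  # by convention, zero means success
```

Passing the command as a list of arguments avoids any shell quoting
issues; with a real tool like ClustalW the list would start with the
executable name followed by its options.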
Now let's try an example. Unfortunately there are some subtle
differences depending on your Operating System (e.g. Windows versus
Unix) and how you are running Python (e.g. at the command line, or in
IDLE). You may see a second terminal window appear while ClustalW works.
You may see the ClustalW output appear at your Python prompt.
<<>>> import sys
>>> import subprocess
>>> return_code = subprocess.call(str(cline),
...                               shell=(sys.platform!="win32"))
>>> print return_code

In the case of ClustalW, when run at the command line all the important
output is written directly to the output files. Everything printed to
screen while you wait (via stdout or stderr) is boring and can be
ignored (assuming it worked). You can explicitly tell ClustalW to send
this output to "dev null", a kind of black hole for command line tool
output:
<<>>> import os
>>> import sys
>>> import subprocess
>>> return_code = subprocess.call(str(cline),
...                               stdout=open(os.devnull, "w"),
...                               stderr=open(os.devnull, "w"),
...                               shell=(sys.platform!="win32"))
>>> print return_code

This time ClustalW should be "quiet", because all its terminal output
has been discarded. What we care about are the two output files, the
alignment and the guide tree. We didn't tell ClustalW what filenames to
use, but it defaults to picking names based on the input file. In this
case the output should be in the file 'opuntia.aln'. You should be able
to work out how to read in the alignment using 'Bio.AlignIO' by now:
<<>>> from Bio import AlignIO
>>> align = AlignIO.read(open("opuntia.aln"), "clustal")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
In case you are interested (and this is an aside from the main thrust
of this chapter), the opuntia.dnd file ClustalW creates is just a
standard Newick tree file, and 'Bio.Nexus' can parse these. You can
either parse the file directly (as though it were a light weight NEXUS
file), or just go directly to a tree object:
<<>>> from Bio.Nexus import Trees
>>> tree_string = open("opuntia.dnd").read()
>>> tree = Trees.Tree(tree_string)
((gi|6273291|gb|AF191665.1|AF191665,(gi|6273290|gb|AF191664.1|AF191664,
gi|6273289|gb|AF191663.1|AF191663)),(gi|6273287|gb|AF191661.1|AF191661,
gi|6273286|gb|AF191660.1|AF191660),(gi|6273285|gb|AF191659.1|AF191659,
gi|6273284|gb|AF191658.1|AF191658));
>>> print tree.display()
  #  taxon                              prev  succ       brlen blen (sum) support comment
  0  -                                  None  [1, 6, 9]  0.0
  2  gi|6273291|gb|AF191665.1|AF191665  1     []         0.00418
  3  -                                  1     [4, 5]     0.00083
  4  gi|6273290|gb|AF191664.1|AF191664  3     []         0.00189
  5  gi|6273289|gb|AF191663.1|AF191663  3     []         0.00145
  6  -                                  0     [7, 8]     0.00014
  7  gi|6273287|gb|AF191661.1|AF191661  6     []         0.00489
  8  gi|6273286|gb|AF191660.1|AF191660  6     []         0.00295
  9  -                                  0     [10, 11]   0.00125
 10  gi|6273285|gb|AF191659.1|AF191659  9     []         0.00094
 11  gi|6273284|gb|AF191658.1|AF191658  9     []         0.00018
The spacing has been adjusted here for display purposes. The tree
object is actually pretty powerful! Have a look at the list of methods
with dir(tree) to get a hint of this...
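As an aside on the format itself: a Newick file is just nested
parentheses with comma-separated leaf names (plus optional ':' branch
lengths). Extracting the leaf names from a simple tree needs only a
regular expression; this toy sketch uses a made-up miniature tree and
ignores branch lengths, so it is not what 'Bio.Nexus' does internally:

```python
import re

# A hypothetical miniature Newick string with five leaves.
newick = "((A,(B,C)),(D,E));"
# Pull out runs of name characters; real Newick handling would also
# need to strip ":0.0123" branch lengths and quoted labels.
leaves = re.findall(r"[A-Za-z0-9_.|]+", newick)
```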
6.3.2 MUSCLE
=============

MUSCLE is a more recent multiple sequence alignment tool than
ClustalW, and Biopython also has a wrapper for it under the
'Bio.Align.Applications' module. As before, we recommend you try using
MUSCLE from the command line before trying it from within Python, as the
Biopython wrapper is very faithful to the actual command line API:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> help(MuscleCommandline)

For the most basic usage, all you need is to have a FASTA input file,
such as opuntia.fasta (4) (available online or in the Doc/examples
subdirectory of the Biopython source code). You can then tell MUSCLE to
read in this FASTA file, and write the alignment to an output file:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
>>> print cline
muscle -in opuntia.fasta -out opuntia.txt

Note that MUSCLE uses "-in" and "-out" but in Biopython we have to use
"input" and "out" as the keyword arguments or property names. This is
because "in" is a reserved word in Python.
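You can confirm which names are reserved with the standard library's
'keyword' module:

```python
import keyword

assert keyword.iskeyword("in")         # "in" is reserved, hence the rename
assert not keyword.iskeyword("input")  # "input" is an ordinary name
```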
By default MUSCLE will output the alignment as a FASTA file (using
gapped sequences). The 'Bio.AlignIO' module should be able to read this
alignment using format="fasta". You can also ask for ClustalW-like
output:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta",
...                           out="opuntia.aln", clw=True)
>>> print cline
muscle -in opuntia.fasta -out opuntia.aln -clw

Or, strict ClustalW output where the original ClustalW header line is
used for maximum compatibility:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta",
...                           out="opuntia.aln", clwstrict=True)
>>> print cline
muscle -in opuntia.fasta -out opuntia.aln -clwstrict
The 'Bio.AlignIO' module should be able to read these alignments using
format="clustal".
MUSCLE can also output in GCG MSF format (using the msf argument), but
Biopython can't currently parse that, or using HTML which would give a
human readable web page (not suitable for parsing).
You can also set the other optional parameters, for example the
maximum number of iterations. See the built in help for details.
You would then run the MUSCLE command line string as described above
for ClustalW, and parse the output using 'Bio.AlignIO' to get an
alignment object.
6.3.3 MUSCLE using stdout
==========================

Using a MUSCLE command line as in the examples above will write the
alignment to a file. This means there will be no important information
written to the standard out (stdout) or standard error (stderr) handles.
However, by default MUSCLE will write the alignment to standard output
(stdout). We can take advantage of this to avoid having a temporary
output file! For example:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta")
>>> print cline
muscle -in opuntia.fasta
4322
Now, let's run this and capture the output as handles. Remember that
4323
MUSCLE defaults to using FASTA as the output format:
<<>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(cline),
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          shell=(sys.platform!="win32"))
>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "fasta")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658
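The same capture-stdout pattern works for any command-line tool. As a minimal, Biopython-free sketch, here we use the Python interpreter itself as a stand-in child process (a hypothetical substitute for muscle, so the example runs anywhere):

```python
import subprocess
import sys

# Stand-in for a tool that writes its result to stdout (here: fake FASTA).
cmd = [sys.executable, "-c", "print('>seq1'); print('ACGT')"]
child = subprocess.Popen(cmd,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
out, err = child.communicate()  # read stdout/stderr and wait for exit
print(out.decode().strip())     # the "alignment" the child wrote to stdout
```

With the real MUSCLE command line, 'child.stdout' would be passed straight to 'AlignIO.read' as shown above.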
6.3.4 MUSCLE using stdin and stdout
====================================

We don't actually need to have our FASTA input sequences prepared in a
file, because by default MUSCLE will read in the input sequences from
standard input (stdin).

First, we'll need some unaligned sequences in memory as 'SeqRecord'
objects. For this demonstration I'm going to use a filtered version of
the original FASTA file (using a generator expression), taking just six
of the seven sequences:

<<>>> from Bio import SeqIO
>>> records = (r for r in SeqIO.parse(open("opuntia.fasta"), "fasta")
...            if len(r) < 900)

Then we create the MUSCLE command line, leaving the input and output
to their defaults (stdin and stdout). I'm also going to ask for strict
ClustalW format as the output format:

<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(clwstrict=True)
Now comes the clever bit, using the 'subprocess' module, stdin and
stdout:

<<>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(cline),
...                          stdin=subprocess.PIPE,
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          shell=(sys.platform!="win32"))

That should start MUSCLE, but it will be sitting waiting for its FASTA
input sequences, which we must supply via its stdin handle:

<<>>> SeqIO.write(records, child.stdin, "fasta")
>>> child.stdin.close()
After writing the six sequences to the handle, MUSCLE will still be
waiting to see if that is all the FASTA sequences or not -- so we must
signal that this is all the input data by closing the handle. At that
point MUSCLE should start to run, and we can ask for the output:

<<>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "clustal")
>>> print align
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165
Wow! There we are with a new alignment of just the six records,
without having created a temporary FASTA input file, or a temporary
alignment output file. However, a word of caution: dealing with errors
with this style of calling external programs is much more complicated.
It also becomes far harder to diagnose problems, because you can't try
running MUSCLE manually outside of Biopython (because you don't have the
input file to supply). There can also be subtle cross-platform issues
(e.g. Windows versus Linux), and how you run your script can have an
impact (e.g. at the command line, from IDLE or an IDE, or as a GUI
script). These are all generic Python issues though, and not specific to
Biopython.
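The write-to-stdin, read-from-stdout dance generalises to any filter-style program. Here is a minimal sketch with a stand-in child process that upper-cases whatever it receives (a hypothetical substitute for MUSCLE, so the example runs without it installed):

```python
import subprocess
import sys

# Stand-in filter: reads all of stdin, writes the upper-cased text to stdout.
filter_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
child = subprocess.Popen([sys.executable, "-c", filter_code],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE)
child.stdin.write(b">seq1\nacgt\n")
child.stdin.close()            # signal end of input, just as with MUSCLE
result = child.stdout.read()   # then collect what the child produced
child.wait()
print(result.decode())
```

With MUSCLE itself, 'SeqIO.write(records, child.stdin, "fasta")' plays the role of the 'write' call, and 'AlignIO.read(child.stdout, "clustal")' the role of the 'read'.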
6.3.5 EMBOSS needle and water
==============================

The EMBOSS (5) suite includes the water and needle tools for
Smith-Waterman algorithm local alignment, and Needleman-Wunsch global
alignment. The tools share the same style interface, so switching
between the two is trivial -- we'll just use needle here.

Suppose you want to do a global pairwise alignment between two
sequences, prepared in FASTA format as follows:
>HBA_HUMAN
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR

in a file alpha.faa, and secondly in a file beta.faa:

>HBB_HUMAN
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
Let's start by creating a complete needle command line object in one
go:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline(asequence="alpha.faa",
...                           bsequence="beta.faa",
...                           gapopen=10, gapextend=0.5,
...                           outfile="needle.txt")
>>> print cline
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa
 -gapopen=10 -gapextend=0.5
Why not try running this by hand at the command prompt? You should see
it does a pairwise comparison and records the output in the file
needle.txt (in the default EMBOSS alignment file format).

Even if you have EMBOSS installed, running this command may not work
-- you might get a message about "command not found" (especially on
Windows). This probably means that the EMBOSS tools are not on your PATH
environment variable. You can either update your PATH setting, or simply
tell Biopython the full path to the tool, for example:
<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
...                           asequence="alpha.faa",
...                           bsequence="beta.faa",
...                           gapopen=10, gapextend=0.5,
...                           outfile="needle.txt")

Remember in Python that for a default string '\n' or '\t' means a new
line or a tab -- which is why we've put a letter "r" at the start for a
raw string.
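A quick illustration of the difference:

```python
# In an ordinary string literal "\t" collapses to a single tab character;
# the r prefix keeps the backslash and the letter as two separate characters.
plain = "C:\table"
raw = r"C:\table"
print(len(plain))     # 7 -- "\t" became one tab character
print(len(raw))       # 8 -- backslash preserved
print("\t" in plain)  # True
print("\\" in raw)    # True
```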
At this point it might help to try running the EMBOSS tools yourself
by hand at the command line, to familiarise yourself with the other
options and compare them to the Biopython help text:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> help(NeedleCommandline)
Note that you can also specify (or change or look at) the settings
like this:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline()
>>> cline.asequence="alpha.faa"
>>> cline.bsequence="beta.faa"
>>> cline.gapopen=10
>>> cline.gapextend=0.5
>>> cline.outfile="needle.txt"
>>> print cline
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa
 -gapopen=10 -gapextend=0.5
>>> print cline.outfile
needle.txt
Next we want to use Python to run this command for us. As explained
above, for full control, we recommend you use the built-in Python
subprocess module. However, for simple usage Biopython includes a
wrapper function that usually suffices:

<<>>> import subprocess
>>> import sys
>>> return_code = subprocess.call(str(cline),
...                               shell=(sys.platform!="win32"))
Needleman-Wunsch global alignment of two sequences
>>> print return_code
0

In the above, all we really care about is that the return code is zero
(success). If you want to hide the message EMBOSS prints out, then (as
in the ClustalW example above) send the stdout and stderr to dev null,
or just set the EMBOSS auto argument to True.
Next we can load the output file with 'Bio.AlignIO' as discussed
earlier in this chapter, as the emboss format:

<<>>> from Bio import AlignIO
>>> align = AlignIO.read(open("needle.txt"), "emboss")
>>> print align
SingleLetterAlphabet() alignment with 2 rows and 149 columns
MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR HBA_HUMAN
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRF...KYH HBB_HUMAN
In this example, we told EMBOSS to write the output to a file, but you
can tell it to write the output to stdout instead (useful if you don't
want a temporary output file to get rid of), and also read the input
from stdin (just like in the MUSCLE example in the section above).

This has only scratched the surface of what you can do with needle and
water. One useful trick is that the second file can contain multiple
sequences (say five), and then EMBOSS will do five pairwise alignments.

Note - Biopython includes its own pairwise alignment code in the
'Bio.pairwise2' module (written in C for speed, but with a pure Python
fallback available too). This doesn't work with alignment objects, so we
have not covered it within this chapter. See the module's docstring
(built-in help) for details.
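To see what needle's Needleman-Wunsch algorithm is doing under the hood, here is a toy, pure-Python scoring sketch (match +1, mismatch -1, linear gap -1). Real needle uses an affine gap model and a substitution matrix, so this is an illustration of the recurrence only, not a substitute:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score via dynamic programming."""
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap          # leading gaps in b
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap          # leading gaps in a
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,           # align a[i-1] with b[j-1]
                           dp[i-1][j] + gap,   # gap in b
                           dp[i][j-1] + gap)   # gap in a
    return dp[len(a)][len(b)]

print(global_alignment_score("ACGT", "ACT"))  # 2 (three matches, one gap)
```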
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html
(2) http://emboss.sourceforge.net/
(3) http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta
(4) http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta
(5) http://emboss.sourceforge.net/
Chapter 7 BLAST
***************

Hey, everybody loves BLAST right? I mean, geez, how can it get any
easier to do comparisons between one of your sequences and every other
sequence in the known world? But, of course, this section isn't about
how cool BLAST is, since we already know that. It is about the problem
with BLAST -- it can be really difficult to deal with the volume of data
generated by large runs, and to automate BLAST runs in general.

Fortunately, the Biopython folks know this only too well, so they've
developed lots of tools for dealing with BLAST and making things much
easier. This section details how to use these tools and do useful things
with BLAST.

Dealing with BLAST can be split up into two steps, both of which can
be done from within Biopython. Firstly, running BLAST for your query
sequence(s), and getting some output. Secondly, parsing the BLAST output
in Python for further analysis. We'll start by talking about running the
BLAST command line tools locally, and then discuss running BLAST via the
internet.
7.1 Running BLAST locally
*=*=*=*=*=*=*=*=*=*=*=*=*=

Running BLAST locally (as opposed to over the internet, see
Section 7.2) has two advantages:

- Local BLAST may be faster than BLAST over the internet;
- Local BLAST allows you to make your own database to search for
sequences against.

Dealing with proprietary or unpublished sequence data can be another
reason to run BLAST locally. You may not be allowed to redistribute the
sequences, so submitting them to the NCBI as a BLAST query would not be
an option.
Biopython provides lots of nice code to enable you to call local BLAST
executables from your scripts, and have full access to the many command
line options that these executables provide. You can obtain local BLAST
precompiled for a number of platforms at
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/, or can compile it
yourself from the NCBI toolbox (ftp://ftp.ncbi.nlm.nih.gov/toolbox/).

The code for calling local "standalone" BLAST is found in
'Bio.Blast.NCBIStandalone', specifically the functions 'blastall',
'blastpgp' and 'rpsblast', which correspond with the BLAST executables
that their names imply.
Let's use these functions to run 'blastall' against a local database
and return the results. First, we want to set up the paths to everything
that we'll need to do the BLAST. What we need to know is the path to the
database (which should have been prepared using 'formatdb', see
ftp://ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html) to search
against, the path to the file we want to search, and the path to the
'blastall' executable.

On Linux or Mac OS X your paths might look like this:

<<>>> my_blast_db = "/home/mdehoon/Data/Genomes/Databases/bsubtilis"
# I used formatdb to create a BLAST database named bsubtilis
# (for Bacillus subtilis) consisting of the following three files:
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nhr
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nin
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nsq

>>> my_blast_file = "m_cold.fasta"
# A FASTA file with the sequence I want to BLAST

>>> my_blast_exe = "/usr/local/blast/bin/blastall"
# The name of my BLAST executable
while on Windows you might have something like this:

<<>>> my_blast_db = r"C:\Blast\Data\bsubtilis"
# Assuming you used formatdb to create a BLAST database named bsubtilis
# (for Bacillus subtilis) consisting of the following three files:
# C:\Blast\Data\bsubtilis\bsubtilis.nhr
# C:\Blast\Data\bsubtilis\bsubtilis.nin
# C:\Blast\Data\bsubtilis\bsubtilis.nsq

>>> my_blast_file = "m_cold.fasta"
>>> my_blast_exe = r"C:\Blast\bin\blastall.exe"
The FASTA file used in this example is available here (1), as well as
with the Biopython source code.

Now that we've got that all set, we are ready to run the BLAST and
collect the results. We can do this with two lines:

<<>>> from Bio.Blast import NCBIStandalone
>>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
...     "blastn", my_blast_db, my_blast_file)
Note that the Biopython interfaces to local blast programs return two
values. The first is a handle to the blast output, which is ready to
either be saved or passed to a parser. The second is the possible error
output generated by the blast command. See Section 18.1 for more about
handles.

The error info can be hard to deal with, because if you try to do a
'error_handle.read()' and there was no error info returned, then the
'read()' call will block and not return, locking your script. In my
opinion, the best way to deal with the error is only to print it out if
you are not getting 'result_handle' results to be parsed, but otherwise
to leave it alone.
This command will generate BLAST output in XML format, as that is the
format expected by the XML parser, described in Section 7.4. For plain
text output, use the 'align_view="0"' keyword. To parse text output
instead of XML output, see Section 7.6 below. However, parsing text
output is not recommended, as the BLAST plain text output changes
frequently, breaking our parsers.

If you are interested in saving your results to a file before parsing
them, see Section 7.3. To find out how to parse the BLAST results, go to
Section 7.4.
7.2 Running BLAST over the Internet
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

We use the function 'qblast()' in the 'Bio.Blast.NCBIWWW' module to
call the online version of BLAST. This has three non-optional arguments:

- The first argument is the blast program to use for the search, as a
lower case string. The options and descriptions of the programs are
available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html.
Currently 'qblast' only works with blastn, blastp, blastx, tblastn
and tblastx.
- The second argument specifies the databases to search against.
Again, the options for this are available on the NCBI web pages at
http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.
- The third argument is a string containing your query sequence. This
can either be the sequence itself, the sequence in fasta format, or
an identifier like a GI number.

The 'qblast' function also takes a number of other optional arguments,
which are basically analogous to the different parameters you can set on
the BLAST web page. We'll just highlight a few of them here:
- The 'qblast' function can return the BLAST results in various
formats, which you can choose with the optional 'format_type'
keyword: '"HTML"', '"Text"', '"ASN.1"', or '"XML"'. The default is
'"XML"', as that is the format expected by the parser, described in
Section 7.4 below.
- The argument 'expect' sets the expectation or e-value threshold.

For more about the optional BLAST arguments, we refer you to the
NCBI's own documentation, or that built into Biopython:

<<>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
For example, if you have a nucleotide sequence you want to search
against the non-redundant database using BLASTN, and you know the GI
number of your query sequence, you can use:

<<>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nr", "8332116")
Alternatively, if we have our query sequence already in a FASTA
formatted file, we just need to open the file and read in this record as
a string, and use that as the query argument:

<<>>> from Bio.Blast import NCBIWWW
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string)
We could also have read in the FASTA file as a 'SeqRecord' and then
supplied just the sequence itself:

<<>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nr", record.seq)
Supplying just the sequence means that BLAST will assign an identifier
for your sequence automatically. You might prefer to use the 'SeqRecord'
object's format method to make a fasta string (which will include the
existing identifier):

<<>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nr",
...                                record.format("fasta"))
This approach makes more sense if you have your sequence(s) in a
non-FASTA file format which you can extract using 'Bio.SeqIO' (see
Chapter 5).

Whatever arguments you give the 'qblast()' function, you should get
back your results in a handle object (by default in XML format). The
next step would be to parse the XML output into Python objects
representing the search results (Section 7.4), but you might want to
save a local copy of the output file first.
7.3 Saving BLAST output
*=*=*=*=*=*=*=*=*=*=*=*=

Before parsing the results, it is often useful to save them into a
file so that you can use them later without having to go back and
re-BLAST everything. I find this especially useful when debugging my
code that extracts info from the BLAST files, but it could also be
useful just for making backups of things you've done.

If you don't want to save the BLAST output, you can skip to
Section 7.4. If you do, read on.

We need to be a bit careful, since we can use 'result_handle.read()' to
read the BLAST output only once -- calling 'result_handle.read()' again
returns an empty string. First, we use 'read()' and store all of the
information from the handle into a string:

<<>>> blast_results = result_handle.read()
Next, we save this string in a file:

<<>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(blast_results)
>>> save_file.close()

After doing this, the results are in the file 'my_blast.xml' and the
variable 'blast_results' contains the BLAST results in a string form.
However, the 'parse' function of the BLAST parser (described in
Section 7.4) takes a file-handle-like object, not a plain string. To get
a handle, there are two things you can do:

- Use the Python standard library module 'cStringIO'. The following
code will turn the plain string into a handle, which we can feed
directly into the BLAST parser:

<<>>> import cStringIO
>>> result_handle = cStringIO.StringIO(blast_results)

- Open the saved file for reading. Duh.

<<>>> result_handle = open("my_blast.xml")
Now that we've got the BLAST results back into a handle again, we are
ready to do something with them, so this leads us right into the parsing
section.

7.4 Parsing BLAST output
*=*=*=*=*=*=*=*=*=*=*=*=*
As mentioned above, BLAST can generate output in various formats, such
as XML, HTML, and plain text. Originally, Biopython had a parser for
BLAST plain text and HTML output, as these were the only output formats
supported by BLAST. Unfortunately, the BLAST output in these formats
kept changing, each time breaking the Biopython parsers. As keeping up
with changes in BLAST became a hopeless endeavor, especially with users
running different BLAST versions, we now recommend parsing the output in
XML format, which can be generated by recent versions of BLAST. Not only
is the XML output more stable than the plain text and HTML output, it is
also much easier to parse automatically, making Biopython a whole lot
more stable.

Though deprecated, the parsers for BLAST output in plain text or HTML
output are still available in Biopython (see Section 7.6). Use them at
your own risk: they may or may not work, depending on which BLAST
version you're using.
You can get BLAST output in XML format in various ways. For the
parser, it doesn't matter how the output was generated, as long as it is
in the XML format:

- You can use Biopython to run BLAST locally, as described in
Section 7.1.
- You can use Biopython to run BLAST over the internet, as described
in Section 7.2.
- You can do the BLAST search yourself on the NCBI site through your
web browser, and then save the results. You need to choose XML as the
format in which to receive the results, and save the final BLAST page
you get (you know, the one with all of the interesting results!) to a
file.
- You can also run BLAST locally without using Biopython, and save
the output in a file. Again, you need to choose XML as the format in
which to receive the results.

The important point is that you do not have to use Biopython scripts
to fetch the data in order to be able to parse it.
Doing things in one of these ways, you then need to get a handle to
the results. In Python, a handle is just a nice general way of
describing input to any info source so that the info can be retrieved
using 'read()' and 'readline()' functions. This is the type of input the
BLAST parser (and most other Biopython parsers) takes.
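Because the parsers only call methods like 'read()' and 'readline()', anything that provides them will do. For instance, an in-memory string can be wrapped so that it behaves like an open file (here using the StringIO class, which on recent Pythons lives in the 'io' module):

```python
from io import StringIO

# Wrap a plain string so it can be used anywhere a file handle is expected.
handle = StringIO(">seq1\nACGT\n")
print(handle.readline().strip())  # >seq1
print(handle.read().strip())      # ACGT
```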
If you followed the code above for interacting with BLAST through a
script, then you already have 'result_handle', the handle to the BLAST
results. For example, using a GI number to do an online search:

<<>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nr", "8332116")
If instead you ran BLAST some other way, and have the BLAST output (in
XML format) in the file 'my_blast.xml', all you need to do is to open
the file for reading:

<<>>> result_handle = open("my_blast.xml")

Now that we've got a handle, we are ready to parse the output. The
code to parse it is really quite small. If you expect a single BLAST
result (i.e. you used a single query):

<<>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

or, if you have lots of results (i.e. multiple query sequences):

<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
Just like 'Bio.SeqIO' and 'Bio.AlignIO' (see Chapters 5 and 6), we
have a pair of input functions, 'read' and 'parse', where 'read' is for
when you have exactly one object, and 'parse' is an iterator for when
you can have lots of objects -- but instead of getting 'SeqRecord' or
'Alignment' objects, we get BLAST record objects.

To be able to handle the situation where the BLAST file may be huge,
containing thousands of results, 'NCBIXML.parse()' returns an iterator.
In plain English, an iterator allows you to step through the BLAST
output, retrieving BLAST records one by one for each BLAST search:
<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records
Or, you can use a 'for'-loop:

<<>>> for blast_record in blast_records:
...     # Do something with blast_record
Note though that you can step through the BLAST records only once.
Usually, from each BLAST record you would save the information that you
are interested in. If you want to save all returned BLAST records, you
can convert the iterator into a list:

<<>>> blast_records = list(blast_records)

Now you can access each BLAST record in the list with an index as
usual. If your BLAST file is huge though, you may run into memory
problems trying to save them all in a list.
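This one-pass behaviour is simply how Python iterators work; a small stand-in generator (nothing to do with BLAST itself) shows it:

```python
def fake_blast_records():
    # Stand-in for NCBIXML.parse(): yields one "record" per query.
    for name in ["record1", "record2", "record3"]:
        yield name

records = fake_blast_records()
first = next(records)       # take the first record
remaining = list(records)   # consumes everything that is left
print(first)                # record1
print(remaining)            # ['record2', 'record3']
print(list(records))        # [] -- the iterator is now exhausted
```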
Usually, you'll be running one BLAST search at a time. Then, all you
need to do is to pick up the first (and only) BLAST record in
'blast_records':

<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = blast_records.next()

or more elegantly:

<<>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
I guess by now you're wondering what is in a BLAST record.

7.5 The BLAST record class
*=*=*=*=*=*=*=*=*=*=*=*=*=*
A BLAST Record contains everything you might ever want to extract from
the BLAST output. Right now we'll just show an example of how to get
some info out of the BLAST report, but if you want something in
particular that is not described here, look at the info on the record
class in detail, and take a gander into the code or automatically
generated documentation -- the docstrings have lots of good info about
what is stored in each piece of information.

To continue with our example, let's just print out some summary info
about all hits in our blast report with an e-value better than a
particular threshold. The following code does this:
<<>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print '****Alignment****'
...             print 'sequence:', alignment.title
...             print 'length:', alignment.length
...             print 'e value:', hsp.expect
...             print hsp.query[0:75] + '...'
...             print hsp.match[0:75] + '...'
...             print hsp.sbjct[0:75] + '...'
This will print out summary reports like the following:

sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold
acclimation protein WCOR413-like protein alpha form mRNA, complete cds
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatat
||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | ||||||||
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatct
Basically, you can do anything you want to with the info in the BLAST
report once you have parsed it. This will, of course, depend on what you
want to use it for, but hopefully this helps you get started on doing
what you need to do!
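For instance, the threshold loop above boils down to a nested filter. Here it is restated with small stand-in classes (hypothetical data, not the real 'Bio.Blast.Record' API, though the attribute names follow the tutorial's usage) so the shape of the logic is easy to see:

```python
# Stand-in classes mirroring the alignment/hsp attributes used above.
class Hsp:
    def __init__(self, expect):
        self.expect = expect

class Alignment:
    def __init__(self, title, length, hsps):
        self.title, self.length, self.hsps = title, length, hsps

E_VALUE_THRESH = 0.04
alignments = [
    Alignment("gb|XX000001| example hit", 783, [Hsp(1e-80), Hsp(0.5)]),
    Alignment("gb|XX000002| weak hit", 120, [Hsp(2.1)]),
]

# Keep only the HSPs whose e-value beats the threshold.
hits = [(a.title, h.expect)
        for a in alignments
        for h in a.hsps
        if h.expect < E_VALUE_THRESH]
print(hits)  # [('gb|XX000001| example hit', 1e-80)]
```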
An important consideration for extracting information from a BLAST
report is the type of objects that the information is stored in. In
Biopython, the parsers return 'Record' objects, either 'Blast' or
'PSIBlast' depending on what you are parsing. These objects are defined
in 'Bio.Blast.Record' and are quite complete.
Here are my attempts at UML class diagrams for the 'Blast' and
'PSIBlast' record classes. If you are good at UML and see
mistakes/improvements that can be made, please let me know. The Blast
class diagram is shown in Figure 7.5.

*images/BlastRecord.png*

The PSIBlast record object is similar, but has support for the rounds
that are used in the iteration steps of PSIBlast. The class diagram for
PSIBlast is shown in Figure 7.5.

*images/PSIBlastRecord.png*
7.6 Deprecated BLAST parsers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

Older versions of Biopython had parsers for BLAST output in plain text
or HTML format. Over the years, we discovered that it is very hard to
maintain these parsers in working order. Basically, any small change to
the BLAST output in newly released BLAST versions tends to cause the
plain text and HTML parsers to break. We therefore recommend parsing
BLAST output in XML format, as described in Section 7.4.

The HTML parser in 'Bio.Blast.NCBIWWW' has been officially deprecated
and will issue warnings if you try and use it. We plan to remove this
completely in a few releases' time.

Our plain text BLAST parser works a bit better, but use it at your own
risk. It may or may not work, depending on which BLAST versions or
programs you're using.
7.6.1 Parsing plain-text BLAST output
======================================

The plain text BLAST parser is located in 'Bio.Blast.NCBIStandalone'.

As with the XML parser, we need to have a handle object that we can
pass to the parser. The handle must implement the 'readline()' method
and do this properly. The common ways to get such a handle are to either
use the provided 'blastall' or 'blastpgp' functions to run the local
blast, or to run a local blast via the command line, and then do
something like the following:

<<>>> result_handle = open("my_file_of_blast_output.txt")
Well, now that we've got a handle (which we'll call 'result_handle'),
we are ready to parse it. This can be done with the following code:

<<>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)

This will parse the BLAST report into a Blast Record class (either a
Blast or a PSIBlast record, depending on what you are parsing) so that
you can extract the information from it. In our case, let's just print
out a quick summary of all of the alignments with an e-value below some
threshold:

<<>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print '****Alignment****'
...             print 'sequence:', alignment.title
...             print 'length:', alignment.length
...             print 'e value:', hsp.expect
...             print hsp.query[0:75] + '...'
...             print hsp.match[0:75] + '...'
...             print hsp.sbjct[0:75] + '...'
5037
If you also read the section 7.4 on parsing BLAST XML output, you'll
5038
notice that the above code is identical to what is found in that
5039
section. Once you parse something into a record class you can deal with
5040
it independent of the format of the original BLAST info you were
5041
parsing. Pretty snazzy!
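Because the record class hides the original format, the same summary logic works on records from either the plain-text or the XML parser. Here is a minimal sketch of that idea as a reusable function, exercised on hypothetical stand-in objects (a live BLAST report isn't available in this document; real records come from the parsers above):

```python
E_VALUE_THRESH = 0.04

def summarize(blast_record, thresh=E_VALUE_THRESH):
    """Collect (title, length, expect) for each HSP below the threshold."""
    hits = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < thresh:
                hits.append((alignment.title, alignment.length, hsp.expect))
    return hits

# Stand-in classes mimicking just the attributes used above; real
# records come from NCBIStandalone.BlastParser or NCBIXML instead.
class HSP(object):
    def __init__(self, expect):
        self.expect = expect

class Alignment(object):
    def __init__(self, title, length, hsps):
        self.title, self.length, self.hsps = title, length, hsps

class Record(object):
    def __init__(self, alignments):
        self.alignments = alignments

record = Record([Alignment("seqA", 120, [HSP(0.001), HSP(0.5)]),
                 Alignment("seqB", 80, [HSP(0.03)])])
assert summarize(record) == [("seqA", 120, 0.001), ("seqB", 80, 0.03)]
```

The function never asks which parser produced the record, only that it has the 'alignments' and 'hsps' attributes -- which is exactly the point of parsing into a record class.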
Sure, parsing one record is great, but I've got a BLAST file with tons
of records -- how can I parse them all? Well, fear not, the answer lies
in the very next section.
7.6.2 Parsing a plain-text BLAST file full of BLAST runs
=========================================================

We can do this using the blast iterator. To set up an iterator, we
first set up a parser, to parse our blast reports into Blast Record
objects:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()

Then we will assume we have a handle to a bunch of blast records,
which we'll call 'result_handle'. Getting a handle is described in full
detail above in the blast parsing sections.

Now that we've got a parser and a handle, we are ready to set up the
iterator with the following command:

>>> blast_iterator = NCBIStandalone.Iterator(result_handle, blast_parser)

The second option, the parser, is optional. If we don't supply a
parser, then the iterator will just return the raw BLAST reports one at
a time.

Now that we've got an iterator, we start retrieving blast records
(generated by our parser) using 'next()':

>>> blast_record = blast_iterator.next()

Each call to next will return a new record that we can deal with. Now
we can iterate through these records and generate our old favorite, a
nice little blast report:

>>> for blast_record in blast_iterator:
...     E_VALUE_THRESH = 0.04
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_VALUE_THRESH:
...                 print '****Alignment****'
...                 print 'sequence:', alignment.title
...                 print 'length:', alignment.length
...                 print 'e value:', hsp.expect
...                 if len(hsp.query) > 75:
...                     dots = '...'
...                 else:
...                     dots = ''
...                 print hsp.query[0:75] + dots
...                 print hsp.match[0:75] + dots
...                 print hsp.sbjct[0:75] + dots

The iterator allows you to deal with huge blast files without any
memory problems, since things are read in one at a time. I have parsed
tremendously huge files without any problems using this.
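The memory point above is just lazy iteration: read one report's worth of lines, hand it over, and only then read the next. A rough sketch of that splitting idea in plain Python (illustrative only -- 'NCBIStandalone.Iterator' does this for real BLAST output; the 'marker' argument is whatever line starts each report):

```python
import io

def reports(handle, marker="BLASTN"):
    """Lazily yield one raw report at a time from a concatenated file."""
    chunk = []
    for line in handle:
        if line.startswith(marker) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)

# Two toy "reports" in one stream; only one is held in memory at a time.
data = "BLASTN 2.2.18\nhit one\nBLASTN 2.2.18\nhit two\n"
parts = list(reports(io.StringIO(data)))
assert len(parts) == 2
assert parts[1] == "BLASTN 2.2.18\nhit two\n"
```

Since `reports` is a generator, a file with thousands of runs costs no more memory than a file with one.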
7.6.3 Finding a bad record somewhere in a huge plain-text BLAST file
=====================================================================

One really ugly problem that happens to me is that I'll be parsing a
huge blast file for a while, and the parser will bomb out with a
ValueError. This is a serious problem, since you can't tell if the
ValueError is due to a parser problem, or a problem with the BLAST. To
make it even worse, you have no idea where the parse failed, so you
can't just ignore the error, since this could be ignoring an important
data point.

We used to have to make a little script to get around this problem,
but the 'Bio.Blast' module now includes a 'BlastErrorParser' which
really helps make this easier. The 'BlastErrorParser' works very
similarly to the regular 'BlastParser', but it adds an extra layer of
work by catching ValueErrors that are generated by the parser, and
attempting to diagnose the errors.

Let's take a look at using this parser -- first we define the file we
are going to parse and the file to write the problem reports to:

>>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out")
>>> error_file = os.path.join(os.getcwd(), "blast_out",
...                           "big_blast.problems")

Now we want to get a 'BlastErrorParser':

>>> from Bio.Blast import NCBIStandalone
>>> error_handle = open(error_file, "w")
>>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)

Notice that the parser takes an optional argument of a handle. If a
handle is passed, then the parser will write any blast records which
generate a ValueError to this handle. Otherwise, these records will not
be recorded.

Now we can use the 'BlastErrorParser' just like a regular blast
parser. Specifically, we might want to make an iterator that goes
through our blast records one at a time and parses them with the error
parser:

>>> result_handle = open(blast_file)
>>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)

We can read these records one at a time, but now we can catch and deal
with errors that are due to problems with Blast (and not with the parser
itself):

>>> try:
...     next_record = iterator.next()
... except NCBIStandalone.LowQualityBlastError, info:
...     print "LowQualityBlastError detected in id %s" % info[1]

The '.next()' method is normally called indirectly via a 'for'-loop.
Right now the 'BlastErrorParser' can generate the following errors:

- 'ValueError' -- This is the same error generated by the regular
  BlastParser, and is due to the parser not being able to parse a
  specific file. This is normally either due to a bug in the parser, or
  some kind of discrepancy between the version of BLAST you are using
  and the versions the parser is able to handle.

- 'LowQualityBlastError' -- When BLASTing a sequence that is of
  really bad quality (for example, a short sequence that is basically a
  stretch of one nucleotide), it seems that Blast ends up masking out
  the entire sequence and ending up with nothing to parse. In this case
  it will produce a truncated report that causes the parser to generate
  a ValueError. 'LowQualityBlastError' is reported in these cases. This
  error returns an info item with the following information:

  - 'item[0]' -- The error message.
  - 'item[1]' -- The id of the input record that caused the error.
    This is really useful if you want to record all of the records
    that are causing problems.

As mentioned, with each error generated, the BlastErrorParser will
write the offending record to the specified 'error_handle'. You can then
go ahead and look at these and deal with them as you see fit. Either
you will be able to debug the parser with a single blast report, or will
find out problems in your blast runs. Either way, it will definitely be
a useful experience!

Hopefully the 'BlastErrorParser' will make it much easier to debug and
deal with large Blast files.
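The pattern the 'BlastErrorParser' enables -- keep going after a failure and record which input caused it -- is worth spelling out on its own. A minimal sketch with a toy parser standing in for the real one (the names 'parse_chunks' and 'toy_parse' are hypothetical, not Biopython API):

```python
def parse_chunks(chunks, parse, errors):
    """Parse each raw report, recording failures instead of aborting.
    'parse' stands in for a parser's parse method (hypothetical here)."""
    good = []
    for index, chunk in enumerate(chunks):
        try:
            good.append(parse(chunk))
        except ValueError as err:
            errors.append((index, str(err)))
    return good

def toy_parse(text):
    # Pretend truncated reports make the parser raise ValueError.
    if "garbled" in text:
        raise ValueError("truncated report")
    return text.upper()

errs = []
out = parse_chunks(["ok one", "garbled", "ok two"], toy_parse, errs)
assert out == ["OK ONE", "OK TWO"]
assert errs == [(1, "truncated report")]
```

Afterwards 'errs' tells you exactly which records failed and why, which is the information a bare ValueError from a thousand-record run doesn't give you.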
7.7 Dealing with PSI-BLAST
*=*=*=*=*=*=*=*=*=*=*=*=*=*

You can run the standalone version of PSI-BLAST (the command line tool
'blastpgp') using the 'blastpgp' function in the
'Bio.Blast.NCBIStandalone' module. At the time of writing, the NCBI do
not appear to support tools running a PSI-BLAST search via the internet.

Note that the 'Bio.Blast.NCBIXML' parser can read the XML output from
current versions of PSI-BLAST, but information like which sequences in
each iteration are new or reused isn't present in the XML file. If you
care about this information you may have more joy with the plain text
output and the 'PSIBlastParser' in 'Bio.Blast.NCBIStandalone'.

7.8 Dealing with RPS-BLAST
*=*=*=*=*=*=*=*=*=*=*=*=*=*

You can run the standalone version of RPS-BLAST (the command line tool
'rpsblast') using the 'rpsblast' function in the
'Bio.Blast.NCBIStandalone' module. At the time of writing, the NCBI do
not appear to support tools running an RPS-BLAST search via the
internet.

You can use the 'Bio.Blast.NCBIXML' parser to read the XML output from
current versions of RPS-BLAST.

-----------------------------------

(1) examples/m_cold.fasta

(2) http://biopython.org/DIST/docs/tutorial/examples/m_cold.fasta
Chapter 8 Accessing NCBI's Entrez databases
**********************************************

Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system
that provides users access to NCBI's databases such as PubMed, GenBank,
GEO, and many others. You can access Entrez from a web browser to
manually enter queries, or you can use Biopython's 'Bio.Entrez' module
for programmatic access to Entrez. The latter allows you for example to
search PubMed or download GenBank records from within a Python script.

The 'Bio.Entrez' module makes use of the Entrez Programming Utilities
(also known as EUtils), consisting of eight tools that are described in
detail on NCBI's page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each
of these tools corresponds to one Python function in the 'Bio.Entrez'
module, as described in the sections below. This module makes sure that
the correct URL is used for the queries, and that no more than three
requests are made every second, as required by NCBI.

The output returned by the Entrez Programming Utilities is typically
in XML format. To parse such output, you have several options:

1. Use 'Bio.Entrez''s parser to parse the XML output into a Python
   object;
2. Use the DOM (Document Object Model) parser in Python's standard
   library;
3. Use the SAX (Simple API for XML) parser in Python's standard
   library;
4. Read the XML output as raw text, and parse it by string searching
   and manipulation.

For the DOM and SAX parsers, see the Python documentation. The parser
in 'Bio.Entrez' is discussed below.

NCBI uses DTD (Document Type Definition) files to describe the
structure of the information contained in XML files. Most of the DTD
files used by NCBI are included in the Biopython distribution. The
'Bio.Entrez' parser makes use of the DTD files when parsing an XML file
returned by NCBI Entrez.

Occasionally, you may find that the DTD file associated with a
specific XML file is missing from the Biopython distribution. In
particular, this may happen when NCBI updates its DTD files. If this
happens, 'Entrez.read' will give an error message showing which DTD file
is missing. You can download the DTD file from NCBI; most can be found
at http://www.ncbi.nlm.nih.gov/dtd/ or
http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/. After downloading, the
DTD file should be stored in the directory
'...site-packages/Bio/Entrez/DTDs', containing the other DTD files.
Alternatively, if you installed Biopython from source, you can add the
DTD file to the source code's 'Bio/Entrez/DTDs' directory, and reinstall
Biopython. This will install the new DTD file in the correct location
together with the other DTD files.

The Entrez Programming Utilities can also generate output in other
formats, such as the Fasta or GenBank file formats for sequence
databases, or the MedLine format for the literature database, discussed
below.
5274
8.1 Entrez Guidelines
5275
*=*=*=*=*=*=*=*=*=*=*=
5277
Before using Biopython to access the NCBI's online resources (via
5278
'Bio.Entrez' or some of the other modules), please read the NCBI's
5279
Entrez User Requirements (1). If the NCBI finds you are abusing their
5280
systems, they can and will ban your access!
5284
- For any series of more than 100 requests, do this at weekends or
5285
outside USA peak times. This is up to you to obey.
5286
- Use the http://eutils.ncbi.nlm.nih.gov address, not the standard
5287
NCBI Web address. Biopython uses this web address.
5288
- Make no more than three requests every seconds (relaxed from at
5289
most one request every three seconds in early 2009). This is
5290
automatically enforced by Biopython.
5291
- Use the optional email parameter so the NCBI can contact you if
5292
there is a problem. You can either explicitly set the email address
5293
as a parameter with each call to Entrez (e.g., include email =
5294
"A.N.Other@example.com" in the argument list), or as of Biopython
5295
1.48, you can set a global email address:
5296
<<>>> from Bio import Entrez
5297
>>> Entrez.email = "A.N.Other@example.com"
5299
Bio.Entrez will then use this email address with each call to Entrez.
5300
The example.com address is a reserved domain name specifically for
5301
documentation (RFC 2606). Please DO NOT use a random email -- it's
5302
better not to give an email at all.
5303
- If you are using Biopython within some larger software suite, use
5304
the tool parameter to specify this. The tool parameter will default
5306
- For large queries, the NCBI also recommend using their session
5307
history feature (the WebEnv session cookie string, see Section 8.13).
5308
This is only slightly more complicated.
5310
In conclusion, be sensible with your usage levels. If you plan to
5311
download lots of data, consider other options. For example, if you want
5312
easy access to all the human genes, consider fetching each chromosome by
5313
FTP as a GenBank file, and importing these into your own BioSQL database
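The "no more than three requests every second" rule above is simple to enforce on the client side: remember when the last request went out, and sleep if the next one comes too soon. A sketch of that idea (Biopython does this for you internally; the class and interval below are illustrative, not Biopython's own API):

```python
import time

class Throttle(object):
    """Client-side rate limiting sketch (illustrative, not Biopython API)."""

    def __init__(self, min_interval=1.0 / 3):
        self.min_interval = min_interval  # three requests per second
        self.last = None

    def wait(self):
        # Sleep just long enough to keep successive calls apart.
        if self.last is not None:
            delay = self.min_interval - (time.time() - self.last)
            if delay > 0:
                time.sleep(delay)
        self.last = time.time()

# A short interval keeps the demonstration fast.
throttle = Throttle(min_interval=0.05)
start = time.time()
for _ in range(3):
    throttle.wait()
elapsed = time.time() - start
assert elapsed >= 0.1  # at least two enforced gaps between three calls
```

Call 'wait()' immediately before each request and the spacing rule is satisfied no matter how fast the surrounding loop runs.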
8.2 EInfo: Obtaining information about the Entrez databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

EInfo provides field index term counts, last update, and available
links for each of NCBI's databases. In addition, you can use EInfo to
obtain a list of all database names accessible through the Entrez
utilities:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.einfo()
>>> result = handle.read()

The variable 'result' now contains a list of databases in XML format:

<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
<DbName>gap</DbName>
<DbName>domains</DbName>
<DbName>gene</DbName>
<DbName>genomeprj</DbName>
<DbName>gensat</DbName>
<DbName>geo</DbName>
<DbName>gds</DbName>
<DbName>homologene</DbName>
<DbName>journals</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omia</DbName>
<DbName>omim</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>snp</DbName>
<DbName>taxonomy</DbName>
<DbName>toolkit</DbName>
<DbName>unigene</DbName>
<DbName>unists</DbName>
</DbList>
</eInfoResult>

Since this is a fairly simple XML file, we could extract the
information it contains simply by string searching. Using 'Bio.Entrez''s
parser instead, we can directly parse this XML file into a Python
object:

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)

Now 'record' is a dictionary with exactly one key:

>>> record.keys()
[u'DbList']

The value stored in this key is the list of database names shown in
the XML above:

>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
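As noted above, this XML is simple enough to handle with Python's standard library as well (option 2 from the chapter introduction). A sketch using 'xml.etree.ElementTree' on a trimmed copy of the output (DOCTYPE removed and the list shortened for brevity):

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the EInfo output above.
xml_text = """<eInfoResult><DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
</DbList></eInfoResult>"""

root = ET.fromstring(xml_text)
# Iterating over the <DbList> element yields its <DbName> children.
names = [element.text for element in root.find("DbList")]
assert names == ["pubmed", "protein", "nucleotide"]
```

This gives the same list of names as 'record["DbList"]' above, but without the DTD-aware conveniences of the 'Bio.Entrez' parser.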
For each of these databases, we can use EInfo again to obtain more
information:

>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
>>> record["DbInfo"]["LastUpdate"]

Try 'record["DbInfo"].keys()' for other information stored in this
record. One of the most useful is a list of possible search fields for
use with ESearch:

>>> for field in record["DbInfo"]["FieldList"]:
...     print "%(Name)s, %(FullName)s, %(Description)s" % field
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
WORD, Text Word, Free text associated with publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
...

That's a long list, but indirectly this tells you that for the PubMed
database, you can do things like Jones[AUTH] to search the author field,
or Sanger[AFFL] to restrict to authors at the Sanger Centre. This can be
very handy - especially if you are not so familiar with a particular
database.
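Fielded terms like Jones[AUTH] are plain strings, so building them programmatically is trivial. A small helper (hypothetical, not part of Biopython) that composes them in the same shape used by the searches in this chapter:

```python
def fielded(term, field):
    """Restrict a search term to one Entrez field, e.g. Jones[AUTH]."""
    return "%s[%s]" % (term, field)

# Combine fielded terms with AND, as in the GenBank search in Section 8.3:
query = " AND ".join([fielded("Cypripedioideae", "Orgn"),
                      fielded("matK", "Gene")])
assert fielded("Jones", "AUTH") == "Jones[AUTH]"
assert query == "Cypripedioideae[Orgn] AND matK[Gene]"
```

The resulting string is what you would pass as the 'term' argument to 'Bio.Entrez.esearch()'.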
8.3 ESearch: Searching the Entrez databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

To search any of these databases, we use 'Bio.Entrez.esearch()'. For
example, let's search in PubMed for publications related to Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861',
 '14630660', '12230038']

In this output, you see seven PubMed IDs (including 19304878 which is
the PMID for the Biopython application note), which can be retrieved by
EFetch (see section 8.6).

You can also use ESearch to search GenBank. Here we'll do a quick
search for the matK gene in Cypripedioideae orchids (see Section 8.2
about EInfo for one way to find out which fields you can search in each
Entrez database):

>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']

Each of the IDs (126789333, 37222967, 37222966, ...) is a GenBank
identifier. See section 8.6 for information on how to actually download
these GenBank records.

Note that instead of a species name like Cypripedioideae[Orgn], you
can restrict the search using an NCBI taxon identifier, here this would
be txid158330[Orgn]. This isn't currently documented on the ESearch help
page - the NCBI explained this in reply to an email query. You can often
deduce the search term formatting by playing with the Entrez web
interface. For example, including complete[prop] in a genome search
restricts to just completed genomes.

As a final example, let's get a list of computational journal titles:

>>> handle = Entrez.esearch(db="journals", term="computational")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['30367', '33843', '33823', '32989', '33190', '33009', '31986',
 '34502', '8799', '22857', '32675', '20258', '33859', '32534', ...]

Again, we could use EFetch to obtain more information for each of
these journal IDs.

ESearch has many useful options --- see the ESearch help page (2) for
more information.
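A search can return many more IDs than you want to download in one go, and EFetch takes its IDs as a single comma-separated string. A small helper (hypothetical, not part of Biopython) that splits a long ID list into comma-joined batches, one per EFetch call:

```python
def batches(id_list, size):
    """Split a long ID list into comma-joined batches, one per EFetch
    call. Batching like this is kinder to the NCBI servers; the batch
    size is an arbitrary choice."""
    for start in range(0, len(id_list), size):
        yield ",".join(id_list[start:start + size])

# The seven PubMed IDs from the biopython search above:
ids = ["19304878", "18606172", "16403221", "16377612", "14871861",
       "14630660", "12230038"]
assert list(batches(ids, 3)) == ["19304878,18606172,16403221",
                                 "16377612,14871861,14630660",
                                 "12230038"]
```

Each yielded string is ready to use as the 'id' argument of 'Bio.Entrez.efetch()' (though for genuinely long lists, EPost and the history feature in the next sections are the recommended route).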
8.4 EPost: Uploading a list of identifiers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EPost uploads a list of UIs for use in subsequent search strategies;
see the EPost help page (3) for more information. It is available from
Biopython through the 'Bio.Entrez.epost()' function.

To give an example of when this is useful, suppose you have a long
list of IDs you want to download using EFetch (maybe sequences, maybe
citations -- anything). When you make a request with EFetch your list of
IDs, the database etc, are all turned into a long URL sent to the
server. If your list of IDs is long, this URL gets long, and long URLs
can break (e.g. some proxies don't cope well).

Instead, you can break this up into two steps, first uploading the
list of IDs using EPost (this uses an "HTTP post" internally, rather
than an "HTTP get", getting round the long URL problem). With the
history support, you can then refer to this long list of IDs, and
download the associated data with EFetch.

Let's look at a simple example to see how EPost works -- uploading
some PubMed identifiers:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> id_list = ["19304878", "18606172", "16403221", "16377612",
...            "14871861", "14630660"]
>>> print Entrez.epost("pubmed", id=",".join(id_list)).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
<QueryKey>1</QueryKey>
<WebEnv>NCID_01_206841095_130.14.22.101_9001_1242061629</WebEnv>
</ePostResult>

The returned XML includes two important strings, 'QueryKey' and
'WebEnv' which together define your history session. You would extract
these values for use with another Entrez call such as EFetch:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
id_list = ["19304878", "18606172", "16403221", "16377612", "14871861",
           "14630660"]
search_results = Entrez.read(Entrez.epost("pubmed", id=",".join(id_list)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

Section 8.13 shows how to use the history feature.
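Once 'webenv' and 'query_key' are extracted, later EUtils calls refer to the uploaded list instead of resending every ID. A sketch of how those values plug into a follow-up call (the 'history_args' helper is hypothetical, and no request is made here; Section 8.13 shows the real usage):

```python
def history_args(webenv, query_key, retstart=0, retmax=100):
    """Keyword arguments for a follow-up call such as
    Entrez.efetch(db="pubmed", rettype="medline", retmode="text", **args).
    'retstart' and 'retmax' control paging through the uploaded list."""
    return {"webenv": webenv, "query_key": query_key,
            "retstart": retstart, "retmax": retmax}

# Values as returned by the EPost example above:
args = history_args("NCID_01_206841095_130.14.22.101_9001_1242061629", "1")
assert args["query_key"] == "1"
assert args["retmax"] == 100
```

However many IDs were uploaded, the follow-up request stays the same small size -- which is the whole point of EPost.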
8.5 ESummary: Retrieving summaries from primary IDs
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ESummary retrieves document summaries from a list of primary IDs (see
the ESummary help page (4) for more information). In Biopython, ESummary
is available as 'Bio.Entrez.esummary()'. Using the search result above,
we can for example find out more about the journal with ID 30367:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> record = Entrez.read(handle)
>>> record[0]["Title"]
'Computational biology and chemistry'
>>> record[0]["Publisher"]
8.6 EFetch: Downloading full records from Entrez
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EFetch is what you use when you want to retrieve a full record from
Entrez. This covers several possible databases, as described on the main
EFetch Help page (5).

From the Cypripedioideae example above, we can download GenBank record
186972394 using 'Bio.Entrez.efetch' (see the documentation on EFetch for
Sequence and other Molecular Biology Databases (6)):

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
>>> print handle.read()
LOCUS       EU490707     1302 bp    DNA     linear   PLN
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial
            cds; chloroplast.
VERSION     EU490707.1  GI:186972394
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
            Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida;
            Asparagales; Orchidaceae; Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Phylogenetic utility of ycf1 in orchids
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-FEB-2008) Department of Botany, University of
            Florida, 220 Bartram Hall, Gainesville, FL 32611-8526, USA
FEATURES             Location/Qualifiers
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
                     /product="maturase K"
                     /protein_id="ACC99456.1"
                     /db_xref="GI:186972395"
                     /translation="IFYEPVEIFGYDNKSSLVLVKRLITRMYQQNFLISSVNDSNQKG
                     FWGHKHFFSSHFSSQMVSEGFGVILEIPFSSQLVSSLEEKKIPKYQNLRSIHSIFPFL
                     EDKFLHLNYVSDLLIPHPIHLEILVQILQCRIKDVPSLHLLRLLFHEYHNLNSLITSK
                     KFIYAFSKRKKRFLWLLYNSYVYECEYLFQFLRKQSSYLRSTSSGVFLERTHLYVKIE
                     HLLVVCCNSFQRILCFLKDPFMHYVRYQGKAILASKGTLILMKKWKFHLVNFWQSYFH
                     FWSQPYRIHIKQLSNYSFSFLGYFSSVLENHLVVRNQMLENSFIINLLTKKFDTIAPV
                     ISLIGSLSKAQFCTVLGHPISKPIWTDFSDSDILDRFCRICRNLCRYHSGSSKKQVLY
                     RIKYILRLSCARTLARKHKSTVRTFMRRLGSGLLEEFFMEEE"
ORIGIN
        1 attttttacg aacctgtgga aatttttggt tatgacaata aatctagttt
       61 aaacgtttaa ttactcgaat gtatcaacag aattttttga tttcttcggt
      121 aaccaaaaag gattttgggg gcacaagcat tttttttctt ctcatttttc
      181 gtatcagaag gttttggagt cattctggaa attccattct cgtcgcaatt
      241 cttgaagaaa aaaaaatacc aaaatatcag aatttacgat ctattcattc
      301 tttttagaag acaaattttt acatttgaat tatgtgtcag atctactaat
      361 atccatctgg aaatcttggt tcaaatcctt caatgccgga tcaaggatgt
      421 catttattgc gattgctttt ccacgaatat cataatttga atagtctcat
      481 aaattcattt acgccttttc aaaaagaaag aaaagattcc tttggttact
      541 tatgtatatg aatgcgaata tctattccag tttcttcgta aacagtcttc
      601 tcaacatctt ctggagtctt tcttgagcga acacatttat atgtaaaaat
      661 ctagtagtgt gttgtaattc ttttcagagg atcctatgct ttctcaagga
      721 cattatgttc gatatcaagg aaaagcaatt ctggcttcaa agggaactct
      781 aagaaatgga aatttcatct tgtgaatttt tggcaatctt attttcactt
      841 ccgtatagga ttcatataaa gcaattatcc aactattcct tctcttttct
      901 tcaagtgtac tagaaaatca tttggtagta agaaatcaaa tgctagagaa
      961 ataaatcttc tgactaagaa attcgatacc atagccccag ttatttctct
     1021 ttgtcgaaag ctcaattttg tactgtattg ggtcatccta ttagtaaacc
     1081 gatttctcgg attctgatat tcttgatcga ttttgccgga tatgtagaaa
     1141 tatcacagcg gatcctcaaa aaaacaggtt ttgtatcgta taaaatatat
     1201 tcgtgtgcta gaactttggc acggaaacat aaaagtacag tacgcacttt
     1261 ttaggttcgg gattattaga agaattcttt atggaagaag aa
//

The argument 'rettype="gb"' lets us download this record in the
GenBank format. Note that until Easter 2009, the Entrez EFetch API let
you use "genbank" as the return type, however the NCBI now insist on
using the official return types of "gb" or "gbwithparts" (or "gp" for
proteins) as described online.

Alternatively, you could for example use 'rettype="fasta"' to get the
Fasta-format; see the EFetch Sequences Help page (7) for other options.
The available formats depend on which database you are downloading from
- see the main EFetch Help page (8).
If you fetch the record in one of the formats accepted by 'Bio.SeqIO'
(see Chapter 5), you could directly parse it into a 'SeqRecord':

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
...
Description: Selenipedium aequinoctiale maturase K (matK) gene,
partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA',
IUPACAmbiguousDNA())
Note that a more typical use would be to save the sequence data to a
local file, and then parse it with 'Bio.SeqIO'. This can save you having
to re-download the same file repeatedly while working on your script,
and in particular places less load on the NCBI's servers. For example:

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename):
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()

record = SeqIO.read(open(filename), "genbank")
To get the output in XML format, which you can parse using the
'Bio.Entrez.read()' function, use 'retmode="xml"':

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", retmode="xml")
>>> record = Entrez.read(handle)
>>> record[0]["GBSeq_definition"]
'Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Selenipedium aequinoctiale'

If you want to perform a search with 'Bio.Entrez.esearch()', and then
download the records with 'Bio.Entrez.efetch()', you should use the
WebEnv history feature -- see Section 8.13.
8.7 ELink: Searching for related items in NCBI Entrez
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ELink, available from Biopython as 'Bio.Entrez.elink()', can be used
to find related items in the NCBI Entrez databases. For example, let's
try to find articles related to the Biopython application note published
in Bioinformatics in 2009. The PubMed ID of this article is 19304878.
Now we use 'Bio.Entrez.elink' to find all items related to this article:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "19304878"
>>> handle = Entrez.elink(dbfrom="pubmed", id=pmid)
>>> record = Entrez.read(handle)

The 'record' variable consists of a Python list, one item for each
database in which we searched. Since we specified only one PubMed ID to
search for, 'record' contains only one item. This item is a dictionary
containing information about our search term, as well as all the related
items that were found:

>>> record[0]["DbFrom"]
'pubmed'
>>> record[0]["IdList"]
['19304878']

The '"LinkSetDb"' key contains the search results, stored as a list
consisting of one item for each target database. In our search results,
we only find hits in the PubMed database (although sub-divided into
categories):

>>> len(record[0]["LinkSetDb"])
5
>>> for linksetdb in record[0]["LinkSetDb"]:
...     print linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"])
...
pubmed pubmed_pubmed 110
pubmed pubmed_pubmed_combined 6
pubmed pubmed_pubmed_five 6
pubmed pubmed_pubmed_reviews 5
pubmed pubmed_pubmed_reviews_five 5

The actual search results are stored under the '"Link"' key. In
total, 110 items were found under standard search. Let's now look at the
first search result:

>>> record[0]["LinkSetDb"][0]["Link"][0]

This is the article we searched for, which doesn't help us much, so
let's look at the second search result:

>>> record[0]["LinkSetDb"][0]["Link"][1]

This paper, with PubMed ID 17316423, is about the Biopython PDB
module.

We can use a loop to print out all PubMed IDs:

>>> for link in record[0]["LinkSetDb"][0]["Link"]:
...     print link["Id"]

For help on ELink, see the ELink help page (9).
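Once parsed, the nested LinkSetDb structure is ordinary Python lists and dictionaries, so it can be reshaped with ordinary code -- for example, grouping the related IDs by link name. The data below is a stand-in shaped like the parsed record above (the review ID is made up for illustration):

```python
# Stand-in data shaped like record[0]["LinkSetDb"] above.
linksets = [{"DbTo": "pubmed", "LinkName": "pubmed_pubmed",
             "Link": [{"Id": "19304878"}, {"Id": "17316423"}]},
            {"DbTo": "pubmed", "LinkName": "pubmed_pubmed_reviews",
             "Link": [{"Id": "11111111"}]}]

# Build {link_name: [ids]} so each category can be looked up directly.
related = dict((linkset["LinkName"],
                [link["Id"] for link in linkset["Link"]])
               for linkset in linksets)
assert related["pubmed_pubmed"] == ["19304878", "17316423"]
assert len(related["pubmed_pubmed_reviews"]) == 1
```

The same reshaping works on the real 'record[0]["LinkSetDb"]' returned by 'Entrez.read'.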
8.8 EGQuery: Obtaining counts for search terms
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EGQuery provides counts for a search term in each of the Entrez
databases. This is particularly useful to find out how many items your
search terms would find in each database without actually performing
lots of separate searches with ESearch (see the example in 8.12.2
below).

In this example, we use 'Bio.Entrez.egquery()' to obtain the counts
for "biopython":

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]: print row["DbName"], row["Count"]

See the EGQuery help page (10) for more information.

8.9 ESpell: Obtaining spelling suggestions
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

ESpell retrieves spelling suggestions. In this example, we use
'Bio.Entrez.espell()' to obtain the correct spelling of Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.espell(term="biopythooon")
>>> record = Entrez.read(handle)
>>> record["CorrectedQuery"]
'biopython'

See the ESpell help page (11) for more information.
5879
8.10 Specialized parsers
5880
*=*=*=*=*=*=*=*=*=*=*=*=*
The 'Bio.Entrez.read()' function can parse most (if not all) of the
XML output returned by Entrez. Entrez typically allows you to retrieve
records in other formats, which may have some advantages compared to the
XML format in terms of readability (or download size).
To request a specific file format from Entrez using
'Bio.Entrez.efetch()', you need to specify the 'rettype' and/or
'retmode' optional arguments. The different combinations are described
for each database type on the NCBI efetch webpage (12).
One obvious case is that you may prefer to download sequences in the
FASTA or GenBank/GenPept plain text formats (which can then be parsed
with 'Bio.SeqIO', see Sections 5.2.1 and 8.6). For the literature
databases, Biopython contains a parser for the 'MEDLINE' format used in
PubMed.
8.10.1 Parsing Medline records
===============================
You can find the Medline parser in 'Bio.Medline'. Suppose we want to
parse the file 'pubmed_result1.txt', containing one Medline record. You
can find this file in Biopython's 'Tests/Medline' directory. The file
looks like this:
<<IS  - 1467-5463 (Print)
TI  - The Bio* toolkits--a brief overview.
AB  - Bioinformatics research is often difficult to do with commercial
      software. The Open Source BioPerl, BioPython and Biojava projects
      provide ...
We first open the file and then parse it:
<<>>> from Bio import Medline
>>> input = open("pubmed_result1.txt")
>>> record = Medline.read(input)
The 'record' now contains the Medline record as a Python dictionary:
<<>>> record["PMID"]
'12230038'
>>> record["AB"]
'Bioinformatics research is often difficult to do with commercial
software. The Open Source BioPerl, BioPython and Biojava projects
provide toolkits with multiple functionality that make it easier to
create customised pipelines or analysis. This review briefly compares
the quirks of the underlying languages and the functionality,
documentation, utility and relative advantages of the Bio counterparts,
particularly from the point of view of the beginning biologist
programmer.'
The key names used in a Medline record can be rather obscure; use
'>>> help(record)' for a brief summary.
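To see why 'Bio.Medline' is convenient, it helps to know what the flat-file layout looks like: each line starts with a short key, a dash, and a value, with continuation lines indented. The following is only an illustrative stdlib sketch of that layout ('Bio.Medline' handles the real format, including repeated keys, far more carefully):

```python
# Illustrative only: a tiny parser for the "KEY - value" layout of a
# Medline flat-file record. Keys occupy the first four columns,
# followed by "- "; continuation lines are indented with spaces.
def parse_medline_fragment(text):
    record = {}
    key = None
    for line in text.splitlines():
        if len(line) > 5 and line[4:6] == "- ":
            key = line[:4].strip()
            record[key] = line[6:].strip()
        elif key and line.startswith("      "):
            # Continuation of the previous field (e.g. a long abstract).
            record[key] += " " + line.strip()
    return record

fragment = """PMID- 12230038
TI  - The Bio* toolkits--a brief overview.
AB  - Bioinformatics research is often difficult to do with
      commercial software."""
rec = parse_medline_fragment(fragment)
print(rec["TI"])
```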
To parse a file containing multiple Medline records, you can use the
'parse' function instead:
<<>>> from Bio import Medline
>>> input = open("pubmed_result2.txt")
>>> records = Medline.parse(input)
>>> for record in records:
...     print record["TI"]
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale
genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
Instead of parsing Medline records stored in files, you can also parse
Medline records downloaded by 'Bio.Entrez.efetch'. For example, let's
look at all Medline records in PubMed related to Biopython:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861',
'14630660', '12230038']
We now use 'Bio.Entrez.efetch' to download these Medline records:
<<>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
Here, we specify 'rettype="medline", retmode="text"' to obtain the
Medline records in plain-text Medline format. Now we use 'Bio.Medline'
to parse these records:
<<>>> from Bio import Medline
>>> records = Medline.parse(handle)
>>> for record in records:
...     print record["AU"]
['Cock PJ', 'Antao T', 'Chang JT', 'Chapman BA', 'Cox CJ', 'Dalke A',
...]
['Munteanu CR', 'Gonzalez-Diaz H', 'Magalhaes AL']
['Casbon JA', 'Crooks GE', 'Saqi MA']
['Pritchard L', 'White JA', 'Birch PR', 'Toth IK']
['de Hoon MJ', 'Imoto S', 'Nolan J', 'Miyano S']
['Hamelryck T', 'Manderick B']
For comparison, here we show an example using the XML format:
<<>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="xml")
>>> records = Entrez.read(handle)
>>> for record in records:
...     print record["MedlineCitation"]["Article"]["ArticleTitle"]
Biopython: freely available Python tools for computational molecular
biology and bioinformatics.
Enzymes/non-enzymes classification model complexity based on
composition, sequence, 3D and topological indices.
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale
genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
The Bio* toolkits--a brief overview.
Note that in both of these examples, for simplicity we have naively
combined ESearch and EFetch. In this situation, the NCBI would expect
you to use their history feature, as illustrated in Section 8.13.
8.10.2 Parsing GEO records
===========================
GEO (Gene Expression Omnibus (13)) is a data repository of
high-throughput gene expression and hybridization array data. The
'Bio.Geo' module can be used to parse GEO-formatted data.
The following code fragment shows how to parse the example GEO file
'GSE16.txt' into a record and print the record:
<<>>> from Bio import Geo
>>> handle = open("GSE16.txt")
>>> records = Geo.parse(handle)
>>> for record in records:
...     print record
You can search the "gds" database (GEO datasets) with ESearch:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="gds", term="GSE16")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['200000016', '100000028']
From the Entrez website, UID "200000016" is GDS16 while the other hit
"100000028" is for the associated platform, GPL28. Unfortunately, at the
time of writing the NCBI don't seem to support downloading GEO files
using Entrez (not as XML, nor in the Simple Omnibus Format in Text
(SOFT) format).
However, it is actually pretty straightforward to download the GEO
files by FTP from ftp://ftp.ncbi.nih.gov/pub/geo/ instead. In this case
you might want
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE16/GSE16_family.soft.gz
(a compressed file, see the Python module gzip).
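Since the GEO "family" files on the FTP site are gzip-compressed, Python's 'gzip' module can read them without decompressing to disk first. Here is a self-contained sketch that round-trips some SOFT-style sample text through an in-memory buffer instead of a real downloaded file:

```python
import gzip
import io

# Sample SOFT-style content, stands in for a downloaded GEO file.
data = b"^SERIES = GSE16\n"

# Compress it into an in-memory buffer, as if it were a .gz file.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)

# Reading it back is exactly how you'd read a real GSE16_family.soft.gz.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    text = f.read().decode()
print(text)
```

With a real file on disk you would simply pass the filename to 'gzip.open()' and hand the resulting handle to 'Geo.parse()'.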
8.10.3 Parsing UniGene records
===============================
UniGene is an NCBI database of the transcriptome, with each UniGene
record showing the set of transcripts that are associated with a
particular gene in a specific organism. A typical UniGene record looks
like this:
<<TITLE N-acetyltransferase 2 (arylamine N-acetyltransferase)
EXPRESS bone| connective tissue| intestine| liver| liver tumor|
normal| soft tissue/muscle tissue tumor| adult
STS ACC=PMC310725P3 UNISTS=272646
STS ACC=WIAF-2120 UNISTS=44576
STS ACC=G59899 UNISTS=137181
STS ACC=GDB:187676 UNISTS=155563
PROTSIM ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55;
PROTSIM ORG=9796; PROTGI=149742490; PROTID=XP_001487907.1;
PROTSIM ORG=9986; PROTGI=126722851; PROTID=NP_001075655.1;
PROTSIM ORG=9598; PROTGI=114619004; PROTID=XP_519631.2; PCT=98.28;
SEQUENCE ACC=BC067218.1; NID=g45501306; PID=g45501307; SEQTYPE=mRNA
SEQUENCE ACC=NM_000015.2; NID=g116295259; PID=g116295260;
SEQUENCE ACC=D90042.1; NID=g219415; PID=g219416; SEQTYPE=mRNA
SEQUENCE ACC=D90040.1; NID=g219411; PID=g219412; SEQTYPE=mRNA
SEQUENCE ACC=BC015878.1; NID=g16198419; PID=g16198420; SEQTYPE=mRNA
SEQUENCE ACC=CR407631.1; NID=g47115198; PID=g47115199; SEQTYPE=mRNA
SEQUENCE ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596;
END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214
SEQUENCE ACC=AU099534.1; NID=g13550663; CLONE=HSI08034; END=5';
LID=8800; SEQTYPE=EST
This particular record shows the set of transcripts (shown in the
'SEQUENCE' lines) that originate from the human gene NAT2, encoding an
N-acetyltransferase. The 'PROTSIM' lines show proteins with significant
similarity to NAT2, whereas the 'STS' lines show the corresponding
sequence-tagged sites in the genome.
To parse UniGene files, use the 'Bio.UniGene' module:
<<>>> from Bio import UniGene
>>> input = open("myunigenefile.data")
>>> record = UniGene.read(input)
The 'record' returned by 'UniGene.read' is a Python object with
attributes corresponding to the fields in the UniGene record. For
example, the attribute corresponding to the TITLE line contains:
<<"N-acetyltransferase 2 (arylamine N-acetyltransferase)"
The 'EXPRESS' and 'RESTR_EXPR' lines are stored as Python lists of
strings, for example:
<<['bone', 'connective tissue', 'intestine', 'liver', 'liver tumor',
'normal', 'soft tissue/muscle tissue tumor', 'adult']
Specialized objects are returned for the 'STS', 'PROTSIM', and
'SEQUENCE' lines, storing the keys shown in each line as attributes:
<<>>> record.sts[0].acc
'PMC310725P3'
>>> record.sts[0].unists
'272646'
and similarly for the 'PROTSIM' and 'SEQUENCE' lines.
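The attribute-per-key idea is easy to see offline: the 'STS', 'PROTSIM' and 'SEQUENCE' lines are all made of "KEY=value" pairs separated by semicolons or spaces. This is only an illustrative stdlib sketch of that line format, not the actual 'Bio.UniGene' implementation:

```python
# Illustrative only: split a UniGene field line into its label and a
# dictionary of lower-cased KEY=value pairs, mirroring how Bio.UniGene
# exposes them as object attributes.
def parse_fields(line):
    label, rest = line.split(None, 1)
    fields = {}
    for part in rest.replace(";", " ").split():
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.lower()] = value
    return label, fields

label, fields = parse_fields("STS ACC=PMC310725P3 UNISTS=272646")
print(label, fields["acc"], fields["unists"])
```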
To parse a file containing more than one UniGene record, use the
'parse' function in 'Bio.UniGene':
<<>>> from Bio import UniGene
>>> input = open("unigenerecords.data")
>>> records = UniGene.parse(input)
>>> for record in records:
...     print record.ID
8.11 Using a proxy
*=*=*=*=*=*=*=*=*=*
Normally you won't have to worry about using a proxy, but if this is
an issue on your network, here is how to deal with it. Internally,
'Bio.Entrez' uses the standard Python library 'urllib' for accessing the
NCBI servers. This will check an environment variable called
'http_proxy' to configure any simple proxy automatically. Unfortunately
this module does not support the use of proxies which require
authentication.
You may choose to set the 'http_proxy' environment variable once (how
you do this will depend on your operating system). Alternatively you can
set this within Python at the start of your script, for example:
<<import os
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"
See the urllib documentation (14) for more details.
8.12 Examples
*=*=*=*=*=*=*=
8.12.1 PubMed and Medline
==========================
If you are in the medical field or interested in human issues (and
many times even if you are not!), PubMed
(http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all
kinds of goodies. So, like other things, we'd like to be able to grab
information from it and use it in Python scripts.
In this example, we will query PubMed for all articles having to do
with orchids (see Section 2.3 for our motivation). We first check how
many such articles there are:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="orchid")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="pubmed":
...         print row["Count"]
463
Now we use the 'Bio.Entrez.esearch' function to download the PubMed
IDs of these 463 articles:
<<>>> handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
>>> record = Entrez.read(handle)
>>> idlist = record["IdList"]
This returns a Python list containing all of the PubMed IDs of
articles related to orchids:
<<['18680603', '18665331', '18661158', '18627489', '18627452',
'18594007', '18591784', '18589523', '18579475', '18575811',
...]
Now that we've got them, we obviously want to get the corresponding
Medline records and extract the information from them. Here, we'll
download the Medline records in the Medline flat-file format, and use
the 'Bio.Medline' module to parse them:
<<>>> from Bio import Medline
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",
retmode="text")
>>> records = Medline.parse(handle)
NOTE - We've just done a separate search and fetch here; the NCBI much
prefer you to take advantage of their history support in this situation.
Keep in mind that 'records' is an iterator, so you can iterate through
the records only once. If you want to save the records, you can convert
them to a list:
<<>>> records = list(records)
Let's now iterate over the records to print out some information about
each of them:
<<>>> for record in records:
...     print "title:", record["TI"]
...     if "AU" in record:
...         print "authors:", record["AU"]
...     print "source:", record["SO"]
The output for this looks like:
<<title: Sex pheromone mimicry in the early spider orchid (ophrys
sphegodes): patterns of hydrocarbons as the key mechanism for
pollination by deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C',
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is
returned as a standard Python list. This makes it easy to manipulate and
search using standard Python tools. For instance, we could loop through
a whole bunch of entries searching for a particular author with code
like:
<<>>> search_author = "Waits T"
>>> for record in records:
...     if not "AU" in record:
...         continue
...     if search_author in record["AU"]:
...         print "Author %s found: %s" % (search_author, record["SO"])
Hopefully this section gave you an idea of the power and flexibility
of the Entrez and Medline interfaces and how they can be used together.
8.12.2 Searching, downloading, and parsing Entrez Nucleotide records
=====================================================================
Here we'll show a simple example of performing a remote Entrez query.
In Section 2.3 of the parsing examples, we talked about using NCBI's
Entrez website to search the NCBI nucleotide databases for info on
Cypripedioideae, our friends the lady slipper orchids. Now, we'll look
at how to automate that process using a Python script. In this example,
we'll just show how to connect, get the results, and parse them, with
the Entrez module doing all of the work.
First, we use EGQuery to find out the number of results we will get
before actually downloading them. EGQuery will tell us how many search
results were found in each of the databases, but for this example we are
only interested in nucleotides:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="nuccore":
...         print row["Count"]
814
So, we expect to find 814 Entrez Nucleotide records (this is the
number I obtained in 2008; it is likely to increase in the future). If
you find some ridiculously high number of hits, you may want to
reconsider if you really want to download all of them, which is our next
step:
<<>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae",
retmax=814)
>>> record = Entrez.read(handle)
Here, 'record' is a Python dictionary containing the search results
and some auxiliary information. Just for information, let's look at what
is stored in this dictionary:
<<>>> print record.keys()
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart',
u'QueryTranslation']
First, let's check how many results were found:
<<>>> print record["Count"]
814
which is the number we expected. The 814 results are stored in
'record["IdList"]':
<<>>> print len(record["IdList"])
814
Let's look at the first five results:
<<>>> print record["IdList"][:5]
['187237168', '187372713', '187372690', '187372688', '187372686']
We can download these records using 'efetch'. While you could
download these records one by one, to reduce the load on NCBI's servers,
it is better to fetch a bunch of records at the same time, as shown
below. However, in this situation you should ideally be using the
history feature described later in Section 8.13.
<<>>> idlist = ",".join(record["IdList"][:5])
>>> print idlist
187237168,187372713,187372690,187372688,187372686
>>> handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
>>> records = Entrez.read(handle)
>>> print len(records)
5
Each of these records corresponds to one GenBank record.
<<>>> print records[0].keys()
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
u'GBSeq_primary-accession', u'GBSeq_definition',
u'GBSeq_accession-version', u'GBSeq_topology', u'GBSeq_length',
u'GBSeq_feature-table', u'GBSeq_create-date', u'GBSeq_other-seqids',
u'GBSeq_division', u'GBSeq_taxonomy', u'GBSeq_references',
u'GBSeq_update-date', u'GBSeq_organism', u'GBSeq_locus',
u'GBSeq_strandedness']
>>> print records[0]["GBSeq_primary-accession"]
DQ110336
>>> print records[0]["GBSeq_other-seqids"]
['gb|DQ110336.1|', 'gi|187237168']
>>> print records[0]["GBSeq_definition"]
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, ...
>>> print records[0]["GBSeq_organism"]
Cypripedium calceolus
You could use this to quickly set up searches -- but for heavy usage,
see Section 8.13.
8.12.3 Searching, downloading, and parsing GenBank records
===========================================================
The GenBank record format is a very popular method of holding
information about sequences, sequence features, and other associated
sequence information. The format is a good way to get information from
the NCBI databases at http://www.ncbi.nlm.nih.gov/.
In this example we'll show how to query the NCBI databases, retrieve
the records from the query, and then parse them using 'Bio.SeqIO' --
something touched on in Section 5.2.1. For simplicity, this example does
not take advantage of the WebEnv history feature -- see Section 8.13 for
this.
First, we want to make a query and find out the ids of the records to
retrieve. Here we'll do a quick search for one of our favorite
organisms, Opuntia (prickly-pear cacti). We can do a quick search and
get back the GIs (GenBank identifiers) for all of the corresponding
records. First we check how many records there are:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="nuccore":
...         print row["Count"]
9
Now we download the list of GenBank identifiers:
<<>>> handle = Entrez.esearch(db="nuccore", term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> gi_list = record["IdList"]
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289',
'6273285', '6273284']
Now we use these GIs to download the GenBank records -- note that you
have to supply a comma-separated list of GI numbers to Entrez:
<<>>> gi_str = ",".join(gi_list)
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")
If you want to look at the raw GenBank files, you can read from this
handle and print out the result:
<<>>> text = handle.read()
>>> print text
LOCUS       AY851612       892 bp    DNA    linear    PLN
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
VERSION     AY851612.1  GI:57240072
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core
            Caryophyllales; Cactaceae; Opuntioideae;
            Austrocylindropuntia.
REFERENCE   1  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
...
In this case, we are just getting the raw records. To get the records
in a more Python-friendly form, we can use 'Bio.SeqIO' to parse the
GenBank data into 'SeqRecord' objects, including 'SeqFeature' objects:
<<>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")
>>> records = SeqIO.parse(handle, "gb")
We can now step through the records and look at the information we are
interested in:
<<>>> for record in records:
...     print "%s, length %i, with %i features" \
...           % (record.name, len(record), len(record.features))
AY851612, length 892, with 3 features
AY851611, length 881, with 3 features
AF191661, length 895, with 3 features
AF191665, length 902, with 3 features
AF191664, length 899, with 3 features
AF191663, length 899, with 3 features
AF191660, length 893, with 3 features
AF191659, length 894, with 3 features
AF191658, length 896, with 3 features
Using this automated query retrieval functionality is a big plus over
doing things by hand. Although the module should obey the NCBI's max
three queries per second rule, the NCBI have other recommendations like
avoiding peak hours. See Section 8.1. In particular, please note that
for simplicity, this example does not use the WebEnv history feature;
you should use this for any non-trivial search and download work (see
Section 8.13).
Finally, if you plan to repeat your analysis, rather than downloading
the files from the NCBI and parsing them immediately (as shown in this
example), you should just download the records once and save them to
your hard disk, and then parse the local file.
8.12.4 Finding the lineage of an organism
==========================================
Staying with a plant example, let's now find the lineage of the
Cypripedioideae orchid family. First, we search the Taxonomy database
for Cypripedioideae, which yields exactly one NCBI taxonomy identifier:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'
Now, we use 'efetch' to download this entry in the Taxonomy database,
and then parse it:
<<>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml")
>>> records = Entrez.read(handle)
Again, this record stores lots of information:
<<>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
u'MitoGeneticCode', u'UpdateDate']
We can get the lineage directly from this record:
<<>>> records[0]["Lineage"]
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta;
Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta;
Liliopsida; Asparagales; Orchidaceae'
The record data contains much more than just the information shown
here -- for example look under "LineageEx" instead of "Lineage" and
you'll get the NCBI taxon identifiers of the lineage entries too.
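If all you need are the taxon names, note that the "Lineage" value is a single semicolon-separated string, so you can split it into a list yourself (the string below is a shortened stand-in for the real lineage above):

```python
# The Lineage field is one long "; "-separated string; split it to
# get a clean list of taxon names.
lineage = ("cellular organisms; Eukaryota; Viridiplantae; "
           "Streptophyta; Embryophyta")
taxa = [name.strip() for name in lineage.split(";")]
print(taxa)
```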
8.13 Using the history and WebEnv
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Often you will want to make a series of linked queries. Most
typically, this means running a search, perhaps refining the search, and
then retrieving detailed search results. You can do this by making a
series of separate calls to Entrez. However, the NCBI prefer you to take
advantage of their history support -- for example combining ESearch and
EFetch.
Another typical use of the history support would be to combine EPost
and EFetch. You use EPost to upload a list of identifiers, which starts
a new history session. You then download the records with EFetch by
referring to the session (instead of the identifiers).
8.13.1 Searching for and downloading sequences using the history
=================================================================
Suppose we want to search and download all the Opuntia rpl16
nucleotide sequences, and store them in a FASTA file. As shown in
Section 8.12.3, we can naively combine 'Bio.Entrez.esearch()' to get a
list of GI numbers, and then call 'Bio.Entrez.efetch()' to download them
all.
However, the approved approach is to run the search with the history
feature. Then, we can fetch the results by reference to the search
results -- which the NCBI can anticipate and cache.
To do this, call 'Bio.Entrez.esearch()' as normal, but with the
additional argument of 'usehistory="y"':
<<>>> from Bio import Entrez
>>> Entrez.email = "history.user@example.com"
>>> search_handle = Entrez.esearch(db="nucleotide",
term="Opuntia[orgn] and rpl16", usehistory="y")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
When you get the XML output back, it will still include the usual
search results:
<<>>> gi_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(gi_list)
However, you also get given two additional pieces of information, the
'WebEnv' session cookie, and the 'QueryKey':
<<>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]
Having stored these values in the variables 'webenv' and 'query_key',
we can use them as parameters to 'Bio.Entrez.efetch()' instead of giving
the GI numbers as identifiers.
While for small searches you might be OK downloading everything at
once, it's better to download in batches. You use the 'retstart' and
'retmax' parameters to specify which range of search results you want
returned (starting entry using zero-based counting, and maximum number
of results to return). For example:
<<batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
For illustrative purposes, this example downloaded the FASTA records
in batches of three. Unless you are downloading genomes or chromosomes,
you would normally pick a larger batch size.
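The batching logic above boils down to stepping through the result count in fixed-size windows; each window maps directly onto one 'retstart'/'retmax' pair. Here is that arithmetic on its own, with a made-up count of 14 for illustration:

```python
# Compute the (retstart, end) windows that the download loop visits.
count = 14        # e.g. int(search_results["Count"])
batch_size = 3

batches = []
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)  # last window may be short
    batches.append((start, end))
print(batches)
```

Note that the last window is clipped to the total count, so no request ever asks for records beyond the end of the result set.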
8.13.2 Searching for and downloading abstracts using the history
=================================================================
Here is another history example, searching for papers published in
the last year about the Opuntia, and then downloading them into a file
in Medline format:
<<from Bio import Entrez
Entrez.email = "history.user@example.com"
search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Opuntia[ORGN]",
                                            usehistory="y"))
count = int(search_results["Count"])
print "Found %i results" % count
batch_size = 10
out_handle = open("recent_orchid_papers.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="pubmed", rettype="medline",
                                 retstart=start, retmax=batch_size,
                                 webenv=search_results["WebEnv"],
                                 query_key=search_results["QueryKey"])
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
At the time of writing, this gave 28 matches -- but because this is a
date dependent search, this will of course vary. As described in
Section 8.10.1 above, you can then use 'Bio.Medline' to parse the saved
records.
And finally, don't forget to include your own email address in the
Entrez email parameter.
6639
-----------------------------------
6642
(1) http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#Us
6643
erSystemRequirements
6645
(2) http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
6647
(3) http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html
6649
(4) http://www.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
6651
(5) http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6653
(6) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
6655
(7) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
6657
(8) http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6659
(9) http://www.ncbi.nlm.nih.gov/entrez/query/static/elink_help.html
6661
(10) http://www.ncbi.nlm.nih.gov/entrez/query/static/egquery_help.html
6663
(11) http://www.ncbi.nlm.nih.gov/entrez/query/static/espell_help.html
6665
(12) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6667
(13) http://www.ncbi.nlm.nih.gov/geo/
6669
(14) http://www.python.org/doc/lib/module-urllib.html
Chapter 9 Swiss-Prot and ExPASy
**********************************
9.1 Parsing Swiss-Prot files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Swiss-Prot (http://www.expasy.org/sprot) is a hand-curated database of
protein sequences. Biopython can parse the "plain text" Swiss-Prot file
format, which is still used for the UniProt Knowledgebase, which
combined Swiss-Prot, TrEMBL and PIR-PSD. We do not (yet) support the
UniProtKB XML file format.
9.1.1 Parsing Swiss-Prot records
=================================
In Section 5.2.2, we described how to extract the sequence of a
Swiss-Prot record as a 'SeqRecord' object. Alternatively, you can store
the Swiss-Prot record in a 'Bio.SwissProt.Record' object, which in fact
stores the complete information contained in the Swiss-Prot record. In
this section, we describe how to extract 'Bio.SwissProt.Record' objects
from a Swiss-Prot file.
To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot
record. There are several ways to do so, depending on where and how the
Swiss-Prot record is stored:
- Open a Swiss-Prot file locally:
'>>> handle = open("myswissprotfile.dat")'
- Open a gzipped Swiss-Prot file:
<<>>> import gzip
>>> handle = gzip.open("myswissprotfile.dat.gz")
- Open a Swiss-Prot file over the internet:
<<>>> import urllib
>>> handle = urllib.urlopen("http://www.somelocation.org/data/someswissprotfile.dat")
- Open a Swiss-Prot file over the internet from the ExPASy database
(see section 9.5.1):
<<>>> from Bio import ExPASy
>>> handle = ExPASy.get_sprot_raw(myaccessionnumber)
The key point is that for the parser, it doesn't matter how the
handle was created, as long as it points to data in the Swiss-Prot
format.
We can use 'Bio.SeqIO' as described in Section 5.2.2 to get file
format agnostic 'SeqRecord' objects. Alternatively, we can use
'Bio.SwissProt' to get 'Bio.SwissProt.Record' objects, which are a much
closer match to the underlying file format.
To read one Swiss-Prot record from the handle, we use the function
'read()':
<<>>> from Bio import SwissProt
>>> record = SwissProt.read(handle)
This function should be used if the handle points to exactly one
Swiss-Prot record. It raises a 'ValueError' if no Swiss-Prot record was
found, and also if more than one record was found.
We can now print out some information about this record:
<<>>> print record.description
'RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName:
Full=Naringenin-chalcone synthase 3;'
>>> for ref in record.references:
...     print "authors:", ref.authors
...     print "title:", ref.title
authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.;
title: "Molecular cloning and sequence analysis of chalcone synthase
cDNAs of Bromheadia finlaysoniana.";
>>> print record.organism_classification
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ...,
'Bromheadia']
6752
To parse a file that contains more than one Swiss-Prot record, we use
the 'parse' function instead. This function allows us to iterate over
the records in the file.

For example, let's parse the full Swiss-Prot database and collect all
the descriptions. You can download this from the ExPASy FTP site (1) as
a single gzipped file, 'uniprot_sprot.dat.gz' (about 300MB). This is a
compressed file containing a single file, 'uniprot_sprot.dat' (over
1.5GB).

As described at the start of this section, you can use the Python
library 'gzip' to open and uncompress a .gz file, like this:

>>> import gzip
>>> handle = gzip.open("uniprot_sprot.dat.gz")

However, uncompressing a large file takes time, and each time you open
the file for reading in this way, it has to be decompressed on the fly.
So, if you can spare the disk space you'll save time in the long run if
you first decompress the file to disk, to get the 'uniprot_sprot.dat'
file inside. Then you can open the file for reading as usual:

>>> handle = open("uniprot_sprot.dat")

As of June 2009, the full Swiss-Prot database downloaded from ExPASy
contained 468851 Swiss-Prot records. One concise way to build up a list
of the record descriptions is with a list comprehension:

>>> from Bio import SwissProt
>>> handle = open("uniprot_sprot.dat")
>>> descriptions = [record.description for record in SwissProt.parse(handle)]
>>> len(descriptions)
468851
>>> descriptions[:5]
['RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-2L;']

Or, using a for loop over the record iterator:

>>> from Bio import SwissProt
>>> descriptions = []
>>> handle = open("uniprot_sprot.dat")
>>> for record in SwissProt.parse(handle):
...     descriptions.append(record.description)
...
>>> len(descriptions)
468851

Because this is such a large input file, either way takes about eleven
minutes on my new desktop computer (using the uncompressed
'uniprot_sprot.dat' file as input).

It is equally easy to extract any kind of information you'd like from
Swiss-Prot records. To see the members of a Swiss-Prot record, use

>>> dir(record)
['__doc__', '__init__', '__module__', 'accessions', 'annotation_update',
 'comments', 'created', 'cross_references', 'data_class', 'description',
 'entry_name', 'features', 'gene_name', 'host_organism', 'keywords',
 'molecule_type', 'organelle', 'organism', 'organism_classification',
 'references', 'seqinfo', 'sequence', 'sequence_length',
 'sequence_update', 'taxonomy_id']
9.1.2 Parsing the Swiss-Prot keyword and category list
=======================================================

Swiss-Prot also distributes a file 'keywlist.txt', which lists the
keywords and categories used in Swiss-Prot. The file contains entries in
the following form:

ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
SY   Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster;
SY   Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding.
GO   GO:0051537; 2 iron, 2 sulfur cluster binding
HI   Ligand: Iron; Iron-sulfur; 2Fe-2S.
HI   Ligand: Metal-binding; 2Fe-2S.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
DE   Protein, or part of a protein, whose three-dimensional structure has
DE   been resolved experimentally (for example by X-ray crystallography or
DE   NMR spectroscopy) and whose coordinates are available in the PDB
DE   database. Can also be used for theoretical models.
HI   Technical term: 3D-structure.
CA   Technical term.
//

The entries in this file can be parsed by the 'parse' function in the
'Bio.SwissProt.KeyWList' module. Each entry is then stored as a
'Bio.SwissProt.KeyWList.Record', which is a Python dictionary.

>>> from Bio.SwissProt import KeyWList
>>> handle = open("keywlist.txt")
>>> records = KeyWList.parse(handle)
>>> for record in records:
...     print record['ID']
...     print record['DE']
...
2Fe-2S.
Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms
complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from
the protein.
9.2 Parsing Prosite records
*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Prosite is a database containing protein domains, protein families,
functional sites, as well as the patterns and profiles to recognize
them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a
Prosite record is represented by the 'Bio.ExPASy.Prosite.Record' class,
whose members correspond to the different fields in a Prosite record.

In general, a Prosite file can contain more than one Prosite record.
For example, the full set of Prosite records, which can be downloaded as
a single file ('prosite.dat') from the ExPASy FTP site (2), contains
2073 records (version 20.24 released on 4 December 2007). To parse such
a file, we again make use of an iterator:

>>> from Bio.ExPASy import Prosite
>>> handle = open("myprositefile.dat")
>>> records = Prosite.parse(handle)

We can now take the records one at a time and print out some
information. For example, using the file containing the complete Prosite
database, we'd find:

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> record = records.next()
>>> record.accession
'PS00001'
>>> record = records.next()
>>> record.accession
'PS00004'
>>> record = records.next()
>>> record.accession
'PS00005'

and so on. If you're interested in how many Prosite records there are,
you could use

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> n = 0
>>> for record in records: n+=1
...
>>> n
2073

To read exactly one Prosite record from the handle, you can use the
'read' function:

>>> from Bio.ExPASy import Prosite
>>> handle = open("mysingleprositerecord.dat")
>>> record = Prosite.read(handle)

This function raises a ValueError if no Prosite record is found, and
also if more than one Prosite record is found.
9.3 Parsing Prosite documentation records
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

In the Prosite example above, the 'record.pdoc' accession numbers
'PDOC00001', 'PDOC00004', 'PDOC00005' and so on refer to Prosite
documentation. The Prosite documentation records are available from
ExPASy as individual files, and as one file ('prosite.doc') containing
all Prosite documentation records.

We use the parser in 'Bio.ExPASy.Prodoc' to parse Prosite
documentation records. For example, to create a list of all accession
numbers of Prosite documentation records, you can use

>>> from Bio.ExPASy import Prodoc
>>> handle = open("prosite.doc")
>>> records = Prodoc.parse(handle)
>>> accessions = [record.accession for record in records]

Again a 'read()' function is provided to read exactly one Prosite
documentation record from the handle.
9.4 Parsing Enzyme records
*=*=*=*=*=*=*=*=*=*=*=*=*=*

ExPASy's Enzyme database is a repository of information on enzyme
nomenclature. A typical Enzyme record looks as follows:

ID   3.1.1.34
DE   Lipoprotein lipase.
AN   Clearing factor lipase.
AN   Diacylglycerol lipase.
AN   Diglyceride lipase.
CA   Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.
CC   -!- Hydrolyzes triacylglycerols in chylomicrons and very low-density
CC       lipoproteins (VLDL).
CC   -!- Also hydrolyzes diacylglycerol.
PR   PROSITE; PDOC00110;
DR   P11151, LIPL_BOVIN ;  P11153, LIPL_CAVPO ;  P11602, LIPL_CHICK ;
DR   P55031, LIPL_FELCA ;  P06858, LIPL_HUMAN ;  P11152, LIPL_MOUSE ;
DR   O46647, LIPL_MUSVI ;  P49060, LIPL_PAPAN ;  P49923, LIPL_PIG   ;
DR   Q06000, LIPL_RAT   ;  Q29524, LIPL_SHEEP ;
//

In this example, the first line shows the EC (Enzyme Commission)
number of lipoprotein lipase (second line). Alternative names of
lipoprotein lipase are "clearing factor lipase", "diacylglycerol
lipase", and "diglyceride lipase" (lines 3 through 5). The line starting
with "CA" shows the catalytic activity of this enzyme. Comment lines
start with "CC". The "PR" line shows references to the Prosite
Documentation records, and the "DR" lines show references to Swiss-Prot
records. Not all of these entries are necessarily present in an Enzyme
record.
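The dictionary-like layout of an Enzyme record follows directly from these two-letter line codes. As an illustration only (plain Python, not Biopython's actual parser, which also post-processes values, e.g. joining continuation lines and splitting DR entries), grouping a record's lines by their code might look like:

```python
from collections import OrderedDict

def group_by_code(lines):
    """Group Enzyme-style record lines by their two-letter line code.

    A simplified sketch of the idea behind a dictionary-based Record
    class: each code maps to the list of values seen on lines with
    that code, and "//" marks the end of the record.
    """
    fields = OrderedDict()
    for line in lines:
        if line.startswith("//"):      # end-of-record marker
            break
        code, _, value = line.partition("   ")
        fields.setdefault(code, []).append(value.strip())
    return fields

record = group_by_code([
    "ID   3.1.1.34",
    "DE   Lipoprotein lipase.",
    "AN   Clearing factor lipase.",
    "AN   Diacylglycerol lipase.",
    "//",
])
```

Here 'record["AN"]' would collect both alternative-name lines, much as the real parser gathers repeated lines under one key.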
In Biopython, an Enzyme record is represented by the
'Bio.ExPASy.Enzyme.Record' class. This record derives from a Python
dictionary and has keys corresponding to the two-letter codes used in
Enzyme files. To read an Enzyme file containing one Enzyme record, use
the 'read' function in 'Bio.ExPASy.Enzyme':

>>> from Bio.ExPASy import Enzyme
>>> handle = open("lipoprotein.txt")
>>> record = Enzyme.read(handle)
>>> record["ID"]
'3.1.1.34'
>>> record["DE"]
'Lipoprotein lipase.'
>>> record["AN"]
['Clearing factor lipase.', 'Diacylglycerol lipase.', 'Diglyceride lipase.']
>>> record["CA"]
'Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.'
>>> record["CC"]
['Hydrolyzes triacylglycerols in chylomicrons and very low-density lipoproteins
(VLDL).', 'Also hydrolyzes diacylglycerol.']
>>> record["PR"]
['PDOC00110']
>>> record["DR"]
[['P11151', 'LIPL_BOVIN'], ['P11153', 'LIPL_CAVPO'], ['P11602', 'LIPL_CHICK'],
['P55031', 'LIPL_FELCA'], ['P06858', 'LIPL_HUMAN'], ['P11152', 'LIPL_MOUSE'],
['O46647', 'LIPL_MUSVI'], ['P49060', 'LIPL_PAPAN'], ['P49923', 'LIPL_PIG'],
['Q06000', 'LIPL_RAT'], ['Q29524', 'LIPL_SHEEP']]

The 'read' function raises a ValueError if no Enzyme record is found,
and also if more than one Enzyme record is found.

The full set of Enzyme records can be downloaded as a single file
('enzyme.dat') from the ExPASy FTP site (3), containing 4877 records
(release of 3 March 2009). To parse such a file containing multiple
Enzyme records, use the 'parse' function in 'Bio.ExPASy.Enzyme' to
obtain an iterator:

>>> from Bio.ExPASy import Enzyme
>>> handle = open("enzyme.dat")
>>> records = Enzyme.parse(handle)

We can now iterate over the records one at a time. For example, we can
make a list of all EC numbers for which an Enzyme record is available:

>>> ecnumbers = [record["ID"] for record in records]
9.5 Accessing the ExPASy server
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Swiss-Prot, Prosite, and Prosite documentation records can be
downloaded from the ExPASy web server at http://www.expasy.org. Six
kinds of queries are available from ExPASy:

get_prodoc_entry   To download a Prosite documentation record in HTML format
get_prosite_entry  To download a Prosite record in HTML format
get_prosite_raw    To download a Prosite or Prosite documentation record in raw format
get_sprot_raw      To download a Swiss-Prot record in raw format
sprot_search_ful   To search for a Swiss-Prot record
sprot_search_de    To search for a Swiss-Prot record

To access this web server from a Python script, we use the
'Bio.ExPASy' module.
9.5.1 Retrieving a Swiss-Prot record
=====================================

Let's say we are looking at chalcone synthases for Orchids (see
section 2.3 for some justification for looking for interesting things
about orchids). Chalcone synthase is involved in flavanoid biosynthesis
in plants, and flavanoids make lots of cool things like pigment colors
and UV protectants.

If you do a search on Swiss-Prot, you can find three orchid proteins
for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let's
write a script which grabs these, and parses out some interesting
information.

First, we grab the records, using the 'get_sprot_raw()' function of
'Bio.ExPASy'. This function is very nice since you can feed it an id and
get back a handle to a raw text record (no html to mess with!). We can
then use 'Bio.SwissProt.read' to pull out the Swiss-Prot record, or
'Bio.SeqIO.read' to get a SeqRecord. The following code accomplishes
this:

>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> accessions = ["O23729", "O23730", "O23731"]
>>> records = []
>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     record = SwissProt.read(handle)
...     records.append(record)

If the accession number you provided to 'ExPASy.get_sprot_raw' does
not exist, then 'SwissProt.read(handle)' will raise a 'ValueError'. You
can catch 'ValueError' exceptions to detect invalid accession numbers:

>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     try:
...         record = SwissProt.read(handle)
...     except ValueError:
...         print "WARNING: Accession %s not found" % accession
...     else:
...         records.append(record)
9.5.2 Searching Swiss-Prot
===========================

Now, you may remark that I knew the records' accession numbers
beforehand. Indeed, 'get_sprot_raw()' needs either the entry name or an
accession number. When you don't have them handy, you can use one of the
'sprot_search_de()' or 'sprot_search_ful()' functions.

'sprot_search_de()' searches in the ID, DE, GN, OS and OG lines;
'sprot_search_ful()' searches in (nearly) all the fields. They are
detailed on http://www.expasy.org/cgi-bin/sprot-search-de and
http://www.expasy.org/cgi-bin/sprot-search-ful respectively. Note that
they don't search in TrEMBL by default (argument 'trembl'). Note also
that they return html pages; however, accession numbers are quite easily
extractable:

>>> from Bio import ExPASy
>>> import re
>>> handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase")
>>> # or:
>>> # handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}")
>>> html_results = handle.read()
>>> if "Number of sequences found" in html_results:
...     ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
... else:
...     ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)
9.5.3 Retrieving Prosite and Prosite documentation records
===========================================================

Prosite and Prosite documentation records can be retrieved either in
HTML format, or in raw format. To parse Prosite and Prosite
documentation records with Biopython, you should retrieve the records in
raw format. For other purposes, however, you may be interested in these
records in HTML format.

To retrieve a Prosite or Prosite documentation record in raw format,
use 'get_prosite_raw()'. For example, to download a Prosite record and
print it out in raw text format, use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> text = handle.read()
>>> print text

To retrieve a Prosite record and parse it into a
'Bio.ExPASy.Prosite.Record' object, use

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prosite
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> record = Prosite.read(handle)

The same function can be used to retrieve a Prosite documentation
record and parse it into a 'Bio.ExPASy.Prodoc.Record' object:

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prodoc
>>> handle = ExPASy.get_prosite_raw('PDOC00001')
>>> record = Prodoc.read(handle)

For non-existing accession numbers, 'ExPASy.get_prosite_raw' returns a
handle to an empty string. When faced with an empty string,
'Prosite.read' and 'Prodoc.read' will raise a ValueError. You can catch
these exceptions to detect invalid accession numbers.

The functions 'get_prosite_entry()' and 'get_prodoc_entry()' are used
to download Prosite and Prosite documentation records in HTML format. To
create a web page showing one Prosite record, you can use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_entry('PS00001')
>>> html = handle.read()
>>> output = open("myprositerecord.html", "w")
>>> output.write(html)
>>> output.close()

and similarly for a Prosite documentation record:

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prodoc_entry('PDOC00001')
>>> html = handle.read()
>>> output = open("myprodocrecord.html", "w")
>>> output.write(html)
>>> output.close()

For these functions, an invalid accession number returns an error
message in HTML format.
9.6 Scanning the Prosite database
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ScanProsite (4) allows you to scan protein sequences online against
the Prosite database by providing a UniProt or PDB sequence identifier
or the sequence itself. For more information about ScanProsite, please
see the ScanProsite documentation (5) as well as the documentation for
programmatic access of ScanProsite (6).

You can use Biopython's 'Bio.ExPASy.ScanProsite' module to scan the
Prosite database from Python. This module both helps you to access
ScanProsite programmatically, and to parse the results returned by
ScanProsite. To scan for Prosite patterns in the following protein
sequence:

MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT
CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN

you can use the following code:

>>> sequence = ("MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT"
...             "CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN")
>>> from Bio.ExPASy import ScanProsite
>>> handle = ScanProsite.scan(seq=sequence)

By executing 'handle.read()', you can obtain the search results in raw
XML format. Instead, let's use 'Bio.ExPASy.ScanProsite.read' to parse
the raw XML into a Python object:

>>> result = ScanProsite.read(handle)
>>> type(result)
<class 'Bio.ExPASy.ScanProsite.Record'>

A 'Bio.ExPASy.ScanProsite.Record' object is derived from a list, with
each element in the list storing one ScanProsite hit. This object also
stores the number of hits, as well as the number of search sequences, as
returned by ScanProsite. This ScanProsite search resulted in six hits:

>>> result.n_match
6
>>> result[0]
{'signature_ac': u'PS50948', 'level': u'0', 'stop': 98, 'sequence_ac':
u'USERSEQ1', 'start': 16, 'score': u'8.873'}
>>> result[1]
{'start': 37, 'stop': 39, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[2]
{'start': 45, 'stop': 48, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00006'}
>>> result[3]
{'start': 60, 'stop': 62, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[4]
{'start': 80, 'stop': 83, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00006'}
>>> result[5]
{'start': 106, 'stop': 111, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00008'}

Other ScanProsite parameters can be passed as keyword arguments; see
the documentation for programmatic access of ScanProsite (7) for more
information. As an example, passing 'lowscore=1' to include matches with
low level scores lets us find one additional hit:

>>> handle = ScanProsite.scan(seq=sequence, lowscore=1)
>>> result = ScanProsite.read(handle)
>>> result.n_match
7
-----------------------------------

(1) ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

(2) ftp://ftp.expasy.org/databases/prosite/prosite.dat

(3) ftp://ftp.expasy.org/databases/enzyme/enzyme.dat

(4) http://www.expasy.org/tools/scanprosite/

(5) http://www.expasy.org/tools/scanprosite/scanprosite-doc.html

(6) http://www.expasy.org/tools/scanprosite/ScanPrositeREST.html

(7) http://www.expasy.org/tools/scanprosite/ScanPrositeREST.html
Chapter 10 Going 3D: The PDB module
**************************************

Biopython also allows you to explore the extensive realm of
macromolecular structure. Biopython comes with a PDBParser class that
produces a Structure object. The Structure object can be used to access
the atomic data in the file in a convenient manner.

10.1 Structure representation
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

A macromolecular structure is represented using a structure, model,
chain, residue, atom (or SMCRA) hierarchy. The figure below shows a
UML class diagram of the SMCRA data structure. Such a data structure is
not necessarily best suited for the representation of the macromolecular
content of a structure, but it is absolutely necessary for a good
interpretation of the data present in a file that describes the
structure (typically a PDB or MMCIF file). If this hierarchy cannot
represent the contents of a structure file, it is fairly certain that
the file contains an error or at least does not describe the structure
unambiguously. If a SMCRA data structure cannot be generated, there is
reason to suspect a problem. Parsing a PDB file can thus be used to
detect likely problems. We will give several examples of this in section
10.5.1.
Structure, Model, Chain and Residue are all subclasses of the Entity
base class. The Atom class only (partly) implements the Entity interface
(because an Atom does not have children).

For each Entity subclass, you can extract a child by using a unique id
for that child as a key (e.g. you can extract an Atom object from a
Residue object by using an atom name string as a key, and you can extract
a Chain object from a Model object by using its chain identifier as a
key).

Disordered atoms and residues are represented by DisorderedAtom and
DisorderedResidue classes, which are both subclasses of the
DisorderedEntityWrapper base class. They hide the complexity associated
with disorder and behave exactly as Atom and Residue objects.

In general, a child Entity object (i.e. Atom, Residue, Chain, Model)
can be extracted from its parent (i.e. Residue, Chain, Model, Structure,
respectively) by using an id as a key.

child_entity=parent_entity[child_id]

You can also get a list of all child Entities of a parent Entity
object. Note that this list is sorted in a specific way (e.g. according
to chain identifier for Chain objects in a Model object).

child_list=parent_entity.get_list()

You can also get the parent from a child.

parent_entity=child_entity.get_parent()

At all levels of the SMCRA hierarchy, you can also extract a full id.
The full id is a tuple containing all id's starting from the top object
(Structure) down to the current object. A full id for a Residue object
e.g. is something like:

full_id=residue.get_full_id()
print full_id
("1abc", 0, "A", (" ", 10, "A"))

This corresponds to:

- The Structure with id "1abc"
- The Model with id 0
- The Chain with id "A"
- The Residue with id (" ", 10, "A").

The Residue id indicates that the residue is not a hetero-residue
(nor a water) because it has a blank hetero field, that its sequence
identifier is 10 and that its insertion code is "A".

Some other useful methods:

# get the entity's id
entity_id=entity.get_id()

# check if there is a child with a given id
entity.has_id(entity_id)

# get number of children
nr_children=len(entity)
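The child-by-id, get_list and get_parent behaviour described above can be sketched with a minimal stand-in class (plain Python, not Biopython's actual Entity implementation):

```python
class MiniEntity:
    """A minimal stand-in for the idea behind Bio.PDB's Entity base
    class: children are kept in a dict keyed by id, and each child
    knows its parent."""

    def __init__(self, entity_id):
        self.id = entity_id
        self.parent = None
        self.child_dict = {}

    def add(self, child):
        # no sanity checks here, mirroring the raw interface
        child.parent = self
        self.child_dict[child.id] = child

    def __getitem__(self, child_id):  # child_entity = parent_entity[child_id]
        return self.child_dict[child_id]

    def get_list(self):               # children, sorted in a fixed order
        return sorted(self.child_dict.values(), key=lambda c: str(c.id))

    def get_parent(self):
        return self.parent

    def has_id(self, child_id):
        return child_id in self.child_dict

    def __len__(self):                # number of children
        return len(self.child_dict)

# build a tiny Structure -> Model -> Chain fragment
structure = MiniEntity("1abc")
model = MiniEntity(0)
chain = MiniEntity("A")
structure.add(model)
model.add(chain)
```

With this sketch, 'structure[0]' returns the model and 'chain.get_parent()' returns it again, just as in the real SMCRA hierarchy.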
It is possible to delete, rename, add, etc. child entities from a
parent entity, but this does not include any sanity checks (e.g. it is
possible to add two residues with the same id to one chain). This really
should be done via a nice Decorator class that includes integrity
checking, but you can take a look at the code (Entity.py) if you want to
use the raw interface.

10.1.1 Structure
=================
The Structure object is at the top of the hierarchy. Its id is a user
given string. The Structure contains a number of Model children. Most
crystal structures (but not all) contain a single model, while NMR
structures typically consist of several models. Disorder in crystal
structures of large parts of molecules can also result in several
models.

10.1.1.1 Constructing a Structure object
-----------------------------------------

A Structure object is produced by a PDBParser object:

from Bio.PDB.PDBParser import PDBParser

p=PDBParser(PERMISSIVE=1)

structure_id="1fat"
filename="pdb1fat.ent"
s=p.get_structure(structure_id, filename)

The PERMISSIVE flag indicates that a number of common problems (see
10.5.1) associated with PDB files will be ignored (but note that some
atoms and/or residues will be missing). If the flag is not present a
PDBConstructionException will be generated during the parse operation.
10.1.1.2 Header and trailer
----------------------------

You can extract the header and trailer (simple lists of strings) of
the PDB file from the PDBParser object with the get_header and
get_trailer methods.

10.1.2 Model
=============

The id of the Model object is an integer, which is derived from the
position of the model in the parsed file (they are automatically
numbered starting from 0). The Model object stores a list of Chain
children.

Get the first model from a Structure object:

first_model=structure[0]

10.1.3 Chain
=============

The id of a Chain object is derived from the chain identifier in the
structure file, and can be any string. Each Chain in a Model object has
a unique id. The Chain object stores a list of Residue children.

Get the Chain object with identifier "A" from a Model object:

chain_A=model["A"]
10.1.4 Residue
===============

Unsurprisingly, a Residue object stores a set of Atom children. In
addition, it also contains a string that specifies the residue name
(e.g. "ASN") and the segment identifier of the residue (well known to
X-PLOR users, but not used in the construction of the SMCRA data
structure).

The id of a Residue object is composed of three parts: the hetero
field (hetfield), the sequence identifier (resseq) and the insertion
code (icode).

The hetero field is a string: it is "W" for waters, "H_" followed by
the residue name (e.g. "H_FUC") for other hetero residues, and blank for
standard amino and nucleic acids. This scheme is adopted for reasons
described in section 10.3.1.

The second field in the Residue id is the sequence identifier, an
integer describing the position of the residue in the chain.

The third field is a string, consisting of the insertion code. The
insertion code is sometimes used to preserve a certain desirable residue
numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr
80 and an Asn 81 residue) could e.g. have sequence identifiers and
insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the
residue numbering scheme stays in tune with that of the wild type
structure.

Let's give some examples. Asn 10 with a blank insertion code would
have residue id (" ", 10, " "). Water 10 would have residue id ("W", 10,
" "). A glucose molecule (a hetero residue with residue name GLC) with
sequence identifier 10 would have residue id ("H_GLC", 10, " "). In this
way, the three residues (with the same insertion code and sequence
identifier) can be part of the same chain because their residue id's are
different.

In most cases, the hetflag and insertion code fields will be blank,
e.g. (" ", 10, " "). In these cases, the sequence identifier can be used
as a shortcut for the full id:

# use full id
res10=chain[(" ", 10, " ")]
# use shortcut
res10=chain[10]

Each Residue object in a Chain object should have a unique id.
However, disordered residues are dealt with in a special way, as
described in section 10.2.3.2.

A Residue object has a number of additional methods:

r.get_resname() # return residue name, e.g. "ASN"
r.get_segid() # return the SEGID, e.g. "CHN1"
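The residue id scheme described above is easy to compute. The helper below is hypothetical (it is not a Bio.PDB function), but it reproduces the hetfield convention exactly as stated:

```python
def residue_id(resname, resseq, icode=" ", is_water=False, is_hetero=False):
    """Build a Bio.PDB-style residue id tuple (hetfield, resseq, icode).

    hetfield is "W" for waters, "H_" plus the residue name for other
    hetero residues, and a blank string for standard amino and
    nucleic acids.
    """
    if is_water:
        hetfield = "W"
    elif is_hetero:
        hetfield = "H_" + resname
    else:
        hetfield = " "
    return (hetfield, resseq, icode)
```

Applied to the examples above: Asn 10 gives (" ", 10, " "), water 10 gives ("W", 10, " "), and glucose (GLC) 10 gives ("H_GLC", 10, " ").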
10.1.5 Atom
============

The Atom object stores the data associated with an atom, and has no
children. The id of an atom is its atom name (e.g. "OG" for the side
chain oxygen of a Ser residue). An Atom id needs to be unique in a
Residue. Again, an exception is made for disordered atoms, as described
in section 10.2.2.

In a PDB file, an atom name consists of 4 chars, typically with
leading and trailing spaces. Often these spaces can be removed for ease
of use (e.g. an amino acid C alpha atom is labeled ".CA." in a PDB
file, where the dots represent spaces). To generate an atom name (and
thus an atom id) the spaces are removed, unless this would result in a
name collision in a Residue (i.e. two Atom objects with the same atom
name and id). In the latter case, the atom name including spaces is
tried. This situation can e.g. happen when one residue contains atoms
with names ".CA." and "CA..", although this is not very likely.

The atomic data stored includes the atom name, the atomic coordinates
(including standard deviation if present), the B factor (including
anisotropic B factors and standard deviation if present), the altloc
specifier and the full atom name including spaces. Less used items like
the atom element number or the atomic charge sometimes specified in a
PDB file are not stored.

An Atom object has the following additional methods:

a.get_name() # atom name (spaces stripped, e.g. "CA")
a.get_id() # id (equals atom name)
a.get_coord() # atomic coordinates
a.get_bfactor() # B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifier
a.get_sigatm() # std. dev. of atomic parameters
a.get_siguij() # std. dev. of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")

To represent the atom coordinates, siguij, anisotropic B factor and
sigatm, Numpy arrays are used.

10.2 Disorder
*=*=*=*=*=*=*=
10.2.1 General approach
========================

Disorder should be dealt with from two points of view: the atom and
the residue points of view. In general, we have tried to encapsulate all
the complexity that arises from disorder. If you just want to loop over
all C alpha atoms, you do not care that some residues have a disordered
side chain. On the other hand it should also be possible to represent
disorder completely in the data structure. Therefore, disordered atoms
or residues are stored in special objects that behave as if there is no
disorder. This is done by only representing a subset of the disordered
atoms or residues. Which subset is picked (e.g. which of the two
disordered OG side chain atom positions of a Ser residue is used) can be
specified by the user.
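The "behave as if there is no disorder" trick relies on forwarding: the wrapper passes any attribute access it does not handle itself on to a selected child. A minimal illustration of this delegation pattern in plain Python (not the actual DisorderedEntityWrapper code; note that the real DisorderedAtom selects the highest-occupancy child by default, while this sketch simply keeps the first one added):

```python
class DisorderedWrapper:
    """Forward uncaught attribute access to a selected child object,
    in the spirit of Bio.PDB's DisorderedEntityWrapper."""

    def __init__(self):
        self.children = {}    # children keyed by altloc
        self.selected = None

    def add(self, key, child):
        self.children[key] = child
        if self.selected is None:
            self.selected = child

    def select(self, key):
        self.selected = self.children[key]

    def __getattr__(self, name):
        # only called when normal lookup fails: delegate to the child
        return getattr(self.selected, name)

class FakeAtom:
    """A stand-in atom carrying just an altloc and an occupancy."""
    def __init__(self, altloc, occupancy):
        self.altloc = altloc
        self.occupancy = occupancy
    def get_occupancy(self):
        return self.occupancy

wrapper = DisorderedWrapper()
wrapper.add("A", FakeAtom("A", 0.7))
wrapper.add("B", FakeAtom("B", 0.3))
```

Calling 'wrapper.get_occupancy()' is transparently answered by the currently selected child, and 'wrapper.select("B")' switches which child answers.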
10.2.2 Disordered atoms
========================

Disordered atoms are represented by ordinary Atom objects, but all
Atom objects that represent the same physical atom are stored in a
DisorderedAtom object. Each Atom object in a DisorderedAtom object can
be uniquely indexed using its altloc specifier. The DisorderedAtom
object forwards all uncaught method calls to the selected Atom object,
by default the one that represents the atom with the highest
occupancy. The user can of course change the selected Atom object,
making use of its altloc specifier. In this way atom disorder is
represented correctly without much additional complexity. In other
words, if you are not interested in atom disorder, you will not be
bothered by it.

Each disordered atom has a characteristic altloc identifier. You can
specify that a DisorderedAtom object should behave like the Atom object
associated with a specific altloc identifier:

atom.disordered_select("A") # select altloc A atom
print atom.get_altloc()
"A"

atom.disordered_select("B") # select altloc B atom
print atom.get_altloc()
"B"
10.2.3 Disordered residues
7623
===========================
7627
10.2.3.1 Common case
7628
---------------------
7630
The most common case is a residue that contains one or more disordered
7631
atoms. This is evidently solved by using DisorderedAtom objects to
7632
represent the disordered atoms, and storing the DisorderedAtom object in
7633
a Residue object just like ordinary Atom objects. The DisorderedAtom
7634
will behave exactly like an ordinary atom (in fact the atom with the
7635
highest occupancy) by forwarding all uncaught method calls to one of the
7636
Atom objects (the selected Atom object) it contains.
10.2.3.2 Point mutations
-------------------------

A special case arises when disorder is due to a point mutation, i.e.
when two or more point mutants of a polypeptide are present in the
crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to different residue types (e.g. let's
say Ser 60 and Cys 60) they should not be stored in a single Residue
object as in the common case. In this case, each residue is represented
by one Residue object, and both Residue objects are stored in a
DisorderedResidue object.
The DisorderedResidue object forwards all uncaught methods to the
selected Residue object (by default the last Residue object added), and
thus behaves like an ordinary residue. Each Residue object in a
DisorderedResidue object can be uniquely identified by its residue name.
In the above example, residue Ser 60 would have id "SER" in the
DisorderedResidue object, while residue Cys 60 would have id "CYS". The
user can select the active Residue object in a DisorderedResidue object
using this id.
10.3 Hetero residues
*=*=*=*=*=*=*=*=*=*=*

10.3.1 Associated problems
===========================

A common problem with hetero residues is that several hetero and
non-hetero residues present in the same chain share the same sequence
identifier (and insertion code). Therefore, to generate a unique id for
each hetero residue, waters and other hetero residues are treated in a
special way.
Remember that Residue objects have the tuple (hetfield, resseq, icode)
as id. The hetfield is blank (" ") for amino and nucleic acids, and a
string for waters and other hetero residues. The content of the hetfield
is explained below.
10.3.2 Water residues
======================

The hetfield string of a water residue consists of the letter "W". So
a typical residue id for a water is ("W", 1, " ").
10.3.3 Other hetero residues
=============================

The hetfield string for other hetero residues starts with "H_"
followed by the residue name. A glucose molecule e.g. with residue name
"GLC" would have hetfield "H_GLC". Its residue id could e.g. be
("H_GLC", 1, " ").
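The id convention above is easy to reproduce in plain Python. The
helper below is purely illustrative (it is not part of Bio.PDB; the
function name and keyword arguments are invented for this sketch):

```python
def make_residue_id(resname, resseq, icode=" ", is_water=False, is_hetero=False):
    """Build a residue id tuple (hetfield, resseq, icode) following
    the convention described above: blank hetfield for amino and
    nucleic acids, "W" for waters, "H_" + residue name otherwise.
    Hypothetical helper, for illustration only."""
    if is_water:
        hetfield = "W"
    elif is_hetero:
        hetfield = "H_" + resname
    else:
        hetfield = " "
    return (hetfield, resseq, icode)

print(make_residue_id("HOH", 1, is_water=True))    # ('W', 1, ' ')
print(make_residue_id("GLC", 10, is_hetero=True))  # ('H_GLC', 10, ' ')
print(make_residue_id("SER", 60))                  # (' ', 60, ' ')
```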
10.4 Some random usage examples
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Parse a PDB file, and extract some Model, Chain, Residue and Atom
objects.
<<from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with
resseq 10).
<<residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]
Print all hetero residues in chain.
<<for residue in chain.get_list():
    residue_id=residue.get_id()
    hetfield=residue_id[0]
    if hetfield[0]=="H":
        print residue_id
Print out the coordinates of all CA atoms in a structure with B factor
greater than 50.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.has_id("CA"):
                ca=residue["CA"]
                if ca.get_bfactor()>50.0:
                    print ca.get_coord()
Print out all the residues that contain disordered atoms.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                resseq=residue.get_id()[1]
                resname=residue.get_resname()
                model_id=model.get_id()
                chain_id=chain.get_id()
                print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if
present). This will make sure that the SMCRA data structure will behave
as if only the atoms with altloc A are present.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                for atom in residue.get_list():
                    if atom.is_disordered():
                        if atom.disordered_has_id("A"):
                            atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting
of a Ser and a Cys residue. Make sure that residue 10 of this chain
behaves as the Cys residue.
<<residue=chain[10]
residue.disordered_select("CYS")
10.5 Common problems in PDB files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

10.5.1 Examples
================

The PDBParser/Structure class was tested on about 800 structures (each
belonging to a unique SCOP superfamily). This takes about 20 minutes, or
on average 1.5 seconds per structure. Parsing the structure of the large
ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10
seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data
structure could not be built. In all three cases, the likely cause is an
error in the PDB file that should be corrected. Generating an exception
in these cases is much better than running the chance of incorrectly
describing the structure in a data structure.
10.5.1.1 Duplicate residues
----------------------------

One structure contains two amino acid residues in one chain with the
same sequence identifier (resseq 3) and icode. Upon inspection it was
found that this chain contains the residues Thr A3, ..., Gly A202, Leu
A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar
situations exist for structure 1FFK (which e.g. contains Gly B64, Met
B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).
10.5.1.2 Duplicate atoms
-------------------------

Structure 1EJG contains a Ser/Pro point mutation in chain A at
position 22. In turn, Ser 22 contains some disordered atoms. As
expected, all atoms belonging to Ser 22 have a non-blank altloc
specifier (B or C). All atoms of Pro 22 have altloc A, except the N
atom, which has a blank altloc. This generates an exception, because all
atoms belonging to two residues at a point mutation should have a
non-blank altloc. It turns out that this atom is probably shared by Ser
and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem
in the file: the N atom should be present in both the Ser and the Pro
residue, in both cases associated with a suitable altloc identifier.
10.5.2 Automatic correction
============================

Some errors are quite common and can be easily corrected without much
risk of making a wrong interpretation. These cases are listed below.
10.5.2.1 A blank altloc for a disordered atom
----------------------------------------------

Normally each disordered atom should have a non-blank altloc
identifier. However, there are many structures that do not follow this
convention, and have a blank and a non-blank identifier for two
disordered positions of the same atom. This is automatically interpreted
in the right way.
10.5.2.2 Broken chains
-----------------------

Sometimes a structure contains a list of residues belonging to chain
A, followed by residues belonging to chain B, and again followed by
residues belonging to chain A, i.e. the chains are "broken". This is
correctly interpreted.
10.5.3 Fatal errors
====================

Sometimes a PDB file cannot be unambiguously interpreted. Rather than
guessing and risking a mistake, an exception is generated, and the user
is expected to correct the PDB file. These cases are listed below.
10.5.3.1 Duplicate residues
----------------------------

All residues in a chain should have a unique id. This id is generated
based on:

- The sequence identifier (resseq).
- The insertion code (icode).
- The hetfield string ("W" for waters and "H_" followed by the
  residue name for other hetero residues).
- The residue names of the residues in the case of point mutations
  (to store the Residue objects in a DisorderedResidue object).

If this does not lead to a unique id, something is quite likely wrong,
and an exception is generated.
10.5.3.2 Duplicate atoms
-------------------------

All atoms in a residue should have a unique id. This id is generated
based on:

- The atom name (without spaces, or with spaces if a problem arises).
- The altloc specifier.

If this does not lead to a unique id, something is quite likely wrong,
and an exception is generated.
10.6 Other features
*=*=*=*=*=*=*=*=*=*=

There are also some tools to analyze a crystal structure. Tools exist
to superimpose two coordinate sets (SVDSuperimposer), to extract
polypeptides from a structure (Polypeptide), to perform neighbor lookup
(NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup
is done using a KD tree module written in C++. It is very fast and also
includes a fast method to find all point pairs within a certain distance
of each other.
A Polypeptide object is simply a UserList of Residue objects. You can
construct a list of Polypeptide objects from a Structure object as
follows:
<<model_nr=1
polypeptide_list=build_peptides(structure, model_nr)
for polypeptide in polypeptide_list:
    print polypeptide

The Polypeptide objects are always created from a single Model (in
this case model 1).
Chapter 11 Bio.PopGen: Population genetics
*********************************************

Bio.PopGen is a new Biopython module supporting population genetics,
available in Biopython 1.44 onwards.
The medium term objective for the module is to support widely used
data formats, applications and databases. This module is currently under
intense development, and support for new features should appear at a
rather fast pace. Unfortunately this might also entail some instability
in the API, especially if you are using a CVS version. APIs that are
made available in public versions should be much more stable.
11.1 GenePop
*=*=*=*=*=*=*

GenePop (http://genepop.curtin.edu.au/) is a popular population
genetics software package supporting Hardy-Weinberg tests, linkage
disequilibrium, population differentiation, basic statistics, F_st and
migration estimates, among others. GenePop does not supply sequence
based statistics as it doesn't handle sequence data. The GenePop file
format is supported by a wide range of other population genetics
software applications, thus making it a relevant format in the
population genetics field.
Bio.PopGen provides a parser and generator for the GenePop file format.
Utilities to manipulate the content of a record are also provided. Here
is an example on how to read a GenePop file (you can find example
GenePop data files in the Test/PopGen directory of Biopython):
<<from Bio.PopGen import GenePop

handle = open("example.gen")
rec = GenePop.parse(handle)
handle.close()
This will read a file called example.gen and parse it. If you do print
rec, the record will be output again, in GenePop format.
The most important information in rec will be the loci names and
population information (but there is more -- use help(GenePop.Record) to
check the API documentation). Loci names can be found in rec.loci_list.
Population information can be found in rec.populations. Populations is a
list with one element per population. Each element is itself a list of
individuals, and each individual is a pair composed of the individual
name and a list of alleles (2 per marker). Here is an example for
rec.populations:
<<[
    [
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ]
]

So we have two populations, the first with two individuals, the second
with only one. The first individual of the first population is called
Ind1, and allelic information for each of the 3 loci follows. Please
note that for any locus, information might be missing (see as an
example, Ind2 above).
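Since rec.populations is just nested Python lists and tuples, standard
Python is enough to walk it. The sketch below uses the example data
above as a plain data structure (no Biopython required) to count the
missing allele calls per individual:

```python
# The example rec.populations structure from above, as plain Python data.
populations = [
    [   # population 1
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [   # population 2
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ],
]

# Count, per individual, how many allele calls are missing (None).
for pop_index, population in enumerate(populations):
    for name, loci in population:
        missing = sum(1 for pair in loci for allele in pair if allele is None)
        print(pop_index, name, missing)
```

As expected from the discussion above, Ind2 has three missing allele
calls and the other two individuals have none.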
A few utility functions to manipulate GenePop records are made
available; here is an example:
<<from Bio.PopGen import GenePop

#Imagine that you have loaded rec, as per the code snippet above...

rec.remove_population(pos)
#Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.

rec.remove_locus_by_position(pos)
#Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.

rec.remove_locus_by_name(name)
#Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.

rec_loci = rec.split_in_loci()
#Splits a record in loci, that is, for each locus it creates a new
# record, with a single locus and all populations.
# The result is returned in a dictionary, each key being the locus name.
# The value is the GenePop record.
# rec is not altered.

rec_pops = rec.split_in_pops(pop_names)
#Splits a record in populations, that is, for each population it creates
# a new record, with a single population and all loci.
# The result is returned in a dictionary, each key being
# the population name. As population names are not available in GenePop,
# they are passed in an array (pop_names).
# The value of each dictionary entry is the GenePop record.
# rec is not altered.
GenePop does not support population names, a limitation which can be
cumbersome at times. Functionality to enable population names is
currently being planned for Biopython. These extensions won't break
compatibility in any way with the standard format. In the medium term,
we would also like to support the GenePop web service.
11.2 Coalescent simulation
*=*=*=*=*=*=*=*=*=*=*=*=*=*

A coalescent simulation is a backward-in-time model of population
genetics. A simulation of ancestry is done until the Most
Recent Common Ancestor (MRCA) is found. This ancestry relationship,
starting on the MRCA and ending on the current generation sample, is
sometimes called a genealogy. Simple cases assume a population of
constant size in time, haploidy, no population structure, and simulate
the alleles of a single locus under no selection pressure.
Coalescent theory is used in many fields like selection detection,
estimation of demographic parameters of real populations or disease gene
mapping.
The strategy followed in the Biopython implementation of the
coalescent was not to create a new, built-in simulator from scratch but
to use an existing one, SIMCOAL2
(http://cmpg.unibe.ch/software/simcoal2/). SIMCOAL2 allows for, among
others, population structure, multiple demographic events, simulation of
multiple types of loci (SNPs, sequences, STRs/microsatellites and RFLPs)
with recombination, diploidy, multiple chromosomes or ascertainment
bias. Notably, SIMCOAL2 doesn't support any selection model. We
recommend reading SIMCOAL2's documentation, available at the link above.
The input for SIMCOAL2 is a file specifying the desired demography and
genome; the output is a set of files (typically around 1000) with the
simulated genomes of a sample of individuals per subpopulation. This set
of files can be used in many ways, for example to compute confidence
intervals inside which certain statistics (e.g., F_st or Tajima's D) are
expected to lie. Real population genetics dataset statistics can then be
compared to those confidence intervals.
Biopython's coalescent code allows you to create demographic scenarios
and genomes and to run SIMCOAL2.
11.2.1 Creating scenarios
==========================

Creating a scenario involves both creating a demography and a
chromosome structure. In many cases (e.g. when doing Approximate
Bayesian Computation -- ABC) it is important to test many parameter
variations (e.g. vary the effective population size, N_e, between 10,
50, 500 and 1000 individuals). The code provided allows for the
simulation of scenarios with different demographic parameters very
easily.
Below we see how we can create scenarios and then how to simulate them.
11.2.1.1 Demography
--------------------

A few predefined demographies are built-in; all have two shared
parameters: sample size (called sample_size on the template, see below
for its use) per deme and deme size, i.e. subpopulation size (pop_size).
All demographies are available as templates where all parameters can be
varied, and each template has a system name. The predefined
demographies/templates are:
Single population, constant size The standard parameters are enough to
  specify it. Template name: simple.
Single population, bottleneck As seen on figure 11.2.1.1. The
  parameters are current population size (pop_size on template, ne3 on
  figure), time of expansion, given as the generation in the past when
  it occurred (expand_gen), effective population size during the
  bottleneck (ne2), time of contraction (contract_gen) and original
  size in the remote past (ne1). Template name: bottle.
Island model The typical island model. The total number of demes is
  specified by total_demes and the migration rate by mig. Template name
  is island.
Stepping stone model - 1 dimension The stepping stone model in 1
  dimension, extremes disconnected. The total number of demes is
  total_demes, migration rate is mig. Template name is ssm_1d.
Stepping stone model - 2 dimensions The stepping stone model in 2
  dimensions, extremes disconnected. The parameters are x for the
  horizontal dimension and y for the vertical (the total number of
  demes being x times y), migration rate is mig. Template name is
  ssm_2d.
In our first example, we will generate a template for a single
population, constant size model with a sample size of 30 and a deme size
of 100. The code for this is:
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template

generate_simcoal_from_template('simple',
    [(1, [('SNP', [24, 0.0005, 0.0])])],
    [('sample_size', [30]),
     ('pop_size', [100])])
Executing this code snippet will generate a file in the current
directory called simple_100_30.par; this file can be given as input to
SIMCOAL2 to simulate the demography (below we will see how Biopython can
take care of calling SIMCOAL2).
This code consists of a single function call; let's discuss it
parameter by parameter.
The first parameter is the template id (from the list above). We are
using the id 'simple', which is the template for a single population of
constant size over time.
The second parameter is the chromosome structure. Please ignore it for
now, it will be explained in the next section.
The third parameter is a list of all required parameters (recall that
the simple model only needs sample_size and pop_size) and their possible
values (in this case each parameter has only one possible value).
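When a parameter is given more than one possible value, one parameter
file is produced per combination of values. The expansion can be
pictured with itertools.product (this is an illustration of the
behaviour, not Bio.PopGen code):

```python
import itertools

# Each parameter has a list of possible values, exactly as in the third
# argument of generate_simcoal_from_template.
params = [('sample_size', [30]),
          ('pop_size', [100]),
          ('total_demes', [10, 50, 100])]

names = [name for name, values in params]
combos = list(itertools.product(*[values for name, values in params]))
for combo in combos:
    # one parameter file would be generated per combination
    print(dict(zip(names, combo)))
```

With three values for total_demes and one value for each of the other
parameters, three combinations (and thus three files) result, matching
the island-model example below.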
Now, let's consider an example where we want to generate several
island models, and we are interested in varying the number of demes: 10,
50 and 100, with a migration rate of 1%. Sample size and deme size will
be the same as before. Here is the code:
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template

generate_simcoal_from_template('island',
    [(1, [('SNP', [24, 0.0005, 0.0])])],
    [('sample_size', [30]),
     ('pop_size', [100]),
     ('mig', [0.01]),
     ('total_demes', [10, 50, 100])])
In this case, 3 files will be generated: island_100_0.01_100_30.par,
island_10_0.01_100_30.par and island_50_0.01_100_30.par. Notice the rule
to make file names: template name, followed by parameter values in
reverse order.
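The naming rule (template name, then one value per parameter in reverse
order) can be expressed compactly. The helper below is purely
illustrative and not part of Bio.PopGen; the real naming happens inside
generate_simcoal_from_template:

```python
def simcoal_par_name(template, params):
    """Rebuild a SIMCOAL2 .par file name as described above:
    template name, then one value per parameter in reverse order.
    `params` is a list of (name, value) pairs in template order.
    Hypothetical helper, for illustration only."""
    values = [str(value) for name, value in reversed(params)]
    return template + "_" + "_".join(values) + ".par"

print(simcoal_par_name("island",
                       [("sample_size", 30), ("pop_size", 100),
                        ("mig", 0.01), ("total_demes", 100)]))
# island_100_0.01_100_30.par
```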
A few, arguably more esoteric, template demographies exist (please
check the Bio/PopGen/SimCoal/data directory in the Biopython source
tree). Furthermore it is possible for the user to create new templates.
That functionality will be discussed in a future version of this
document.
11.2.1.2 Chromosome structure
------------------------------

We strongly recommend reading the SIMCOAL2 documentation to understand
the full potential available in modeling chromosome structures. In this
subsection we only discuss how to implement chromosome structures using
the Biopython interface, not the underlying SIMCOAL2 capabilities.
We will start by implementing a single chromosome, with 24 SNPs, with a
recombination rate immediately on the right of each locus of 0.0005 and
a minimum frequency of the minor allele of 0. This will be specified by
the following list (to be passed as the second parameter to the function
generate_simcoal_from_template):
<<[(1, [('SNP', [24, 0.0005, 0.0])])]
This is actually the chromosome structure used in the above examples.
The chromosome structure is represented by a list of chromosomes; each
chromosome (i.e., each element in the list) is a tuple (a pair): the
first element is the number of times the chromosome is to be repeated
(as there might be interest in repeating the same chromosome many
times). The second element is a list of the actual components of the
chromosome. Each such element is again a pair: the first member is the
locus type and the second member the parameters for that locus type.
Confused? Before showing more examples let's review the example above:
we have a list with one element (thus one chromosome), the chromosome is
a single instance (therefore not to be repeated), it is composed of 24
SNPs, with a recombination rate of 0.0005 between each consecutive SNP,
and the minimum frequency of the minor allele is 0.0 (i.e., it can be
absent from a certain population).
Let's see a more complicated example:
<<[
  (5, [
       ('SNP', [24, 0.0005, 0.0])
      ]
  ),
  (2, [
       ('DNA', [10, 0.0, 0.00005, 0.33]),
       ('RFLP', [1, 0.0, 0.0001]),
       ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
      ]
  )
]
We start by having 5 chromosomes with the same structure as above
(i.e., 24 SNPs). We then have 2 chromosomes which have a DNA sequence
with 10 nucleotides, 0.0 recombination rate, 0.00005 mutation rate and a
transition rate of 0.33. Then we have an RFLP with 0.0 recombination
rate to the next locus and a 0.0001 mutation rate. Finally we have a
microsatellite (or STR), with 0.0 recombination rate to the next locus
(note that, as this is a single microsatellite with no loci following,
this recombination rate is irrelevant), a mutation rate of 0.001, a
geometric parameter of 0.0 and a range constraint of 0.0 (for
information about these parameters please consult the SIMCOAL2
documentation; you can use them to simulate various mutation models,
including the typical -- for microsatellites -- stepwise mutation
model).
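A quick way to sanity-check such a structure is to walk it with plain
Python. The two helpers below are hypothetical (not part of
Bio.PopGen); they only exploit the (repeats, locus-list) pairing
described above:

```python
# The "more complicated example" structure from above.
chrom_structure = [
    (5, [('SNP', [24, 0.0005, 0.0])]),
    (2, [('DNA', [10, 0.0, 0.00005, 0.33]),
         ('RFLP', [1, 0.0, 0.0001]),
         ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])]),
]

def count_chromosomes(structure):
    # first element of each pair: how many times that chromosome repeats
    return sum(repeats for repeats, loci in structure)

def count_locus_blocks(structure):
    # number of locus blocks, counting repeated chromosomes
    return sum(repeats * len(loci) for repeats, loci in structure)

print(count_chromosomes(chrom_structure))   # 5 + 2 = 7
print(count_locus_blocks(chrom_structure))  # 5*1 + 2*3 = 11
```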
11.2.2 Running SIMCOAL2
========================

We now discuss how to run SIMCOAL2 from inside Biopython. It is
required that the binary for SIMCOAL2 is called simcoal2 (or
simcoal2.exe on Windows based platforms); please note that the typical
name when downloading the program is in the format simcoal2_x_y. As
such, when installing SIMCOAL2 you will need to rename the downloaded
executable so that Biopython can find it.
It is possible to run SIMCOAL2 on files that were not generated using
the method above (e.g., writing a parameter file by hand), but we will
show an example by creating a model using the framework presented above.
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template
from Bio.PopGen.SimCoal.Controller import SimCoalController

generate_simcoal_from_template('simple',
    [(5, [
          ('SNP', [24, 0.0005, 0.0])
         ]
     ),
     (2, [
          ('DNA', [10, 0.0, 0.00005, 0.33]),
          ('RFLP', [1, 0.0, 0.0001]),
          ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
         ]
     )],
    [('sample_size', [30]),
     ('pop_size', [100])])

ctrl = SimCoalController('.')
ctrl.run_simcoal('simple_100_30.par', 50)
The lines of interest are the last two (plus the new import). Firstly
a controller for the application is created. The directory where the
binary is located has to be specified.
The simulator is then run on the last line: we know, from the rules
explained above, that the input file name is simple_100_30.par for the
simulation parameter file created. We then specify that we want to run
50 independent simulations. By default Biopython requests a simulation
of diploid data, but a third parameter can be added to simulate haploid
data (adding as a parameter the string '0'). SIMCOAL2 will now run
(please note that this can take quite a lot of time) and will create a
directory with the simulation results. The results can now be analysed
(typically by studying the data with Arlequin3). In the future Biopython
might support reading the Arlequin3 format and thus allow for the
analysis of SIMCOAL2 data inside Biopython.
11.3 Other applications
*=*=*=*=*=*=*=*=*=*=*=*=

Here we discuss interfaces and utilities to deal with population
genetics applications which arguably have a smaller user base.
11.3.1 FDist: Detecting selection and molecular adaptation
===========================================================

FDist is a selection detection application suite based on computing
(i.e. simulating) a "neutral" confidence interval based on F_st and
heterozygosity. Markers (which can be SNPs, microsatellites, AFLPs,
among others) which lie outside the "neutral" interval are to be
considered as possible candidates for being under selection.
FDist is mainly used when the number of markers is considered enough
to estimate an average F_st, but not enough to either have outliers
calculated from the dataset directly or, with even more markers for
which the relative positions in the genome are known, to use approaches
based on, e.g., Extended Haplotype Heterozygosity (EHH).
The typical usage pattern for FDist is as follows:

1. Import a dataset from an external format into FDist format.
2. Compute the average F_st. This is done by datacal inside FDist.
3. Simulate "neutral" markers based on the average F_st and the
   expected number of total populations. This is the core operation,
   done by fdist inside FDist.
4. Calculate the confidence interval, based on the desired confidence
   boundaries (typically 95% or 99%). This is done by cplot and is
   mainly used to plot the interval.
5. Assess each marker status against the simulation "neutral"
   confidence interval. Done by pv. This is used to detect the outlier
   status of each marker against the simulation.
We will now discuss each step with illustrating example code (for this
example to work, the FDist binaries have to be on the executable PATH).
The FDist data format is application specific and is not used at all
by other applications; as such you will probably have to convert your
data for use with FDist. Biopython can help you do this. Here is an
example converting from GenePop format to FDist format (along with
imports that will be needed in examples further below):
<<from Bio.PopGen import GenePop
from Bio.PopGen import FDist
from Bio.PopGen.FDist import Controller
from Bio.PopGen.FDist.Utils import convert_genepop_to_fdist

gp_rec = GenePop.parse(open("example.gen"))
fd_rec = convert_genepop_to_fdist(gp_rec)
in_file = open("infile", "w")
in_file.write(str(fd_rec))
in_file.close()
In this code we simply parse a GenePop file and convert it to an FDist
record.
Printing an FDist record will generate a string that can be directly
saved to a file and supplied to FDist. FDist requires the input file to
be called infile, therefore we save the record on a file with that name.
The most important fields on an FDist record are: num_pops, the number
of populations; num_loci, the number of loci; and loci_data, with the
marker data itself. Most probably the details of the record are of no
interest to the user, as the record's only purpose is to be passed to
FDist.
The next step is to calculate the average F_st of the dataset (along
with the sample size):
<<ctrl = Controller.FDistController()
fst, samp_size = ctrl.run_datacal()

On the first line we create an object to control the call of the FDist
suite; this object will be used further on in order to call other suite
applications.
On the second line we call the datacal application, which computes the
average F_st and the sample size. It is worth noting that the F_st
computed by datacal is a variation of Weir and Cockerham's theta.
We can now call the main fdist application in order to simulate neutral
markers:
<<sim_fst = ctrl.run_fdist(npops = 15, nsamples = fd_rec.num_pops, fst = fst,
                         sample_size = samp_size, mut = 0, num_sims = 40000)

npops Number of populations existing in nature. This is really a
  "guesstimate". Has to be lower than 100.
nsamples Number of populations sampled, has to be lower than npops.
fst Average F_st.
sample_size Average number of individuals sampled on each population.
mut Mutation model: 0 - infinite alleles; 1 - stepwise mutations.
num_sims Number of simulations to perform. Typically a number around
  40000 will be OK, but if you get a confidence interval that looks
  sharp (this can be detected when plotting the confidence interval
  computed below) the value can be increased (a suggestion would be
  steps of 10000 simulations).
The confusion in wording between number of samples and sample size
stems from the original application.
A file named out.dat will be created with the simulated
heterozygosities and F_sts; it will have as many lines as the number of
simulations requested.
Note that fdist returns the average F_st that it was capable of
simulating; for more details about this issue please read below the
paragraph on approximating the desired average F_st.
The next (optional) step is to calculate the confidence interval:
<<cpl_interval = ctrl.run_cplot(ci=0.99)
You can only call cplot after having run fdist.
This will calculate the confidence intervals (99% in this case) for a
previous fdist run. A list of quadruples is returned. The first element
represents the heterozygosity, the second the lower bound of the F_st
confidence interval for that heterozygosity, the third the average and
the fourth the upper bound. This can be used to trace the confidence
interval contour. This list is also written to a file, out.cpl.
The main purpose of this step is to return a set of points which can be
easily used to plot a confidence interval. It can be skipped if the
objective is only to assess the status of each marker against the
simulation, which is the next step...
<<pv_data = ctrl.run_pv()
You can only call pv after having run datacal and fdist.
This will use the simulated markers to assess the status of each
individual real marker. A list is returned, in the same order as the
loci_list that is on the FDist record (which is in the same order as
the GenePop record). Each element in the list is a quadruple; the
fundamental member of each quadruple is the last element (regarding the
other elements, please refer to the pv documentation -- for the sake of
simplicity we will not discuss them here), which returns the probability
of the simulated F_st being lower than the marker F_st. Higher values
would indicate a stronger candidate for positive selection, lower values
a candidate for balancing selection, and intermediate values a possible
neutral marker. What is "higher", "lower" or "intermediate" is really a
subjective issue, but taking a "confidence interval" approach and
considering a 95% confidence interval, "higher" would be between 0.95
and 1.0, "lower" between 0.0 and 0.05 and "intermediate" between 0.05
and 0.95.
8400
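The thresholding just described can be written down directly. The
following sketch classifies a marker from its probability value; the
function name and the alpha parameter are our own illustration, not
part of Bio.PopGen:

```python
def marker_status(p, alpha=0.05):
    """Interpret P(simulated F_st < marker F_st) at significance alpha."""
    if p >= 1.0 - alpha:
        # simulated F_st almost always below the marker's F_st
        return "candidate for positive selection"
    if p <= alpha:
        # simulated F_st almost always above the marker's F_st
        return "candidate for balancing selection"
    return "possibly neutral"
```

Applied to the last element of each quadruple in pv_data, this yields
one status string per locus.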
11.3.1.1 Approximating the desired average F_st
------------------------------------------------
Fdist tries to approximate the desired average F_st by doing a
coalescent simulation using migration rates based on the formula

    F_st = 1 / (4Nm + 1)

This formula assumes a few premises, like an infinite number of
populations.

In practice, when the number of populations is low, the mutation model
is stepwise and the sample size increases, fdist will not be able to
simulate an acceptable approximate average F_st.
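The island-model relation above can be inverted directly to see which
migration parameter Nm a target F_st corresponds to; a small sketch
(the helper name is our own, not a Bio.PopGen function):

```python
def island_model_nm(fst):
    # Invert F_st = 1 / (4*N*m + 1) for the migration parameter N*m.
    return (1.0 / fst - 1.0) / 4.0
```

For example, a target average F_st of 0.2 corresponds to Nm = 1, and an
F_st of 0.5 to Nm = 0.25.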
To address that, a function is provided to iteratively approach the
desired value by running several fdist runs in sequence. This approach
is computationally more intensive than a single fdist run, but yields
good results. The following code runs fdist approximating the desired
F_st:
<<sim_fst = ctrl.run_fdist_force_fst(npops = 15, nsamples = fd_rec.num_pops,
    fst = fst, sample_size = samp_size, mut = 0, num_sims = 40000,
    limit = 0.05)
The only new optional parameter compared with run_fdist is limit, which
is the desired maximum error. run_fdist can (and probably should) be
safely replaced with run_fdist_force_fst.
11.3.1.2 Final notes
---------------------
The process to determine the average F_st can be more sophisticated
than the one presented here. For more information we refer you to the
FDist README file. Biopython's code can be used to implement more
sophisticated approaches.
11.4 Future Developments
*=*=*=*=*=*=*=*=*=*=*=*=*
The most desired future developments would be the ones you add
yourself.

That being said, already existing fully functional code is currently
being incorporated into Bio.PopGen; this code covers the applications
FDist and SimCoal2, the HapMap and UCSC Table Browser databases, and
some simple statistics like F_st or allele counts.
Chapter 12 Supervised learning methods
*****************************************
Note that the supervised learning methods described in this chapter all
require Numerical Python (numpy) to be installed.
12.1 The Logistic Regression Model
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
12.1.1 Background and Purpose
==============================
Logistic regression is a supervised learning approach that attempts to
distinguish K classes from each other using a weighted sum of some
predictor variables x_i. The logistic regression model is used to
calculate the weights beta_i of the predictor variables. In Biopython,
the logistic regression model is currently implemented for two classes
only (K = 2); the number of predictor variables has no predefined
limit.

As an example, let's try to predict the operon structure in bacteria.
An operon is a set of adjacent genes on the same strand of DNA that are
transcribed into a single mRNA molecule. Translation of the single mRNA
molecule then yields the individual proteins. For Bacillus subtilis,
whose data we will be using, the average number of genes in an operon
is about 2.4.

As a first step in understanding gene regulation in bacteria, we need
to know the operon structure. For about 10% of the genes in Bacillus
subtilis, the operon structure is known from experiments. A supervised
learning method can be used to predict the operon structure for the
remaining 90% of the genes.
For such a supervised learning approach, we need to choose some
predictor variables x_i that can be measured easily and are somehow
related to the operon structure. One predictor variable might be the
distance in base pairs between genes. Adjacent genes belonging to the
same operon tend to be separated by a relatively short distance,
whereas adjacent genes in different operons tend to have a larger space
between them to allow for promoter and terminator sequences. Another
predictor variable is based on gene expression measurements. By
definition, genes belonging to the same operon have equal gene
expression profiles, while genes in different operons are expected to
have different expression profiles. In practice, the measured
expression profiles of genes in the same operon are not quite identical
due to the presence of measurement errors. To assess the similarity in
the gene expression profiles, we assume that the measurement errors
follow a normal distribution and calculate the corresponding
log-likelihood score.

We now have two predictor variables that we can use to predict if two
adjacent genes on the same strand of DNA belong to the same operon:
- x_1: the number of base pairs between them;
- x_2: their similarity in expression profile.
In a logistic regression model, we use a weighted sum of these two
predictors to calculate a joint score S:

    S = beta_0 + beta_1 x_1 + beta_2 x_2.          (12.1)
The logistic regression model gives us appropriate values for the
parameters beta_0, beta_1, beta_2 using two sets of example genes:

- OP: Adjacent genes, on the same strand of DNA, known to belong to
  the same operon;
- NOP: Adjacent genes, on the same strand of DNA, known to belong to
  different operons.
In the logistic regression model, the probability of belonging to a
class depends on the score via the logistic function. For the two
classes OP and NOP, we can write this as

                          exp(beta_0 + beta_1 x_1 + beta_2 x_2)
    Pr(OP | x_1, x_2)  = -------------------------------------------      (12.2)
                         1 + exp(beta_0 + beta_1 x_1 + beta_2 x_2)

                                          1
    Pr(NOP | x_1, x_2) = -------------------------------------------      (12.3)
                         1 + exp(beta_0 + beta_1 x_1 + beta_2 x_2)
Using a set of gene pairs for which it is known whether they belong
to the same operon (class OP) or to different operons (class NOP), we
can calculate the weights beta_0, beta_1, beta_2 by maximizing the
log-likelihood corresponding to the probability functions (12.2) and
(12.3).
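The quantity being maximized can be sketched in a few lines of plain
Python. This mirrors equations (12.2) and (12.3); it is an illustration
under our own function names, not the Bio.LogisticRegression
implementation:

```python
from math import exp, log

def pr_op(beta, x):
    # Equation (12.2): Pr(OP | x_1, x_2) via the logistic function.
    s = beta[0] + beta[1] * x[0] + beta[2] * x[1]
    return exp(s) / (1.0 + exp(s))

def log_likelihood(beta, xs, ys):
    # Sum of log-probabilities of the observed classes (1 = OP, 0 = NOP);
    # training maximizes this quantity over beta.
    total = 0.0
    for x, y in zip(xs, ys):
        p = pr_op(beta, x)
        total += log(p) if y == 1 else log(1.0 - p)
    return total
```

With all weights zero, every pair gets probability 0.5 for each class,
and the log-likelihood is just len(ys) * log(0.5); training moves beta
away from zero to raise this value.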
12.1.2 Training the logistic regression model
==============================================
--------------------------------------------------------
Table 12.1: Adjacent gene pairs known to belong to the same operon
(class OP) or to different operons (class NOP). Intergene distances are
negative if the two genes overlap.

---------------------------------------------------------------------------
|   Gene pair   |Intergene distance (x_1)|Gene expression score (x_2)|Class|
---------------------------------------------------------------------------
|cotJA --- cotJB|          -53           |          -200.78          | OP  |
| yesK --- yesL |          117           |          -267.14          | OP  |
| lplA --- lplB |           57           |          -163.47          | OP  |
| lplB --- lplC |           16           |          -190.30          | OP  |
| lplC --- lplD |           11           |          -220.94          | OP  |
| lplD --- yetF |           85           |          -193.94          | OP  |
| yfmT --- yfmS |           16           |          -182.71          | OP  |
| yfmF --- yfmE |           15           |          -180.41          | OP  |
| citS --- citT |          -26           |          -181.73          | OP  |
| citM --- yflN |           58           |          -259.87          | OP  |
| yfiI --- yfiJ |          126           |          -414.53          | NOP |
| lipB --- yfiQ |          191           |          -249.57          | NOP |
| yfiU --- yfiV |          113           |          -265.28          | NOP |
| yfhH --- yfhI |          145           |          -312.99          | NOP |
| cotY --- cotX |          154           |          -213.83          | NOP |
| yjoB --- rapA |          147           |          -380.85          | NOP |
| ptsI --- splA |           93           |          -291.13          | NOP |
---------------------------------------------------------------------------
--------------------------------------------------------

Table 12.1 lists some of the Bacillus subtilis gene pairs for which
the operon structure is known. Let's calculate the logistic regression
model from these data:
<<>>> from Bio import LogisticRegression
>>> xs = [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
>>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>>> model = LogisticRegression.train(xs, ys)
Here, 'xs' and 'ys' are the training data: 'xs' contains the predictor
variables for each gene pair, and 'ys' specifies if the gene pair
belongs to the same operon ('1', class OP) or different operons ('0',
class NOP). The resulting logistic regression model is stored in
'model', which contains the weights beta_0, beta_1, and beta_2:

<<>>> model.beta
[8.9830290157144681, -0.035968960444850887, 0.02181395662983519]
Note that beta_1 is negative, as gene pairs with a shorter intergene
distance have a higher probability of belonging to the same operon
(class OP). On the other hand, beta_2 is positive, as gene pairs
belonging to the same operon typically have a higher similarity score
of their gene expression profiles. The parameter beta_0 is positive due
to the higher prevalence of operon gene pairs than non-operon gene
pairs in the training data.
The function 'train' has two optional arguments: 'update_fn' and
'typecode'. The 'update_fn' can be used to specify a callback function,
taking as arguments the iteration number and the log-likelihood. With
the callback function, we can for example track the progress of the
model calculation (which uses a Newton-Raphson iteration to maximize
the log-likelihood function of the logistic regression model):

<<>>> def show_progress(iteration, loglikelihood):
        print "Iteration:", iteration, "Log-likelihood function:", loglikelihood

>>> model = LogisticRegression.train(xs, ys, update_fn=show_progress)
Iteration: 0 Log-likelihood function: -11.7835020695
Iteration: 1 Log-likelihood function: -7.15886767672
Iteration: 2 Log-likelihood function: -5.76877209868
Iteration: 3 Log-likelihood function: -5.11362294338
Iteration: 4 Log-likelihood function: -4.74870642433
Iteration: 5 Log-likelihood function: -4.50026077146
Iteration: 6 Log-likelihood function: -4.31127773737
Iteration: 7 Log-likelihood function: -4.16015043396
Iteration: 8 Log-likelihood function: -4.03561719785
Iteration: 9 Log-likelihood function: -3.93073282192
Iteration: 10 Log-likelihood function: -3.84087660929
Iteration: 11 Log-likelihood function: -3.76282560605
Iteration: 12 Log-likelihood function: -3.69425027154
Iteration: 13 Log-likelihood function: -3.6334178602
Iteration: 14 Log-likelihood function: -3.57900855837
Iteration: 15 Log-likelihood function: -3.52999671386
Iteration: 16 Log-likelihood function: -3.48557145163
Iteration: 17 Log-likelihood function: -3.44508206139
Iteration: 18 Log-likelihood function: -3.40799948447
Iteration: 19 Log-likelihood function: -3.3738885624
Iteration: 20 Log-likelihood function: -3.3423876581
Iteration: 21 Log-likelihood function: -3.31319343769
Iteration: 22 Log-likelihood function: -3.2860493346
Iteration: 23 Log-likelihood function: -3.2607366863
Iteration: 24 Log-likelihood function: -3.23706784091
Iteration: 25 Log-likelihood function: -3.21488073614
Iteration: 26 Log-likelihood function: -3.19403459259
Iteration: 27 Log-likelihood function: -3.17440646052
Iteration: 28 Log-likelihood function: -3.15588842703
Iteration: 29 Log-likelihood function: -3.13838533947
Iteration: 30 Log-likelihood function: -3.12181293595
Iteration: 31 Log-likelihood function: -3.10609629966
Iteration: 32 Log-likelihood function: -3.09116857282
Iteration: 33 Log-likelihood function: -3.07696988017
Iteration: 34 Log-likelihood function: -3.06344642288
Iteration: 35 Log-likelihood function: -3.05054971191
Iteration: 36 Log-likelihood function: -3.03823591619
Iteration: 37 Log-likelihood function: -3.02646530573
Iteration: 38 Log-likelihood function: -3.01520177394
Iteration: 39 Log-likelihood function: -3.00441242601
Iteration: 40 Log-likelihood function: -2.99406722296
Iteration: 41 Log-likelihood function: -2.98413867259
The iteration stops once the increase in the log-likelihood function
is less than 0.01. If no convergence is reached after 500 iterations,
the 'train' function returns with an 'AssertionError'.

The optional keyword 'typecode' can almost always be ignored. This
keyword allows the user to choose the type of Numeric matrix to use. In
particular, to avoid memory problems for very large problems, it may be
necessary to use single-precision floats (Float8, Float16, etc.) rather
than double, which is used by default.
12.1.3 Using the logistic regression model for classification
==============================================================
Classification is performed by calling the 'classify' function. Given
a logistic regression model and the values for x_1 and x_2 (e.g. for a
gene pair of unknown operon structure), the 'classify' function returns
'1' or '0', corresponding to class OP and class NOP, respectively. For
example, let's consider the gene pairs yxcE, yxcD and yxiB, yxiA:
--------------------------------------------------------
Table 12.2: Adjacent gene pairs of unknown operon status.

----------------------------------------------------------------
|  Gene pair  |Intergene distance x_1|Gene expression score x_2|
----------------------------------------------------------------
|yxcE --- yxcD|          6           |     -173.143442352      |
|yxiB --- yxiA|         309          |     -271.005880394      |
----------------------------------------------------------------

--------------------------------------------------------
The logistic regression model classifies yxcE, yxcD as belonging to
the same operon (class OP), while yxiB, yxiA are predicted to belong to
different operons (class NOP):

<<>>> print "yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352])
yxcE, yxcD: 1
>>> print "yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394])
yxiB, yxiA: 0

(which, by the way, agrees with the biological literature).
To find out how confident we can be in these predictions, we can call
the 'calculate' function to obtain the probabilities (equations (12.2)
and (12.3)) for class OP and NOP. For yxcE, yxcD we find

<<>>> q, p = LogisticRegression.calculate(model, [6, -173.143442352])
>>> print "class OP: probability =", p, "class NOP: probability =", q
class OP: probability = 0.993242163503 class NOP: probability = 0.00675783649744

and for yxiB, yxiA

<<>>> q, p = LogisticRegression.calculate(model, [309, -271.005880394])
>>> print "class OP: probability =", p, "class NOP: probability =", q
class OP: probability = 0.000321211251817 class NOP: probability = 0.999678788748
To get some idea of the prediction accuracy of the logistic regression
model, we can apply it to the training data:
<<>>> for i in range(len(ys)):
        print "True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i])

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but one of the gene
pairs. A more reliable estimate of the prediction accuracy can be found
from a leave-one-out analysis, in which the model is recalculated from
the training data after removing the gene pair to be predicted:
<<>>> for i in range(len(ys)):
        model = LogisticRegression.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:])
        print "True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i])
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
The leave-one-out analysis shows that the prediction of the logistic
regression model is incorrect for only two of the gene pairs, which
corresponds to a prediction accuracy of 88%.
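The leave-one-out loop above generalizes to any train/classify pair of
functions; here is a small wrapper sketch (our own helper, not part of
Biopython):

```python
def leave_one_out_accuracy(xs, ys, train_fn, classify_fn):
    # Hold out each example in turn, retrain on the rest,
    # and score the prediction for the held-out example.
    correct = 0
    for i in range(len(ys)):
        model = train_fn(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if classify_fn(model, xs[i]) == ys[i]:
            correct += 1
    return correct / float(len(ys))
```

For the example above this would be called as
leave_one_out_accuracy(xs, ys, LogisticRegression.train,
LogisticRegression.classify), returning roughly 0.88.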
12.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
======================================================================================
The logistic regression model is similar to linear discriminant
analysis. In linear discriminant analysis, the class probabilities also
follow equations (12.2) and (12.3). However, instead of estimating the
coefficients beta directly, we first fit a normal distribution to the
predictor variables x. The coefficients beta are then calculated from
the means and covariances of the normal distribution. If the
distribution of x is indeed normal, then we expect linear discriminant
analysis to perform better than the logistic regression model. The
logistic regression model, on the other hand, is more robust to
deviations from normality.

Another similar approach is a support vector machine with a linear
kernel. Such an SVM also uses a linear combination of the predictors,
but estimates the coefficients beta from the predictor variables x near
the boundary region between the classes. If the logistic regression
model (equations (12.2) and (12.3)) is a good description for x away
from the boundary region, we expect the logistic regression model to
perform better than an SVM with a linear kernel, as it relies on more
data. If not, an SVM with a linear kernel may perform better.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman: The Elements of
Statistical Learning. Data Mining, Inference, and Prediction. Springer
Series in Statistics, 2001. Chapter 4.4.
12.2 k-Nearest Neighbors
*=*=*=*=*=*=*=*=*=*=*=*=*
12.2.1 Background and purpose
==============================
The k-nearest neighbors method is a supervised learning approach that
does not need to fit a model to the data. Instead, data points are
classified based on the categories of the k nearest neighbors in the
training data set.

In Biopython, the k-nearest neighbors method is available in
'Bio.kNN'. To illustrate the use of the k-nearest neighbor method in
Biopython, we will use the same operon data set as in Section 12.1.
12.2.2 Initializing a k-nearest neighbors model
================================================
Using the data in Table 12.1, we create and initialize a k-nearest
neighbors model as follows:

<<>>> from Bio import kNN
>>> k = 3
>>> model = kNN.train(xs, ys, k)
where 'xs' and 'ys' are the same as in Section 12.1.2. Here, 'k' is
the number of neighbors k that will be considered for the
classification. For classification into two classes, choosing an odd
number for k lets you avoid tied votes. The function name 'train' is a
bit of a misnomer, since no model training is done: this function
simply stores 'xs', 'ys', and 'k' in 'model'.
12.2.3 Using a k-nearest neighbors model for classification
============================================================
To classify new data using the k-nearest neighbors model, we use the
'classify' function. This function takes a data point (x_1, x_2) and
finds the k nearest neighbors in the training data set 'xs'. The data
point (x_1, x_2) is then classified based on which category ('ys')
occurs most among the k neighbors.
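The voting procedure just described can be sketched in plain Python.
This is an illustrative re-implementation with the default Euclidean
distance and unweighted votes, under our own function name, not the
actual Bio.kNN code:

```python
from collections import Counter

def knn_vote(xs, ys, query, k):
    # Sort training indices by squared Euclidean distance to the query,
    # then take the majority category among the k closest points.
    order = sorted(range(len(xs)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(xs[i], query)))
    votes = Counter(ys[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Squared distances are enough here because sorting by the square gives
the same order as sorting by the distance itself.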
For the example of the gene pairs yxcE, yxcD and yxiB, yxiA, we find:

<<>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x)
yxcE, yxcD: 1
>>> x = [309, -271.005880394]
>>> print "yxiB, yxiA:", kNN.classify(model, x)
yxiB, yxiA: 0
In agreement with the logistic regression model, yxcE, yxcD are
classified as belonging to the same operon (class OP), while yxiB,
yxiA are predicted to belong to different operons.
The 'classify' function lets us specify both a distance function and a
weight function as optional arguments. The distance function affects
which k neighbors are chosen as the nearest neighbors, as these are
defined as the neighbors with the smallest distance to the query point
(x_1, x_2). By default, the Euclidean distance is used. Instead, we
could for example use the city-block (Manhattan) distance:
<<>>> def cityblock(x1, x2):
...    assert len(x1)==2
...    assert len(x2)==2
...    distance = abs(x1[0]-x2[0]) + abs(x1[1]-x2[1])
...    return distance
...
>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x, distance_fn = cityblock)
yxcE, yxcD: 1
The weight function can be used for weighted voting. For example, we
may want to give closer neighbors a higher weight than neighbors that
are further away:
<<>>> from math import exp
>>> def weight(x1, x2):
...    assert len(x1)==2
...    assert len(x2)==2
...    return exp(-abs(x1[0]-x2[0]) - abs(x1[1]-x2[1]))
...
>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x, weight_fn = weight)
yxcE, yxcD: 1
By default, all neighbors are given an equal weight.

To find out how confident we can be in these predictions, we can call
the 'calculate' function, which will calculate the total weight
assigned to the classes OP and NOP. For the default weighting scheme,
this reduces to the number of neighbors in each category. For yxcE,
yxcD, we find:
<<>>> x = [6, -173.143442352]
>>> weight = kNN.calculate(model, x)
>>> print "class OP: weight =", weight[0], "class NOP: weight =", weight[1]
class OP: weight = 0.0 class NOP: weight = 3.0
which means that all three neighbors of the data point (x_1, x_2) are
in the NOP class. As another example, for yesK, yesL we find
<<>>> x = [117, -267.14]
>>> weight = kNN.calculate(model, x)
>>> print "class OP: weight =", weight[0], "class NOP: weight =", weight[1]
class OP: weight = 2.0 class NOP: weight = 1.0
which means that two neighbors are operon pairs and one neighbor is a
non-operon pair.

To get some idea of the prediction accuracy of the k-nearest neighbors
approach, we can apply it to the training data:
<<>>> for i in range(len(ys)):
        print "True:", ys[i], "Predicted:", kNN.classify(model, xs[i])

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but two of the gene
pairs. A more reliable estimate of the prediction accuracy can be found
from a leave-one-out analysis, in which the model is recalculated from
the training data after removing the gene pair to be predicted:
<<>>> for i in range(len(ys)):
        model = kNN.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:], k)
        print "True:", ys[i], "Predicted:", kNN.classify(model, xs[i])
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
The leave-one-out analysis shows that the k-nearest neighbors model is
correct for 13 out of 17 gene pairs, which corresponds to a prediction
accuracy of 76%.
12.3 Naive Bayes
*=*=*=*=*=*=*=*=*

This section will describe the 'Bio.NaiveBayes' module.
12.4 Maximum Entropy
*=*=*=*=*=*=*=*=*=*=*
This section will describe the 'Bio.MaximumEntropy' module.

12.5 Markov Models
*=*=*=*=*=*=*=*=*=*

This section will describe the 'Bio.MarkovModel' and/or
'Bio.HMM.MarkovModel' modules.
Chapter 13 Graphics including GenomeDiagram
**********************************************
The 'Bio.Graphics' module depends on the third party Python library
ReportLab (1). Although focused on producing PDF files, ReportLab can
also create encapsulated postscript (EPS) and scalable vector graphics
(SVG) files. In addition to these vector based images, provided certain
further dependencies such as the Python Imaging Library (PIL) (2) are
installed, ReportLab can also output bitmap images (including JPEG,
PNG, GIF, BMP and PICT formats).
13.1 GenomeDiagram
*=*=*=*=*=*=*=*=*=*

13.1.1 Introduction
====================
The 'Bio.Graphics.GenomeDiagram' module is a new addition to Biopython
1.50, having previously been available as a separate Python module
dependent on Biopython. GenomeDiagram is described in the
Bioinformatics journal publication Pritchard et al. (2006),
doi:10.1093/bioinformatics/btk021 (3), and there are several example
images and documentation for the old separate version available at
http://bioinf.scri.ac.uk/lp/programs.php#genomediagram.
As the name might suggest, GenomeDiagram was designed for drawing
whole genomes, in particular prokaryotic genomes, either as linear
diagrams (optionally broken up into fragments to fit better) or as
circular wheel diagrams. It also proves well suited to drawing quite
detailed figures for smaller genomes such as phage, plasmids or
mitochondria.

This module is easiest to use if you have your genome loaded as a
'SeqRecord' object containing lots of 'SeqFeature' objects - for
example as loaded from a GenBank file (see Chapters 4 and 5).
13.1.2 Diagrams, tracks, feature-sets and features
===================================================
GenomeDiagram uses a nested set of objects. At the top level, you have
a diagram object representing a sequence (or sequence region) along the
horizontal axis (or circle). A diagram can contain one or more tracks,
shown stacked vertically (or radially on circular diagrams). These will
all have the same length and represent the same sequence region. You
might use one track to show the gene locations, another to show
regulatory regions, and a third track to show the GC percentage.

The most commonly used type of track will contain features, bundled
together in feature-sets. You might choose to use one feature-set for
all your CDS features, and another for tRNA features. This isn't
required - they can all go in the same feature-set, but it makes it
easier to update the properties of just selected features (e.g. make
all the tRNA features red).
There are two main ways to build up a complete diagram. Firstly, the
top down approach, where you create a diagram object, then use its
methods to add track(s), use the track methods to add feature-set(s),
and use their methods to add the features. Secondly, you can create the
individual objects separately (in whatever order suits your code), and
then combine them.
13.1.3 A top down example
==========================
We're going to draw a whole genome from a 'SeqRecord' object read in
from a GenBank file (see Chapter 5). This example uses the pPCP1
plasmid from Yersinia pestis biovar Microtus; the file is included with
the Biopython unit tests under the GenBank folder, or online as
NC_005816.gb (4) from our website.
<<from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"), "genbank")
We're using a top down approach, so after loading in our sequence we
next create an empty diagram, then add an (empty) track, and to that
add an (empty) feature set:

<<gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
Now the fun part - we take each gene 'SeqFeature' object in our
'SeqRecord', and use it to generate a feature on the diagram. We're
going to color them blue, alternating between a dark blue and a light
blue.

<<for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
Now we come to actually making the output file. This happens in two
steps: first we call the 'draw' method, which creates all the shapes
using ReportLab objects; then we call the 'write' method, which renders
these to the requested file format. Note you can output in multiple
file formats:
<<gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4',
                fragments=4, start=0, end=len(record))
gd_diagram.write("plasmid_linear.pdf", "PDF")
gd_diagram.write("plasmid_linear.eps", "EPS")
gd_diagram.write("plasmid_linear.svg", "SVG")
Also, provided you have the dependencies installed, you can output
bitmaps, for example:

<<gd_diagram.write("plasmid_linear.png", "PNG")
*images/plasmid_linear.png*

Notice that the 'fragments' argument, which we set to four, controls
how many pieces the genome gets broken up into.
If you want to do a circular figure, then try this:

<<gd_diagram.move_track(1,3) # move track to make an empty space in the middle
gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm,20*cm),
                start=0, end=len(record))
gd_diagram.write("plasmid_circular.pdf", "PDF")
*images/plasmid_circular.png*

These figures are not very exciting, but we've only just got started.
13.1.4 A bottom up example
===========================
Now let's produce exactly the same figures, but using the bottom up
approach. This means we create the different objects directly (and this
can be done in almost any order) and then combine them.
<<from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"), "genbank")

#Create the feature set and its feature objects,
gd_feature_set = GenomeDiagram.FeatureSet()
for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
#(this for loop is the same as in the previous example)

#Create a track, and a diagram
gd_track_for_features = GenomeDiagram.Track(name="Annotated Features")
gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")

#Now have to glue the bits together...
gd_track_for_features.add_set(gd_feature_set)
gd_diagram.add_track(gd_track_for_features, 1)
You can now call the 'draw' and 'write' methods as before to produce a
linear or circular diagram, using the code at the end of the top-down
example above. The figures should be identical.
13.1.5 Features without a SeqFeature
=====================================
In the above example we used a 'SeqRecord''s 'SeqFeature' objects to
build our diagram (see also Section 4.3). Sometimes you won't have
'SeqFeature' objects, but just the coordinates for a feature you want
to draw. You have to create a minimal 'SeqFeature' object, but this is
easy:

<<from Bio.SeqFeature import SeqFeature, FeatureLocation
my_seq_feature = SeqFeature(FeatureLocation(50,100),strand=+1)
For strand, use +1 for the forward strand, -1 for the reverse strand,
and None for both. Here is a short self-contained example:
<<from Bio.SeqFeature import SeqFeature, FeatureLocation
9237
from Bio.Graphics import GenomeDiagram
9238
from reportlab.lib.units import cm
9240
gdd = GenomeDiagram.Diagram('Test Diagram')
9241
gdt_features = gdd.new_track(1, greytrack=False)
9242
gds_features = gdt_features.new_set()
9244
#Add three features to show the strand options,
9245
feature = SeqFeature(FeatureLocation(25, 125), strand=+1)
9246
gds_features.add_feature(feature, name="Forward", label=True)
9247
feature = SeqFeature(FeatureLocation(150, 250), strand=None)
9248
gds_features.add_feature(feature, name="Standless", label=True)
9249
feature = SeqFeature(FeatureLocation(275, 375), strand=-1)
9250
gds_features.add_feature(feature, name="Reverse", label=True)
9252
gdd.draw(format='linear', pagesize=(15*cm,4*cm), fragments=1,
9254
gdd.write("GD_labels_default.pdf", "pdf")
9257
The top part of the image in the next subsection shows the output
9258
(in the default feature color, pale green).
9259
Notice that we have used the name argument here to specify the caption
9260
text for these features. This is discussed in more detail next.
9263
13.1.6 Feature captions
========================

Recall we used the following (where feature was a 'SeqFeature' object)
to add a feature to the diagram:

<<gd_feature_set.add_feature(feature, color=color, label=True)

In the example above the 'SeqFeature' annotation was used to pick a
sensible caption for the features. By default the following possible
entries under the 'SeqFeature' object's qualifiers dictionary are used:
gene, label, name, locus_tag, and product. More simply, you can specify
a name directly:

<<gd_feature_set.add_feature(feature, color=color, label=True, name="My Gene")

In addition to the caption text for each feature's label, you can also
choose the font, position (this defaults to the start of the sigil, you
can also choose the middle or at the end) and orientation (for linear
diagrams only, where this defaults to rotated by 45 degrees):

<<#Large font, parallel with the track
gd_feature_set.add_feature(feature, label=True, color="green",
                           label_size=25, label_angle=0)

#Very small font, perpendicular to the track (towards it)
gd_feature_set.add_feature(feature, label=True, color="purple",
                           label_position="end",
                           label_size=4, label_angle=90)

#Small font, perpendicular to the track (away from it)
gd_feature_set.add_feature(feature, label=True, color="blue",
                           label_position="middle",
                           label_size=6, label_angle=-90)

Combining each of these three fragments with the complete example in
the previous section should give something like this:

*images/GD_sigil_labels.png*

We've not shown it here, but you can also set label_color to control
the label's color (used in Section 13.1.8).
You'll notice the default font is quite small - this makes sense
because you will usually be drawing many (small) features on a page, not
just a few large ones as shown here.
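The default caption lookup described above (gene, then label, name, locus_tag, and product) is easy to mimic in plain Python. This is an illustrative sketch of that selection logic only, not GenomeDiagram's actual implementation, and the example qualifier values are made up:

```python
def pick_caption(qualifiers):
    """Pick a feature caption: try the qualifier keys in priority
    order and use the first one present.
    Illustrative sketch only - not GenomeDiagram's actual code."""
    for key in ("gene", "label", "name", "locus_tag", "product"):
        if key in qualifiers:
            value = qualifiers[key]
            # GenBank qualifier values are usually lists of strings
            return value[0] if isinstance(value, list) else value
    return ""  # no suitable caption available

# Example with a GenBank-style qualifiers dictionary (made-up values):
caption = pick_caption({"locus_tag": ["YP_pPCP05"], "product": ["pesticin"]})
```

Because locus_tag outranks product in the priority list, the caption here would be the locus tag.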
13.1.7 Feature sigils
======================

The examples above have all just used the default sigil for the
feature, a plain box, but you can also use arrows instead. Note this
wasn't available in the last publicly released standalone version of
GenomeDiagram.

<<#Default uses a BOX sigil
gd_feature_set.add_feature(feature)

#You can make this explicit:
gd_feature_set.add_feature(feature, sigil="BOX")

#Or opt for an arrow:
gd_feature_set.add_feature(feature, sigil="ARROW")

The default arrows are shown at the top of the next two images. The
arrows fit into a bounding box (as given by the default BOX sigil).
There are two additional options to adjust the shapes of the arrows,
firstly the thickness of the arrow shaft, given as a proportion of the
height of the bounding box:

<<#Full height shafts, giving pointed boxes:
gd_feature_set.add_feature(feature, sigil="ARROW", color="brown",
                           arrowshaft_height=1.0)
#Or, thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="teal",
                           arrowshaft_height=0.2)
#Or, very thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="darkgreen",
                           arrowshaft_height=0.1)

The results are shown below:

*images/GD_sigil_arrow_shafts.png*

Secondly, the length of the arrow head - given as a proportion of the
height of the bounding box (defaulting to 0.5, or 50%):

<<#Short arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="blue",
                           arrowhead_length=0.25)
#Or, longer arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="orange",
                           arrowhead_length=1)
#Or, very very long arrow heads (i.e. all head, no shaft):
gd_feature_set.add_feature(feature, sigil="ARROW", color="red",
                           arrowhead_length=10000)

The results are shown below:

*images/GD_sigil_arrow_heads.png*
13.1.8 A nice example
======================

Now let's return to the pPCP1 plasmid from Yersinia pestis biovar
Microtus, and the top down approach used in Section 13.1.3, but take
advantage of the sigil options we've now discussed. This time we'll use
arrows for the genes, and overlay them with strandless features (as
plain boxes) showing the position of some restriction digest sites.

<<from reportlab.lib import colors
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

record = SeqIO.read(open("NC_005816.gb"), "genbank")

gd_diagram = GenomeDiagram.Diagram(record.id)
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()

for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, sigil="ARROW",
                               color=color, label=True,
                               label_size = 14, label_angle=0)

#I want to include some strandless features, so for an example
#will use EcoRI recognition sites etc.
for site, name, color in [("GAATTC","EcoRI",colors.green),
                          ("CCCGGG","SmaI",colors.orange),
                          ("AAGCTT","HindIII",colors.red),
                          ("GGATCC","BamHI",colors.purple)] :
    index = 0
    while True :
        index = record.seq.find(site, start=index)
        if index == -1 : break
        feature = SeqFeature(FeatureLocation(index, index+len(site)))
        gd_feature_set.add_feature(feature, color=color, name=name,
                                   label=True, label_size = 10,
                                   label_color=color)
        index += len(site)

gd_diagram.draw(format="linear", pagesize='A4', fragments=4,
                start=0, end=len(record))
gd_diagram.write("plasmid_linear_nice.pdf", "PDF")
gd_diagram.write("plasmid_linear_nice.eps", "EPS")
gd_diagram.write("plasmid_linear_nice.svg", "SVG")

*images/plasmid_linear_nice.png*
13.1.9 Further options
=======================

All the examples so far have used a single track, but you can have
more than one track -- for example show the genes on one, and repeat
regions on another. You can also enable tick marks to show the scale --
after all every graph should show its units.
Also, we have only used the 'FeatureSet' so far. GenomeDiagram also
has a 'GraphSet' which can be used to show line graphs, bar charts and
heat plots (e.g. to show plots of GC% on a track parallel to the
features).
These options are not covered here yet, so for now we refer you to the
User Guide (PDF) (5) included with the standalone version of
GenomeDiagram (but please read the next section first), and the
docstrings.
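As a taster for the kind of data a GC% graph needs, the values are typically supplied as a list of (position, value) points computed over a sliding window. The windowing itself is plain Python; the function name and window size below are our own illustrative choices, not part of the GenomeDiagram API:

```python
def gc_window_data(sequence, window=100):
    """Return (window midpoint, GC percentage) points for consecutive
    non-overlapping windows - the sort of data you could plot on a
    graph track. Illustrative sketch; the window size is arbitrary."""
    data = []
    for start in range(0, len(sequence) - window + 1, window):
        fragment = sequence[start:start + window].upper()
        gc = fragment.count("G") + fragment.count("C")
        data.append((start + window // 2, 100.0 * gc / window))
    return data

# A tiny example with a 4-base window:
points = gc_window_data("GGGGAAAAGGAA", window=4)
# points is [(2, 100.0), (6, 0.0), (10, 50.0)]
```

For a real genome you would call this on str(record.seq) with a much larger window, and hand the resulting list to a graph set.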
13.1.10 Converting old code
============================

If you have old code written using the standalone version of
GenomeDiagram, and you want to switch it over to using the new version
included with Biopython then you will have to make a few changes - most
importantly to your import statements.
Also, the older version of GenomeDiagram used only the UK spellings of
color and center (colour and centre). As part of the integration into
Biopython, both forms can now be used for argument names. However, at
some point in the future the UK spellings may be deprecated.
For example, if you used to have:

<<from GenomeDiagram import GDFeatureSet, GDDiagram
gdd = GDDiagram("An example")

you could just switch the import statements like this:

<<from Bio.Graphics.GenomeDiagram import FeatureSet as GDFeatureSet, \
                                         Diagram as GDDiagram
gdd = GDDiagram("An example")

and hopefully that should be enough. In the long term you might want
to switch to the new names, but you would have to change more of your
code:

<<from Bio.Graphics.GenomeDiagram import FeatureSet, Diagram
gdd = Diagram("An example")

or:

<<from Bio.Graphics import GenomeDiagram
gdd = GenomeDiagram.Diagram("An example")

If you run into difficulties, please ask on the Biopython mailing list
for advice. One catch is that as of Biopython 1.50, we have not yet
included the old module 'GenomeDiagram.GDUtilities'. This included a
number of GC% related functions, which will probably be merged under
'Bio.SeqUtils' later on.
13.2 Chromosomes
*=*=*=*=*=*=*=*=*

The 'Bio.Graphics.BasicChromosome' module allows drawing of simple
chromosomes. Here is a very simple example - for which we'll use
Arabidopsis thaliana.
I first downloaded the five sequenced chromosomes from the NCBI's FTP
site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana and then
parsed them with 'Bio.SeqIO' to find out their lengths. You could use
the GenBank files for this, but it is faster to use the FASTA files for
the whole chromosomes:
<<from Bio import SeqIO
entries = [("Chr I","CHR_I/NC_003070.fna"),
           ("Chr II","CHR_II/NC_003071.fna"),
           ("Chr III","CHR_III/NC_003074.fna"),
           ("Chr IV","CHR_IV/NC_003075.fna"),
           ("Chr V","CHR_V/NC_003076.fna")]
for (name, filename) in entries :
    record = SeqIO.read(open(filename),"fasta")
    print name, len(record)

This gave the lengths of the five chromosomes, which we'll now use in
the following short demonstration of the 'BasicChromosome' module:
<<from Bio.Graphics import BasicChromosome

entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]

max_length = max([length for name, length in entries])

chr_diagram = BasicChromosome.Organism()
for name, length in entries :
    cur_chromosome = BasicChromosome.Chromosome(name)
    #Set the length, adding an extra 20 percent for the telomeres:
    cur_chromosome.scale_num = max_length * 1.2

    #Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = 0.1 * max_length
    cur_chromosome.add(start)

    #Add a body - using bp as the scale length here.
    body = BasicChromosome.ChromosomeSegment()
    body.scale = length
    cur_chromosome.add(body)

    #Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = 0.1 * max_length
    cur_chromosome.add(end)

    #This chromosome is done
    chr_diagram.add(cur_chromosome)

chr_diagram.draw("simple_chrom.pdf", "Arabidopsis thaliana")

This should create a very simple PDF file, shown here:

*images/simple_chrom.png*
This example is deliberately short and sweet. One thing you might
want to try is showing the location of features of interest - perhaps
SNPs or genes. Currently the 'ChromosomeSegment' object doesn't support
sub-segments, which would be one approach. Instead, you would have to
replace the single large segment with lots of smaller segments, maybe
white ones for the boring regions, and colored ones for the regions of
interest.
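The segment-splitting idea just described amounts to turning a list of interesting regions into an alternating list of plain and colored segments. Here is an illustrative sketch of that bookkeeping on plain coordinates; the function and the "boring"/"interest" labels are our own, not part of 'Bio.Graphics.BasicChromosome':

```python
def split_into_segments(length, regions):
    """Turn sorted, non-overlapping (start, end) regions of interest
    into an alternating list of ("boring", size) and ("interest", size)
    segments covering a chromosome of the given length.
    Illustrative sketch only - not part of Bio.Graphics."""
    segments = []
    position = 0
    for start, end in regions:
        if start > position:
            segments.append(("boring", start - position))
        segments.append(("interest", end - start))
        position = end
    if position < length:
        segments.append(("boring", length - position))
    return segments

# One gene at 100-200 and a SNP region at 450-460 on a 1000 bp chromosome:
segs = split_into_segments(1000, [(100, 200), (450, 460)])
```

Each resulting size could then become the scale of a white or colored ChromosomeSegment, and the sizes always add up to the chromosome length.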
-----------------------------------

(1) http://www.reportlab.org

(2) http://www.pythonware.com/products/pil/

(3) http://dx.doi.org/10.1093/bioinformatics/btk021

(4) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb

(5) http://bioinf.scri.ac.uk/lp/downloads/programs/genomediagram/usergu
Chapter 14 Cookbook -- Cool things to do with it
***************************************************

Biopython now has two collections of "cookbook" examples -- this
chapter (which has been included in this tutorial for many years and has
gradually grown), and http://biopython.org/wiki/Category:Cookbook which
is a user contributed collection on our wiki.
We're trying to encourage Biopython users to contribute their own
examples to the wiki. In addition to helping the community, one direct
benefit of sharing an example like this is that you could also get some
feedback on the code from other Biopython users and developers - which
could help you improve all your Python code.
In the long term, we may end up moving all of the examples in this
chapter to the wiki, or elsewhere within the tutorial.
14.1 Working with sequence files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

This section shows some more examples of sequence input/output, using
the 'Bio.SeqIO' module described in Chapter 5.
14.1.1 Producing randomised genomes
====================================

Let's suppose you are looking at a genome sequence, hunting for some
sequence feature -- maybe extreme local GC% bias, or possible
restriction digest sites. Once you've got your Python code working on
the real genome it may be sensible to try running the same search on
randomised versions of the same genome for statistical analysis (after
all, any "features" you've found could be there just by chance).
For this discussion, we'll use the GenBank file for the pPCP1 plasmid
from Yersinia pestis biovar Microtus. The file is included with the
Biopython unit tests under the GenBank folder, or you can get it from
our website, NC_005816.gb (1). This file contains one and only one
record, so we can read it in as a 'SeqRecord' using the
'Bio.SeqIO.read()' function:
<<from Bio import SeqIO
original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")

So, how can we generate a shuffled version of the original sequence?
I would use the built in Python 'random' module for this, in particular
the function 'random.shuffle' -- but this works on a Python list. Our
sequence is a 'Seq' object, so in order to shuffle it we need to turn it
into a list first:

<<import random
nuc_list = list(original_rec.seq)
random.shuffle(nuc_list) #acts in situ!
Now, in order to use 'Bio.SeqIO' to output the shuffled sequence, we
need to construct a new 'SeqRecord' with a new 'Seq' object using this
shuffled list. In order to do this, we need to turn the list of
nucleotides (single letter strings) into a long string -- the standard
Python way to do this is with the string object's join method.

<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
shuffled_rec = SeqRecord(Seq("".join(nuc_list),
                             original_rec.seq.alphabet), \
                         id="Shuffled", \
                         description="Based on %s" % original_rec.id)
Let's put all these pieces together to make a complete Python script
which generates a single FASTA file containing 30 randomly shuffled
versions of the original sequence.
This first version just uses a big for loop and writes out the records
one by one (using the 'SeqRecord''s format method described in
Section 4.5):

<<import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")

handle = open("shuffled.fasta", "w")
for i in range(30) :
    nuc_list = list(original_rec.seq)
    random.shuffle(nuc_list)
    shuffled_rec = SeqRecord(Seq("".join(nuc_list),
                                 original_rec.seq.alphabet), \
                             id="Shuffled%i" % (i+1), \
                             description="Based on %s" % original_rec.id)
    handle.write(shuffled_rec.format("fasta"))
handle.close()
Personally I prefer the following version using a function to shuffle
the record and a generator expression instead of the for loop:

<<import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def make_shuffle_record(record, new_id) :
    nuc_list = list(record.seq)
    random.shuffle(nuc_list)
    return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet), \
                     id=new_id, description="Based on %s" % record.id)

original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i+1))
                 for i in range(30))

handle = open("shuffled.fasta", "w")
SeqIO.write(shuffled_recs, handle, "fasta")
handle.close()
14.1.2 Translating a FASTA file of CDS entries
===============================================

Suppose you've got an input file of CDS entries for some organism,
and you want to generate a new FASTA file containing their protein
sequences. i.e. Take each nucleotide sequence from the original file,
and translate it. Back in Section 3.8 we saw how to use the 'Seq'
object's 'translate' method, and the optional 'cds' argument which
enables correct translation of alternative start codons.
We can combine this with 'Bio.SeqIO' as shown in the reverse
complement example in Section 5.4.2. The key point is that for each
nucleotide 'SeqRecord', we need to create a protein 'SeqRecord' - and
take care of naming it.
You can write your own function to do this, choosing suitable protein
identifiers for your sequences, and the appropriate genetic code. In
this example we just use the default table and add a prefix to the
identifier:

<<from Bio.SeqRecord import SeqRecord
def make_protein_record(nuc_record) :
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq = nuc_record.seq.translate(cds=True), \
                     id = "trans_" + nuc_record.id, \
                     description = "translation of CDS, using default table")
We can then use this function to turn the input nucleotide records
into protein records ready for output. An elegant and memory efficient
way to do this is with a generator expression:

<<from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in \
            SeqIO.parse(open("coding_sequences.fasta"), "fasta"))
out_handle = open("translations.fasta", "w")
SeqIO.write(proteins, out_handle, "fasta")
out_handle.close()

This should work on any FASTA file of complete coding sequences. If
you are working on partial coding sequences, you may prefer to use
'nuc_record.seq.translate(to_stop=True)' in the example above, as this
wouldn't check for a valid start codon etc.
14.1.3 Simple quality filtering for FASTQ files
================================================

The FASTQ file format was introduced at Sanger and is now widely used
for holding nucleotide sequencing reads together with their quality
scores. FASTQ files (and the related QUAL files) are an excellent
example of per-letter-annotation, because for each nucleotide in the
sequence there is an associated quality score. Any per-letter-annotation
is held in a 'SeqRecord' in the 'letter_annotations' dictionary as a
list, tuple or string (with the same number of elements as the sequence
length).
One common task is taking a large set of sequencing reads and
filtering them (or cropping them) based on their quality scores. The
following example is very simplistic, but should illustrate the basics
of working with quality data in a 'SeqRecord' object. All we are going
to do here is read in a file of FASTQ data, and filter it to pick out
only those records whose PHRED quality scores are all above some
threshold (here 20).
For this example we'll use some real data downloaded from the NCBI,
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz
(8MB) which unzips to a 23MB file SRR014849.fastq.
<<from Bio import SeqIO
good_reads = (record for record in \
              SeqIO.parse(open("SRR014849.fastq"), "fastq") \
              if min(record.letter_annotations["phred_quality"]) >= 20)

out_handle = open("good_quality.fastq", "w")
count = SeqIO.write(good_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

This pulled out only 412 reads - maybe this dataset hasn't been
quality trimmed yet?
FASTQ files can contain millions of entries, so it is best to avoid
loading them all into memory at once. This example uses a generator
expression, which means only one 'SeqRecord' is created at a time -
avoiding any memory limitations.
14.1.4 Trimming off primer sequences
=====================================

For this example we're going to pretend that GTTGGAACCG is a 3' primer
sequence we want to look for in some FASTQ formatted read data. As in
the example above, we'll use the SRR014849.fastq file downloaded from
the NCBI
(ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz).
The same approach would work with any other supported file format
(e.g. FASTA files).
This code uses 'Bio.SeqIO' with a generator expression (to avoid
loading all the sequences into memory at once), and the 'Seq' object's
'startswith' method to see if the read starts with the primer sequence:

<<from Bio import SeqIO
primer_reads = (record for record in \
                SeqIO.parse(open("SRR014849.fastq"), "fastq") \
                if record.seq.startswith("GTTGGAACCG"))
out_handle = open("with_primer.fastq", "w")
count = SeqIO.write(primer_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

That should find 500 reads from SRR014849.fastq and save them to a new
FASTQ file, with_primer.fastq.
Now suppose that instead you wanted to make a FASTQ file containing
these 500 reads but with the primer sequence removed? That's just a
small change as we can slice the 'SeqRecord' (see Section 4.6) to remove
the first ten letters (the length of our primer):

<<from Bio import SeqIO
trimmed_primer_reads = (record[10:] for record in \
                        SeqIO.parse(open("SRR014849.fastq"), "fastq") \
                        if record.seq.startswith("GTTGGAACCG"))
out_handle = open("with_primer_trimmed.fastq", "w")
count = SeqIO.write(trimmed_primer_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

Again, that should pull out the 500 reads from SRR014849.fastq, but
this time strip off the first ten characters, and save them to another
new FASTQ file, with_primer_trimmed.fastq.
Finally, suppose you want to create a new FASTQ file where these 500
reads have their primer removed, but all the other reads are kept as
they were?

<<from Bio import SeqIO
def trim_primer(record, primer) :
    if record.seq.startswith(primer) :
        return record[len(primer):]
    else :
        return record

trimmed_reads = (trim_primer(record, "GTTGGAACCG") for record in \
                 SeqIO.parse(open("SRR014849.fastq"), "fastq"))
out_handle = open("trimmed.fastq", "w")
count = SeqIO.write(trimmed_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

This takes longer, as this time the output file contains all 94696
reads. Again, we've used a generator expression to avoid any memory
problems. Although it is slower, you might prefer to use a for loop:

<<from Bio import SeqIO
out_handle = open("trimmed.fastq", "w")
for record in SeqIO.parse(open("SRR014849.fastq"),"fastq") :
    if record.seq.startswith("GTTGGAACCG") :
        out_handle.write(record[10:].format("fastq"))
    else :
        out_handle.write(record.format("fastq"))
out_handle.close()

In this case the for loop looks simpler, but putting the trim logic
into a function is more tidy, and makes it easier to adjust the trimming
later on. For example, you might decide to look for a 5' primer as well.
Or, as in the following example, look for the primer anywhere in the
reads - not just at the beginning.
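Handling a 5' primer as well as a 3' primer can be sketched on plain strings; with Biopython you would slice 'SeqRecord' objects in exactly the same way, so the quality scores follow along. The function name and primer sequences here are illustrative assumptions:

```python
def trim_both_primers(seq, five_prime, three_prime):
    """Strip a 5' primer from the start and a 3' primer from the end,
    when present. Illustrative sketch working on plain strings - with
    Biopython you would slice SeqRecord objects the same way."""
    if seq.startswith(five_prime):
        seq = seq[len(five_prime):]
    if seq.endswith(three_prime):
        seq = seq[:-len(three_prime)]
    return seq

# Both (made-up) primers present, leaving just the insert:
trimmed = trim_both_primers("GTTGGAACCGACGTACGTAAGGTT", "GTTGGAACCG", "AAGGTT")
# trimmed is "ACGTACGT"
```

A read without either primer passes through unchanged, matching the behaviour of trim_primer above.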
14.1.5 Trimming off adaptor sequences
======================================

This is essentially a trivial extension to the previous example. We
are going to pretend GTTGGAACCG is an adaptor sequence in some
FASTQ formatted read data, again the SRR014849.fastq file from the NCBI
(ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz).
This time however, we will look for the sequence anywhere in the
reads, not just at the very beginning:

<<from Bio import SeqIO
def trim_adaptor(record, adaptor) :
    """Removes adaptor sequences, looks for perfect matches only."""
    index = record.seq.find(adaptor)
    if index == -1 :
        return record #not found, no trimming
    else :
        return record[index+len(adaptor):]

trimmed_reads = (trim_adaptor(record, "GTTGGAACCG") for record in \
                 SeqIO.parse(open("SRR014849.fastq"), "fastq"))
out_handle = open("trimmed.fastq", "w")
count = SeqIO.write(trimmed_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

Because we are using a FASTQ input file in this example, the
'SeqRecord' objects have per-letter-annotation for the quality scores.
By slicing the 'SeqRecord' object the appropriate scores are used on the
trimmed records, so we can output them as a FASTQ file too.
By changing the format names, you could apply this to FASTA files
instead. This code could also be extended to do a fuzzy match instead of
an exact match (maybe using a pairwise alignment, or taking into account
the read quality scores), but that will be much slower.
14.1.6 Converting FASTQ files
==============================

Back in Section 5.4.1 we showed how to use 'Bio.SeqIO' to convert
between two file formats. Here we'll go into a little more detail
regarding FASTQ files which are used in second generation DNA
sequencing. FASTQ files store both the DNA sequence (as a string) and
the associated read qualities.
PHRED scores (used in some FASTQ files, and also in QUAL files and ACE
files) have become a de facto standard for representing the probability
of a sequencing error (here denoted by P_e) at a given base using a
simple base ten log transformation:

Q_PHRED = - 10 x log10( P_e )    (14.1)

This means a wrong read (P_e = 1) gets a PHRED quality of 0, while a
very good read like P_e = 0.00001 gets a PHRED quality of 50. While for
raw sequencing data qualities higher than this are rare, with post
processing such as read mapping or assembly, qualities of up to about 90
are possible (indeed, the MAQ tool allows for PHRED scores in the range
0 to 93).
9916
storing the letters and quality scores for a sequencing read in a single
9917
plain text file. The only fly in the ointment is that there are at least
9918
three versions of the FASTQ format which are incompatible and difficult
9922
1. The original Sanger FASTQ format uses PHRED qualities encoded with
9923
an ASCII offset of 33. The NCBI are using this format in their Short
9924
Read Archive. We call this the fastq (or fastq-sanger) format in
9926
2. Solexa (later bought by Illumina) introduced their own version
9927
using Solexa qualities encoded with an ASCII offset of 64. We call
9928
this the fastq-solexa format.
9929
3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED
9930
qualities (which is more consistent), but encoded with an ASCII
9931
offset of 64. We call this the fastq-illumina format.
9933
The Solexa quality scores are defined using a different log
transformation:

Q_Solexa = - 10 x log10( P_e / (1 - P_e) )    (14.2)

Given Solexa/Illumina have now moved to using PHRED scores in version
1.3 of their pipeline, the Solexa quality scores will gradually fall out
of use. If you equate the error estimates (P_e) these two equations
allow conversion between the two scoring systems - and Biopython
includes functions to do this in the 'Bio.SeqIO.QualityIO' module, which
are called if you use 'Bio.SeqIO' to convert an old Solexa/Illumina file
into a standard Sanger FASTQ file:

<<from Bio import SeqIO
records = SeqIO.parse(open("solexa.fastq"), "fastq-solexa")
out_handle = open("standard.fastq", "w")
SeqIO.write(records, out_handle, "fastq")
out_handle.close()
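Equating P_e in equations (14.1) and (14.2) gives the mapping between the two quality scales. The following is a plain Python illustration of that algebra only, not the actual 'Bio.SeqIO.QualityIO' implementation:

```python
import math

def phred_from_solexa(q_solexa):
    """Convert a Solexa quality to a PHRED quality by equating P_e in
    equations (14.1) and (14.2). Illustrative sketch only - use
    Bio.SeqIO.QualityIO for real conversions."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)

def solexa_from_phred(q_phred):
    """The inverse mapping (only defined for Q_PHRED > 0)."""
    return 10 * math.log10(10 ** (q_phred / 10.0) - 1)

# For good quality reads the two scales are approximately equal:
q = phred_from_solexa(40)  # about 40.0004
```

The tiny difference at high qualities illustrates why applications can often treat good-quality Solexa and PHRED scores as interchangeable, as discussed for the fastq-solexa and fastq-illumina formats.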
If you want to convert a new Illumina 1.3+ FASTQ file, all that gets
changed is the ASCII offset because although encoded differently the
scores are all PHRED qualities:

<<from Bio import SeqIO
records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
out_handle = open("standard.fastq", "w")
SeqIO.write(records, out_handle, "fastq")
out_handle.close()

For good quality reads, PHRED and Solexa scores are approximately
equal, which means since both the fastq-solexa and fastq-illumina
formats use an ASCII offset of 64 the files are almost the same. This
was a deliberate design choice by Illumina, meaning applications
expecting the old fastq-solexa style files will probably be OK using the
newer fastq-illumina files (on good data). Of course, both variants are
very different from the original fastq standard as used by Sanger, the
NCBI, and elsewhere.
For more details, see the built in help (also online (2)):

<<>>> from Bio.SeqIO import QualityIO
>>> help(QualityIO)
14.1.7 Identifying open reading frames
=======================================

A very simplistic first step at identifying possible genes is to look
for open reading frames (ORFs). By this we mean look in all six frames
for long regions without stop codons -- an ORF is just a region of
nucleotides with no in frame stop codons.
Of course, to find a gene you would also need to worry about locating
a start codon, possible promoters -- and in Eukaryotes there are introns
to worry about too. However, this approach is still useful in viruses
and Prokaryotes.
To show how you might approach this with Biopython, we'll need a
sequence to search, and as an example we'll again use the bacterial
plasmid -- although this time we'll start with a plain FASTA file with
no pre-marked genes: NC_005816.fna (3). This is a bacterial sequence, so
we'll want to use NCBI codon table 11 (see Section 3.8 about
translation):

<<from Bio import SeqIO
record = SeqIO.read(open("NC_005816.fna"),"fasta")
table = 11
min_pro_len = 100
Here is a neat trick using the 'Seq' object's 'split' method to get a
list of all the possible ORF translations in the six reading frames:

<<for strand, nuc in [(+1, record.seq),
                      (-1, record.seq.reverse_complement())] :
    for frame in range(3) :
        for pro in nuc[frame:].translate(table).split("*") :
            if len(pro) >= min_pro_len :
                print "%s...%s - length %i, strand %i, frame %i" \
                      % (pro[:30], pro[-3:], len(pro), strand, frame)
<<GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2
Note that here we are counting the frames from the 5' end (start) of
each strand. It is sometimes easier to always count from the 5' end
(start) of the forward strand.

You could easily edit the above loop based code to build up a list of
the candidate proteins, or convert this to a list comprehension. Now,
one thing this code doesn't do is keep track of where the proteins are.
You could tackle this in several ways. For example, the following code
tracks the locations in terms of the protein counting, and converts back
to the parent sequence by multiplying by three, then adjusting for the
strand and frame:
<<from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"),"genbank")
table = 11
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length) :
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())] :
        for frame in range(3) :
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len :
                aa_end = trans.find("*", aa_start)
                if aa_end == -1 :
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length :
                    if strand == 1 :
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else :
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end+1
    answer.sort()
    return answer

orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list :
    print "%s...%s - length %i, strand %i, %i:%i" \
          % (pro[:30], pro[-3:], len(pro), strand, start, end)
<<NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1,
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1,
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1,
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1,
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1,
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1,
If you comment out the sort statement, then the protein sequences will
be shown in the same order as before, so you can check this is doing the
same thing. Here we have sorted them by location to make it easier to
compare to the actual annotation in the GenBank file (as visualised in

If however all you want to find are the locations of the open reading
frames, then it is a waste of time to translate every possible codon,
including doing the reverse complement to search the reverse strand too.
All you need to do is search for the possible stop codons (and their
reverse complements). Using regular expressions is an obvious approach
here -- these are an extremely powerful (but rather complex) way of
describing search strings, which are supported in lots of programming
languages and also in command line tools like grep. You can find
whole books about this topic!
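As a rough sketch of that regular expression idea (not from the original
tutorial; the function name is made up, and it is written in
version-agnostic plain Python using only the standard 're' module), here
is how you might locate candidate stop codons and their reverse
complements in a DNA string:

```python
import re

def stop_codon_positions(seq):
    """Find start positions of possible stop codons (TAA, TAG, TGA) and
    of their reverse complements (TTA, CTA, TCA) in a DNA string.

    A zero-width lookahead group is used so that overlapping hits are
    all reported."""
    forward = [m.start() for m in re.finditer(r"(?=(TAA|TAG|TGA))", seq)]
    reverse = [m.start() for m in re.finditer(r"(?=(TTA|CTA|TCA))", seq)]
    return forward, reverse

# Tiny made-up example: a forward TGA at 1, TAG at 6, and a TTA at 12
fwd, rev = stop_codon_positions("ATGAAATAGCCCTTA")
```

From the stop codon positions alone you can then work out the open
reading frames, without translating anything.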
14.2 Sequence parsing plus simple plots
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

This section shows some more examples of sequence parsing, using the
'Bio.SeqIO' module described in Chapter 5, plus the Python library
matplotlib's 'pylab' plotting interface (see the matplotlib website for
a tutorial (4)). Note that to follow these examples you will need
matplotlib installed - but without it you can still try the data parsing
bits.
14.2.1 Histogram of sequence lengths
=====================================

There are lots of times when you might want to visualise the
distribution of sequence lengths in a dataset -- for example the range
of contig sizes in a genome assembly project. In this example we'll
reuse our orchid FASTA file ls_orchid.fasta (5) which has only 94
sequences.

First of all, we will use 'Bio.SeqIO' to parse the FASTA file and
compile a list of all the sequence lengths. You could do this with a for
loop, but I find a list comprehension more pleasing:
<<>>> from Bio import SeqIO
>>> handle = open("ls_orchid.fasta")
>>> sizes = [len(seq_record) for seq_record in SeqIO.parse(handle, "fasta")]
>>> len(sizes), min(sizes), max(sizes)
>>> sizes
[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592]
Now that we have the lengths of all the genes (as a list of integers),
we can use the matplotlib histogram function to display it.

<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
sizes = [len(seq_record) for seq_record in SeqIO.parse(handle, "fasta")]
handle.close()

import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
            % (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()
That should pop up a new window containing the following graph:

*images/hist_plot.png*

Notice that most of these orchid sequences are about 740 bp long,
and there could be two distinct classes of sequence here with a subset
of shorter sequences.

Tip: Rather than using 'pylab.show()' to show the plot in a window,
you can also use 'pylab.savefig(...)' to save the figure to a file (e.g.
as a PNG or PDF file).
14.2.2 Plot of sequence GC%
============================

Another easily calculated quantity of a nucleotide sequence is the
GC%. You might want to look at the GC% of all the genes in a bacterial
genome for example, and investigate any outliers which could have been
recently acquired by horizontal gene transfer. Again, for this example
we'll reuse our orchid FASTA file ls_orchid.fasta (6).

First of all, we will use 'Bio.SeqIO' to parse the FASTA file and
compile a list of all the GC percentages. Again, you could do this with
a for loop, but I prefer the list comprehension used here:

<<from Bio import SeqIO
from Bio.SeqUtils import GC

handle = open("ls_orchid.fasta")
gc_values = [GC(seq_record.seq) for seq_record in SeqIO.parse(handle, "fasta")]
gc_values.sort()
handle.close()
Having read in each sequence and calculated the GC%, we then sorted
them into ascending order. Now we'll take this list of floating point
values and plot them with matplotlib:

<<import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
            % (len(gc_values),min(gc_values),max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()
As in the previous example, that should pop up a new window
containing a graph:

*images/gc_plot.png*

If you tried this on the full set of genes from one organism, you'd
probably get a much smoother plot than this.
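For readers without Biopython to hand, the GC calculation itself is
simple to sketch in plain Python. This is a hypothetical stand-in for
'Bio.SeqUtils.GC', not the real function -- for one thing, it ignores
ambiguous nucleotide codes such as S (G or C), which the Biopython
version counts:

```python
def gc_percent(seq):
    """Return the GC percentage of a nucleotide string.

    Simplified sketch of Bio.SeqUtils.GC: counts only unambiguous
    G and C letters."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

# Sorted GC% values for a few toy sequences, mirroring the plot above
values = sorted(gc_percent(s) for s in ["ATGC", "GGCC", "ATAT"])
```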
14.2.3 Nucleotide dot plots
============================

A dot plot is a way of visually comparing two nucleotide sequences
for similarity to each other. A sliding window is used to compare short
sub-sequences to each other, often with a mis-match threshold. Here for
simplicity we'll only look for perfect matches (shown in black in the
plot below).

To start off, we'll need two sequences. For the sake of argument,
we'll just take the first two from our orchid FASTA file
ls_orchid.fasta (7):
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
rec_one = record_iterator.next()
rec_two = record_iterator.next()
handle.close()
We're going to show two approaches. Firstly, a simple naive
implementation which compares all the window sized sub-sequences to each
other to compile a similarity matrix. You could construct a matrix or
array object, but here we just use a list of lists of booleans created
with a nested list comprehension:

<<window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
data = [[(seq_one[i:i+window] <> seq_two[j:j+window]) \
        for j in range(len(seq_one)-window)] \
        for i in range(len(seq_two)-window)]
Note that we have not checked for reverse complement matches here. Now
we'll use the matplotlib's 'pylab.imshow()' function to display this
data, first requesting the gray color scheme so this is done in black
and white:

<<pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" \
            % window)
pylab.show()
That should pop up a new window containing a graph like this:

*images/dot_plot.png*

As you might have expected, these two sequences are very similar
with a partial line of window sized matches along the diagonal. There
are no off diagonal matches which would be indicative of inversions or
other interesting events.
The above code works fine on small examples, but there are two
problems applying this to larger sequences, which we will address below.
First of all, this brute force approach to the all against all
comparisons is very slow. Instead, we'll compile dictionaries mapping
the window sized sub-sequences to their locations, and then take the set
intersection to find those sub-sequences found in both sequences. This
uses more memory, but is much faster. Secondly, the 'pylab.imshow()'
function is limited in the size of matrix it can display. As an
alternative, we'll use the 'pylab.scatter()' function.

We start by creating dictionaries mapping the window-sized
sub-sequences to locations:
<<dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
                            (str(rec_two.seq).upper(), dict_two)] :
    for i in range(len(seq)-window) :
        section = seq[i:i+window]
        try :
            section_dict[section].append(i)
        except KeyError :
            section_dict[section] = [i]

#Now find any sub-sequences found in both sequences
#(Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print "%i unique matches" % len(matches)
In order to use the 'pylab.scatter()' we need separate lists for the x
and y co-ordinates:

<<#Create lists of x and y co-ordinates for scatter plot
x = []
y = []
for section in matches :
    for i in dict_one[section] :
        for j in dict_two[section] :
            x.append(i)
            y.append(j)
We are now ready to draw the revised dot plot as a scatter plot:

<<pylab.cla() #clear any prior graph
pylab.gray()
pylab.scatter(x,y)
pylab.xlim(0, len(rec_one)-window)
pylab.ylim(0, len(rec_two)-window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" \
            % window)
pylab.show()
That should pop up a new window containing a graph like this:

*images/dot_plot_scatter.png*

Personally I find this second plot much easier to read! Again note
that we have not checked for reverse complement matches here -- you
could extend this example to do this, and perhaps plot the forward
matches in one color and the reverse matches in another.
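The dictionary-and-set-intersection idea above can be illustrated on
plain strings. This is a toy sketch with a made-up helper name, written
in version-agnostic plain Python rather than the Python 2 style used in
the code above:

```python
def window_matches(seq_one, seq_two, window):
    """Map every window-sized sub-sequence to its start positions in
    each sequence, intersect the two key sets, then expand back into
    (x, y) co-ordinate pairs suitable for a scatter plot."""
    dict_one = {}
    dict_two = {}
    for seq, section_dict in [(seq_one.upper(), dict_one),
                              (seq_two.upper(), dict_two)]:
        for i in range(len(seq) - window + 1):
            section_dict.setdefault(seq[i:i + window], []).append(i)
    matches = set(dict_one).intersection(dict_two)
    return sorted((i, j) for section in matches
                  for i in dict_one[section]
                  for j in dict_two[section])

# "TTAC" and "TACA" occur in both toy sequences
pairs = window_matches("GATTACA", "TTACAGA", 4)
```

Note this sketch keeps the final window with 'len(seq) - window + 1';
only the shared sub-sequences are ever expanded into co-ordinates, which
is why this approach is so much faster than the all-against-all matrix.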
14.2.4 Plotting the quality scores of sequencing read data
===========================================================

If you are working with second generation sequencing data, you may
want to try plotting the quality data. Here is an example using two
FASTQ files containing paired end reads, SRR001666_1.fastq for the
forward reads, and SRR001666_2.fastq for the reverse reads. These were
downloaded from the NCBI short read archive (SRA) FTP site (see the
SRR001666 page (8) and
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/ for details).

In the following code the 'pylab.subplot(...)' function is used in
order to show the forward and reverse qualities on two subplots, side by
side. There is also a little bit of code to only plot the first fifty
reads.
<<import pylab
from Bio import SeqIO
for subfigure in [1,2] :
    filename = "SRR001666_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i,record in enumerate(SeqIO.parse(open(filename), "fastq")) :
        if i >= 50 : break #trick!
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
You should note that we are using the 'Bio.SeqIO' format name fastq
here because the NCBI has saved these reads using the standard Sanger
FASTQ format with PHRED scores. However, as you might guess from the
read lengths, this data was from an Illumina Genome Analyzer and was
probably originally in one of the two Solexa/Illumina FASTQ variant file
formats.

This example uses the 'pylab.savefig(...)' function instead of
'pylab.show(...)', but as mentioned before both are useful. Here is
the result:

*images/SRR001666.png*
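If you are curious where those PHRED quality values come from, a Sanger
FASTQ quality string encodes each score as a printable ASCII character
with an offset of 33. A minimal decoder (hypothetical helper name, not
part of Biopython) looks like this:

```python
def sanger_phred_qualities(quality_string):
    """Decode a Sanger FASTQ quality string: each character encodes
    PHRED score = ASCII code - 33. These are the same values Bio.SeqIO
    exposes as record.letter_annotations["phred_quality"]."""
    return [ord(letter) - 33 for letter in quality_string]

# "!" is PHRED 0 and "I" is PHRED 40
quals = sanger_phred_qualities("!+5?I")
```

The Solexa/Illumina variants mentioned above use a different offset (and
in one case a different score definition), which is exactly why getting
the 'Bio.SeqIO' format name right matters.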
14.3 Dealing with alignments
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

This section can be seen as a follow on to Chapter 6.

14.3.1 Calculating summary information
=======================================

Once you have an alignment, you are very likely going to want to find
out information about it. Instead of trying to have all of the functions
that can generate information about an alignment in the alignment object
itself, we've tried to separate out the functionality into separate
classes, which act on the alignment.

Getting ready to calculate summary information about an object is
quick to do. Let's say we've got an alignment object called 'alignment',
for example read in using 'Bio.AlignIO.read(...)' as described in
Chapter 6. All we need to do to get an object that will calculate
summary information is:
<<from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)

The 'summary_align' object is very useful, and will do the following
neat things for you:

1. Calculate a quick consensus sequence -- see section 14.3.2
2. Get a position specific score matrix for the alignment -- see
section 14.3.3
3. Calculate the information content for the alignment -- see
section 14.3.4
4. Generate information on substitutions in the alignment --
section 14.4 details using this to generate a substitution matrix.
14.3.2 Calculating a quick consensus sequence
==============================================

The 'SummaryInfo' object, described in section 14.3.1, provides
functionality to calculate a quick consensus of an alignment. Assuming
we've got a 'SummaryInfo' object called 'summary_align' we can calculate
a consensus by doing:

<<consensus = summary_align.dumb_consensus()

As the name suggests, this is a really simple consensus calculator,
and will just add up all of the residues at each point in the consensus,
and if the most common value is higher than some threshold value will
add the common residue to the consensus. If it doesn't reach the
threshold, it adds an ambiguity character to the consensus. The returned
consensus object is a Seq object whose alphabet is inferred from the
alphabets of the sequences making up the consensus. So doing a 'print
consensus' would give:

Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can adjust how 'dumb_consensus' works by passing optional
parameters:

the threshold
    This is the threshold specifying how common a particular residue
    has to be at a position before it is added. The default is 0.7
    (meaning 70%).

the ambiguous character
    This is the ambiguity character to use. The default is 'N'.

the consensus alphabet
    This is the alphabet to use for the consensus sequence. If an
    alphabet is not specified then we will try to guess the alphabet
    based on the alphabets of the sequences in the alignment.
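The thresholded-majority idea behind 'dumb_consensus' can be sketched in
a few lines of plain Python. This is a simplified illustration, not the
actual Biopython implementation (which also handles alphabets, and
treats ties differently):

```python
def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Column-by-column consensus in the spirit of dumb_consensus:
    take the most common residue in each column if its frequency
    reaches the threshold, otherwise the ambiguity character."""
    consensus = []
    for column in zip(*sequences):
        best = max(set(column), key=column.count)
        if column.count(best) / float(len(column)) >= threshold:
            consensus.append(best)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)

# Third column is G,G,C: 2/3 is below the 0.7 threshold, so we get N
result = simple_consensus(["ATGC", "ATGC", "ATCC"])
```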
14.3.3 Position Specific Score Matrices
========================================

Position specific score matrices (PSSMs) summarize the alignment
information in a different way than a consensus, and may be useful for
different tasks. Basically, a PSSM is a count matrix. For each column in
the alignment, the number of occurrences of each alphabet letter is
counted and totaled. The totals are displayed relative to some
representative sequence along the left axis. This sequence may be the
consensus sequence, but can also be any sequence in the alignment.
Let's assume we've got an alignment object called 'c_align'. To get a
PSSM with the consensus sequence along the side we first get a summary
object and calculate the consensus sequence:

<<summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()

Now, we want to make the PSSM, but ignore any 'N' ambiguity residues
when calculating this:

<<my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                    chars_to_ignore = ['N'])

Two notes should be made about this:
1. To maintain strictness with the alphabets, you can only include
characters along the top of the PSSM that are in the alphabet of the
alignment object. Gaps are not included along the top axis of the
PSSM.

2. The sequence passed to be displayed along the left side of the
axis does not need to be the consensus. For instance, if you wanted
to display the second sequence in the alignment along this axis, you
would need to do:

<<second_seq = alignment.get_seq_by_num(1)
my_pssm = summary_align.pos_specific_score_matrix(second_seq,
                                                  chars_to_ignore = ['N'])

The command above returns a 'PSSM' object. To print out the PSSM as we
showed above, we simply need to do a 'print my_pssm', which gives:
You can access any element of the PSSM by subscripting like
'your_pssm[sequence_number][residue_count_name]'. For instance, to get
the counts for the 'A' residue in the second element of the above PSSM
you would do:

<<>>> print my_pssm[1]["A"]

The structure of the PSSM class hopefully makes it easy both to access
elements and to pretty print the matrix.
14.3.4 Information Content
===========================

A potentially useful measure of evolutionary conservation is the
information content of a sequence.

A useful introduction to information theory targeted towards
molecular biologists can be found at
http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we
will be looking at the information content of a consensus sequence, or a
portion of a consensus sequence. We calculate information content at a
particular column in a multiple sequence alignment using the following
formula:

IC_j = sum_{i=1}^{N_a} P_ij log(P_ij / Q_i)

where:

- IC_j -- The information content for the j-th column in an
alignment.

- N_a -- The number of letters in the alphabet.

- P_ij -- The frequency of a particular letter i in the j-th column
(i. e. if G occurred 3 out of 6 times in an alignment column, this
would be 0.5).

- Q_i -- The expected frequency of a letter i. This is an optional
argument, usage of which is left at the user's discretion. By
default, it is automatically assigned to 0.05 = 1/20 for a protein
alphabet, and 0.25 = 1/4 for a nucleic acid alphabet. This is for
getting the information content without any assumption of prior
distributions. When assuming priors, or when using a non-standard
alphabet, you should supply the values for Q_i.
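Putting the formula together for a single column (a plain Python sketch
assuming a uniform nucleotide Q_i of 0.25, and base 2 logarithms as used
by default below; not Biopython's actual implementation):

```python
from math import log

def column_information_content(column, expected_freq=0.25):
    """Shannon information content of one alignment column:
    IC_j = sum_i P_ij * log2(P_ij / Q_i), with uniform expected
    frequency Q_i for every letter."""
    ic = 0.0
    for letter in set(column):
        p = column.count(letter) / float(len(column))
        ic += p * log(p / expected_freq, 2)
    return ic

# Half G, half A gives 1 bit; a fully conserved column gives 2 bits
ic_mixed = column_information_content("GGGAAA")
ic_conserved = column_information_content("GGGGGG")
```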
Well, now that we have an idea what information content is being
calculated in Biopython, let's look at how to get it for a particular
region of the alignment.

First, we need to use our alignment to get an alignment summary object,
which we'll assume is called 'summary_align' (see section 14.3.1 for
instructions on how to get this). Once we've got this object, calculating
the information content for a region is as easy as:

<<info_content = summary_align.information_content(5, 30,
                                                   chars_to_ignore = ['N'])

Wow, that was much easier than the formula above made it look! The
variable 'info_content' now contains a float value specifying the
information content over the specified region (from 5 to 30 of the
alignment). We specifically ignore the ambiguity residue 'N' when
calculating the information content, since this value is not included in
our alphabet (so we shouldn't be interested in looking at it!).

As mentioned above, we can also calculate relative information content
by supplying the expected frequencies:
The expected frequencies should not be passed as a raw dictionary, but
instead be passed as a 'SubsMat.FreqTable' object (see section 16.2.2
for more information about FreqTables). The FreqTable object provides a
standard for associating the dictionary with an Alphabet, similar to how
the Biopython Seq class works.

To create a FreqTable object from the frequency dictionary you just
need to do:

<<from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable

e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)
Now that we've got that, calculating the relative information content
for our region of the alignment is as simple as:

<<info_content = summary_align.information_content(5, 30,
                                                   e_freq_table = e_freq_table,
                                                   chars_to_ignore = ['N'])

Now, 'info_content' will contain the relative information content over
the region in relation to the expected frequencies.

The value returned is calculated using base 2 as the logarithm base in
the formula above. You can modify this by passing the parameter
'log_base' as the base you want:

<<info_content = summary_align.information_content(5, 30, log_base = 10,
                                                   e_freq_table = e_freq_table,
                                                   chars_to_ignore = ['N'])

Well, now you are ready to calculate information content. If you want
to try applying this to some real life problems, it would probably be
best to dig into the literature on information content to get an idea of
how it is used. Hopefully your digging won't reveal any mistakes made in
coding this function!
14.4 Substitution Matrices
*=*=*=*=*=*=*=*=*=*=*=*=*=*

Substitution matrices are an extremely important part of everyday
bioinformatics work. They provide the scoring terms for classifying how
likely two different residues are to substitute for each other. This is
essential in doing sequence comparisons. The book "Biological Sequence
Analysis" by Durbin et al. provides a really nice introduction to
Substitution Matrices and their uses. Some famous substitution matrices
are the PAM and BLOSUM series of matrices.

Biopython provides a ton of common substitution matrices, and also
provides functionality for creating your own substitution matrices.

14.4.1 Using common substitution matrices
==========================================
14.4.2 Creating your own substitution matrix from an alignment
===============================================================

A very cool thing that you can do easily with the substitution matrix
classes is to create your own substitution matrix from an alignment. In
practice, this is normally done with protein alignments. In this
example, we'll first get a Biopython alignment object and then get a
summary object to calculate info about the alignment. The file
protein.aln (9) (also available online here (10)) contains the Clustalw
alignment output.

<<from Bio import Clustalw
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo

# get an alignment object from a Clustalw alignment output
c_align = Clustalw.parse_file("protein.aln", IUPAC.protein)
summary_align = AlignInfo.SummaryInfo(c_align)
Sections 6.3.1 and 14.3.1 contain more information on doing this.

Now that we've got our 'summary_align' object, we want to use it to
find out the number of times different residues substitute for each
other. To make the example more readable, we'll focus on only amino
acids with polar charged side chains. Luckily, this can be done easily
when generating a replacement dictionary, by passing in all of the
characters that should be ignored. Thus we'll create a dictionary of
replacements for only charged polar amino acids using:

<<replace_info = summary_align.replacement_dictionary(["G", "A", "V",
This information about amino acid replacements is represented as a
python dictionary which will look something like:

<<{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'):
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'):
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how
often we expect different things to substitute for each other. It turns
out, amazingly enough, that this is all of the information we need to go
ahead and create a substitution matrix. First, we use the replacement
dictionary information to create an Accepted Replacement Matrix (ARM):

<<from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)

With this accepted replacement matrix, we can go right ahead and
create our log odds matrix (i. e. a standard type Substitution Matrix):

<<my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following
optional arguments:

- 'exp_freq_table' -- You can pass a table of expected frequencies
for each alphabet. If supplied, this will be used instead of the
passed accepted replacement matrix when calculating expected
replacements.

- 'logbase' -- The base of the logarithm taken to create the log odd
matrix. Defaults to base 10.

- 'factor' -- The factor to multiply each matrix entry by. This
defaults to 10, which normally makes the matrix numbers easy to work
with.

- 'round_digit' -- The digit to round to in the matrix. This defaults
to 0 (i. e. no digits).
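To see how those three options interact, here is the arithmetic for a
single matrix cell in plain Python (an illustrative sketch of the log
odds calculation, not Biopython's actual SubsMat code; the function name
is made up):

```python
from math import log

def log_odds_cell(observed_count, total_count, expected_freq,
                  logbase=10, factor=10, round_digits=0):
    """One cell of a log odds matrix: factor * log_base(observed
    frequency / expected frequency), rounded -- mirroring the
    logbase, factor and round_digit options described above."""
    observed_freq = observed_count / float(total_count)
    return round(factor * log(observed_freq / expected_freq, logbase),
                 round_digits)

# A pair seen 20 times out of 100, against an expected frequency of
# 0.05: 10 * log10(0.2 / 0.05) = 10 * log10(4), rounded to 6.0
cell = log_odds_cell(20, 100, 0.05)
```

Positive cells mark pairs seen more often than chance would predict,
negative cells pairs seen less often.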
Once you've got your log odds matrix, you can display it prettily
using the function 'print_mat'. Doing this on our created matrix gives:

<<>>> my_lom.print_mat()

Very nice. Now we've got our very own substitution matrix to play
with!
14.5 BioSQL -- storing sequences in a relational database
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

BioSQL (11) is a joint effort between the OBF (12) projects
(BioPerl, BioJava etc) to support a shared database schema for storing
sequence data. In theory, you could load a GenBank file into the
database with BioPerl, then using Biopython extract this from the
database as a record object with features - and get more or less the
same thing as if you had loaded the GenBank file directly as a SeqRecord
using 'Bio.SeqIO' (Chapter 5).

Biopython's BioSQL module is currently documented at
http://biopython.org/wiki/BioSQL which is part of our wiki pages.
The 'Bio.InterPro' module works with files from the InterPro database,
which can be obtained from the InterPro project:
http://www.ebi.ac.uk/interpro/.

The 'Bio.InterPro.Record' contains all the information stored in an
InterPro record. Its string representation also is a valid InterPro
record, but it is NOT guaranteed to be equivalent to the record from
which it was produced.

The 'Bio.InterPro.Record' contains:
-----------------------------------

(1) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb

(2) http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html

(3) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna

(4) http://matplotlib.sourceforge.net/

(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(8) http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=SRR001666

(9) examples/protein.aln

(10) http://biopython.org/DIST/docs/tutorial/examples/protein.aln

(11) http://www.biosql.org/

(12) http://open-bio.org/
Chapter 15 The Biopython testing framework
*********************************************

Biopython has a regression testing framework (the file 'run_tests.py')
based on unittest (1), the standard unit testing framework for Python.
Providing comprehensive tests for modules is one of the most important
aspects of making sure that the Biopython code is as bug-free as
possible before going out. It also tends to be one of the most
undervalued aspects of contributing. This chapter is designed to make
running the Biopython tests and writing good test code as easy as
possible. Ideally, every module that goes into Biopython should have a
test (and should also have documentation!). All our developers, and
anyone installing Biopython from source, are strongly encouraged to run
the tests.
15.1 Running the tests
*=*=*=*=*=*=*=*=*=*=*=*

When you download the Biopython source code, or check it out from our
source code repository, you should find a subdirectory called 'Tests'.
This contains the key script 'run_tests.py', lots of individual scripts
named 'test_XXX.py', a subdirectory called 'output' and lots of other
subdirectories which contain input files for the test suite.

As part of building and installing Biopython you will typically run
the full test suite at the command line from the Biopython source top
level directory using the following:

<<python setup.py test

This is actually equivalent to going to the 'Tests' subdirectory and
running:

<<python run_tests.py

You'll often want to run just some of the tests, and this is done like
this:

<<python run_tests.py test_SeqIO.py test_AlignIO.py

When giving the list of tests, the '.py' extension is optional, so you
can also just type:

<<python run_tests.py test_SeqIO test_AlignIO

To run the docstring tests (see section 15.3), you can use

<<python run_tests.py doctest

By default, 'run_tests.py' runs all tests, including the docstring
tests.
If an individual test is failing, you can also try running it
directly, which may give you more information.

Importantly, note that the individual unit tests come in two types:

- Simple print-and-compare scripts. These unit tests are essentially
short example Python programs, which print out various output text.
For a test file named 'test_XXX.py' there will be a matching text
file called 'test_XXX' under the 'output' subdirectory which contains
the expected output. All that the test framework does is run the
script, and check the output agrees.

- Standard 'unittest'-based tests. These will 'import unittest' and
then define 'unittest.TestCase' classes, each with one or more
sub-tests as methods starting with 'test_' which check some specific
aspect of the code. These tests should not print any output directly.
10895
Currently, about half of the Biopython tests are 'unittest'-style
10896
tests, and half are print-and-compare tests.
10897
Running a simple print-and-compare test directly will usually give
10898
lots of output on screen, but does not check the output matches the
10899
expected output. If the test is failing with an exception error, it
10900
should be very easy to locate where exactly the script is failing. For
10901
an example of a print-and-compare test, try:
10902
<<python test_SeqIO.py
10905
The 'unittest'-based tests instead show you exactly which
10906
sub-section(s) of the test are failing. For example,
10907
<<python test_Cluster.py
10913
15.2 Writing tests
*=*=*=*=*=*=*=*=*=*
Let's say you want to write some tests for a module called 'Biospam'.
This can be a module you wrote, or an existing module that doesn't have
any tests yet. In the examples below, we assume that 'Biospam' is a
module that does simple math.
10920
Each Biopython test can have three important files and directories
involved with it:
1. 'test_Biospam.py' -- The actual test code for your module.
2. 'Biospam' [optional] -- A directory where any necessary input files
will be located. Any output files that will be generated should also
be written here (and preferably cleaned up after the tests are done)
to prevent clogging up the main Tests directory.
3. 'output/Biospam' -- [for print-and-compare tests only] This file
contains the expected output from running 'test_Biospam.py'. This
file is not needed for 'unittest'-style tests, since there the
validation is done in the test script 'test_Biospam.py' itself.
10934
It's up to you to decide whether you want to write a print-and-compare
test script or a 'unittest'-style test script. The important thing is
that you cannot mix these two styles in a single test script.
Particularly, don't use 'unittest' features in a print-and-compare test.
Any script with a 'test_' prefix in the 'Tests' directory will be
found and run by 'run_tests.py'. Below, we show an example test script
'test_Biospam.py' both for a print-and-compare test and for a
'unittest'-based test. If you put this script in the Biopython 'Tests'
directory, then 'run_tests.py' will find it and execute the tests
within it:
<<$ python run_tests.py
test_AlignIO ... ok
test_BioSQL_SeqIO ... ok
test_Biospam ... ok
test_Clustalw ... ok
----------------------------------------------------------------------
Ran 107 tests in 86.127 seconds
10960
15.2.1 Writing a print-and-compare test
========================================
A print-and-compare style test should be much simpler for beginners or
novices to write - essentially it is just an example script using your
module.
Here is what you should do to make a print-and-compare test for the
'Biospam' module:
1. Write a script called 'test_Biospam.py'
- This script should live in the Tests directory
- The script should test all of the important functionality of the
module (the more you test the better your test is, of course!).
- Try to avoid anything which might be platform specific, such as
printing floating point numbers without using an explicit
formatting string to avoid having too many decimal places
(different platforms can give very slightly different values).
10985
2. If the script requires files to do the testing, these should go in
the directory Tests/Biospam (if you just need something generic, like
a FASTA sequence file, or a GenBank record, try and use an existing
sample input file instead).
3. Write out the test output and verify the output to be correct.
There are two ways to do this:
- Run the script and write its output to a file. On UNIX
(including Linux and Mac OS X) machines, you would do something
like: 'python test_Biospam.py > test_Biospam' which would write
the output to the file 'test_Biospam'.
- Manually look at the file 'test_Biospam' to make sure the
output is correct. When you are sure it is all right and there
are no bugs, you need to quickly edit the 'test_Biospam' file
so that the first line is: 'test_Biospam' (no quotes).
- Copy the 'test_Biospam' file to the directory Tests/output
- Run 'python run_tests.py -g test_Biospam.py'. The regression
testing framework is nifty enough that it'll put the output in
the right place in just the way it likes it.
- Go to the output (which should be in
'Tests/output/test_Biospam') and double check the output to
make sure it is all correct.
4. Now change to the Tests directory and run the regression tests
with 'python run_tests.py'. This will run all of the tests, and you
should see your test run (and pass!).
5. That's it! Now you've got a nice test for your module ready to
check in, or submit to Biopython. Congratulations!
11031
As an example, the 'test_Biospam.py' test script to test the
'addition' and 'multiplication' functions in the 'Biospam' module could
look like this:
<<from Bio import Biospam
print "2 + 3 =", Biospam.addition(2, 3)
print "9 - 1 =", Biospam.addition(9, -1)
print "2 * 3 =", Biospam.multiplication(2, 3)
print "9 * (- 1) =", Biospam.multiplication(9, -1)
We generate the corresponding output with 'python run_tests.py -g
test_Biospam.py', and check the output file 'output/test_Biospam':
11051
Often, the difficulty with larger print-and-compare tests is to keep
track of which line in the output corresponds to which command in the
test script. For this purpose, it is important to print out some markers
to help you match lines in the input script with the generated output.
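For example, a print-and-compare script might interleave marker lines with its real output (a sketch only; 'addition' below is a local stand-in for the hypothetical 'Biospam.addition' function):

```python
def addition(a, b):
    # Stand-in for the hypothetical Biospam.addition function.
    return a + b

# Marker lines make it easy to match a chunk of output back to the
# command in the script that produced it:
print("check1: simple addition")
print("2 + 3 = %i" % addition(2, 3))
print("check2: addition with a negative number")
print("9 - 1 = %i" % addition(9, -1))
```

If a comparison fails, the nearest "checkN" marker in the expected output file tells you which part of the script to look at.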
11057
15.2.2 Writing a unittest-based test
=====================================
We want all the modules in Biopython to have unit tests, and a simple
print-and-compare test is better than no test at all. However, although
there is a steeper learning curve, using the 'unittest' framework gives
a more structured result, and if there is a test failure this can
clearly pinpoint which part of the test is going wrong. The sub-tests
can also be run individually, which is helpful for testing or debugging.
The 'unittest' framework has been included with Python since version
2.1, and is documented in the Python Library Reference (which I know you
are keeping under your pillow, as recommended). There is also online
documentation for unittest (2). If you are familiar with the 'unittest'
system (or something similar like the nose test framework), you
shouldn't have any trouble. You may find looking at the existing
examples within Biopython helpful too.
11073
Here's a minimal 'unittest'-style test script for 'Biospam', which you
can copy and paste to get started:
<<import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
11104
In the division tests, we use 'assertAlmostEqual' instead of
'assertEqual' to avoid tests failing due to roundoff errors; see the
'unittest' chapter in the Python documentation for details and for other
functionality available in 'unittest' (online reference (3)).
11108
These are the key points of 'unittest'-based tests:
- Test cases are stored in classes that derive from
'unittest.TestCase' and cover one basic aspect of your code
- You can use methods 'setUp' and 'tearDown' for any repeated code
which should be run before and after each test method. For example,
the 'setUp' method might be used to create an instance of the object
you are testing, or open a file handle. The 'tearDown' should do any
"tidying up", for example closing the file handle.
- The tests are prefixed with 'test_' and each test should cover one
specific part of what you are trying to test. You can have as many
tests as you want in a class.
- At the end of the test script, you can use
<<if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
to execute the tests when the script is run by itself (rather than
imported from 'run_tests.py'). If you run this script, then you'll
see something like the following:
<<$ python test_BiospamMyModule.py
test_addition1 (__main__.BiospamTestAddition) ... ok
test_addition2 (__main__.BiospamTestAddition) ... ok
test_division1 (__main__.BiospamTestDivision) ... ok
test_division2 (__main__.BiospamTestDivision) ... ok
-------------------------------------------------------------------
Ran 4 tests in 0.059s

OK
- To indicate more clearly what each test is doing, you can add
docstrings to each test. These are shown when running the tests,
which can be useful information if a test is failing.
11150
<<import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        """An addition test"""
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        """A second addition test"""
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        """Now let's check division"""
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        """A second division test"""
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
Running the script will now show you:
<<$ python test_BiospamMyModule.py
An addition test ... ok
A second addition test ... ok
Now let's check division ... ok
A second division test ... ok
-------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
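The 'setUp' and 'tearDown' methods mentioned earlier can be sketched as follows (a generic illustration, not a real Biopython test; here each test gets a fresh scratch file which is tidied away afterwards):

```python
import os
import tempfile
import unittest

class FileHandleTest(unittest.TestCase):
    # setUp runs before *each* test method, tearDown after each one.

    def setUp(self):
        # Create a scratch file and open a handle to it for every test:
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        self.handle = open(self.path, "w")

    def tearDown(self):
        # Tidy up: close the handle and remove the scratch file.
        self.handle.close()
        os.remove(self.path)

    def test_write(self):
        self.handle.write("spam")
        self.handle.flush()
        self.assertTrue(os.path.getsize(self.path) > 0)

if __name__ == "__main__":
    unittest.main()
```

Because the fixture is rebuilt for every test method, the tests stay independent of each other and can safely be run individually.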
11196
If your module contains docstring tests (see section 15.3), you may
want to include those in the tests to be run. You can do so as follows
by modifying the code under 'if __name__ == "__main__":' to look like
this:
<<if __name__ == "__main__":
    unittest_suite = unittest.TestLoader().loadTestsFromName("test_Biospam")
    doctest_suite = doctest.DocTestSuite(Biospam)
    suite = unittest.TestSuite((unittest_suite, doctest_suite))
    runner = unittest.TextTestRunner(sys.stdout, verbosity = 2)
    runner.run(suite)
This is only relevant if you want to run the docstring tests when you
execute 'python test_Biospam.py'; with 'python run_tests.py', the
docstring tests are run automatically (assuming they are included in the
list of docstring tests in 'run_tests.py', see the section below).
11215
15.3 Writing doctests
*=*=*=*=*=*=*=*=*=*=*=
Python modules, classes and functions support built-in documentation
using docstrings. The doctest framework (4) (included with Python)
allows the developer to embed working examples in the docstrings, and
have these examples automatically tested.
Currently only a small part of Biopython includes doctests. The
'run_tests.py' script takes care of running the doctests. For this
purpose, at the top of the 'run_tests.py' script is a manually compiled
list of modules to test, which allows us to skip modules with optional
external dependencies which may not be installed (e.g. the ReportLab and
NumPy libraries). So, if you've added some doctests to the docstrings in
a Biopython module, in order to have them included in the Biopython test
suite, you must update 'run_tests.py' to include your module. Currently,
the relevant part of 'run_tests.py' looks as follows:
11232
<<# This is the list of modules containing docstring tests.
# If you develop docstring tests for other modules, please add
# those modules here.
DOCTEST_MODULES = ["Bio.Seq",
                   "Bio.Align.Generic",
                   "Bio.KEGG.Compound",
                  ]
#Silently ignore any doctests for modules requiring numpy!
try:
    import numpy
    DOCTEST_MODULES.extend(["Bio.Statistics.lowess"])
except ImportError:
    pass
11253
Note that we regard doctests primarily as documentation, so you should
stick to typical usage. Generally complicated examples dealing with
error conditions and the like would be best left to a dedicated unit
test.
Note that if you want to write doctests involving file parsing,
defining the file location complicates matters. Ideally use relative
paths assuming the code will be run from the 'Tests' directory, see the
'Bio.SeqIO' doctests for an example of this.
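As a sketch, a doctest is simply an interpreter session pasted into a docstring; 'gc_fraction' below is a hypothetical helper, not an actual Biopython function:

```python
def gc_fraction(seq):
    """Return the fraction of G and C letters in a nucleotide string.

    A typical usage example, which doctest can check automatically:

    >>> gc_fraction("ATGC")
    0.5
    >>> gc_fraction("AAAA")
    0.0
    """
    return float(seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    # Verify every docstring example in this module:
    import doctest
    doctest.testmod()
```

The doctest framework runs each '>>>' line and compares the actual output against the line that follows it, so the documentation can never silently drift out of date.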
11261
To run the docstring tests only, use
<<$ python run_tests.py doctest
-----------------------------------
(1) http://docs.python.org/library/unittest.html
(2) http://docs.python.org/library/unittest.html
(3) http://docs.python.org/library/unittest.html
(4) http://docs.python.org/library/doctest.html
11277
Chapter 16 Advanced
**********************
16.1 Parser Design
*=*=*=*=*=*=*=*=*=*
Many of the older Biopython parsers were built around an
event-oriented design that includes Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line,
sending off an event whenever they recognize some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an 'organism_name' event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
'organism_name' event, and then processes it in whatever manner is
necessary in the current application.
This is a very flexible framework, which is advantageous if you want
to be able to parse a file format into more than one representation.
For example, the 'Bio.GenBank' module uses this to construct either
'SeqRecord' objects or file-format-specific record objects.
More recently, many of the parsers added for 'Bio.SeqIO' and
'Bio.AlignIO' take a much simpler approach, but only generate a single
object representation ('SeqRecord' and 'Alignment' objects
respectively). In some cases the 'Bio.SeqIO' parsers actually wrap
another Biopython parser - for example, the 'Bio.SwissProt' parser
produces SwissProt format specific record objects, which get converted
into 'SeqRecord' objects.
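A minimal sketch of the Scanner/Consumer pattern described above (the class names, line prefixes and events here are made up for illustration; they are not the actual Bio.GenBank machinery):

```python
class Scanner(object):
    """Reads lines from a data source and fires events at a consumer."""

    def feed(self, handle, consumer):
        for line in handle:
            line = line.strip()
            if line.startswith("ORGANISM "):
                # Event: we recognized an organism name on this line.
                consumer.organism_name(line[len("ORGANISM "):])
            elif line.startswith("SEQUENCE "):
                consumer.sequence(line[len("SEQUENCE "):])

class DictConsumer(object):
    """One of many possible consumers: collect events into a dict."""

    def __init__(self):
        self.record = {}

    def organism_name(self, name):
        self.record["organism"] = name

    def sequence(self, seq):
        self.record["sequence"] = seq

# Any iterable of lines will do as the data source:
lines = ["ORGANISM Escherichia coli", "SEQUENCE ATGCATGC"]
consumer = DictConsumer()
Scanner().feed(lines, consumer)
# consumer.record now maps "organism" and "sequence" to the parsed values
```

The flexibility comes from the fact that the same Scanner can be paired with a different consumer class to build a completely different representation of the same file.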
11310
16.2 Substitution Matrices
*=*=*=*=*=*=*=*=*=*=*=*=*=*
16.2.1 SubsMat
===============
This module provides a class and a few routines for generating
substitution matrices, similar to BLOSUM or PAM matrices, but based on
user-provided data.
Additionally, you may select a matrix from MatrixInfo.py, a collection
of established substitution matrices.
11324
<<class SeqMat(UserDict.UserDict)
An instance of this class has the following attributes:
1. 'self.data': a dictionary in the form of '{(i1,j1):n1,
(i1,j2):n2,...,(ik,jk):nk}' where i, j are alphabet letters, and n
is a value.
2. 'self.alphabet': a class as defined in Bio.Alphabet
3. 'self.ab_list': a list of the alphabet's letters, sorted.
Needed mainly for internal purposes.
4. 'self.sum_letters': a dictionary, '{i1: s1, i2: s2,...,in:sn}',
where:
1. i: an alphabet letter;
2. s: sum of all values in a half-matrix for that letter;
3. n: number of letters in alphabet.
11355
<<__init__(self,data=None,alphabet=None,
mat_type=NOTYPE,mat_name='',build_later=0):
1. 'data': can be either a dictionary, or another SeqMat instance.
2. 'alphabet': a Bio.Alphabet instance. If not provided,
construct an alphabet from data.
3. 'mat_type': type of matrix generated. One of the following:
NOTYPE No type defined
ACCREP Accepted Replacements Matrix
OBSFREQ Observed Frequency Matrix
EXPFREQ Expected Frequency Matrix
SUBS Substitution Matrix
LO Log Odds Matrix
'mat_type' is provided automatically by some of SubsMat's
functions.
4. 'mat_name': matrix name, such as "BLOSUM62" or "PAM250"
5. 'build_later': default false. If true, the user may supply only an
alphabet and an empty dictionary, if intending to build the matrix
later. This skips the sanity check of alphabet size vs. matrix size.
11389
<<entropy(self,obs_freq_mat)
1. 'obs_freq_mat': an observed frequency matrix. Returns the
matrix's entropy, based on the frequency in 'obs_freq_mat'. The
matrix instance should be LO or SUBS.
<<letter_sum(self,letter)
Returns the sum of all values in the matrix for the provided letter.
<<all_letters_sum(self)
Fills the dictionary attribute 'self.sum_letters' with the sum of
values for each letter in the matrix's alphabet.
<<print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
Prints the matrix to file handle f. 'format' is the format field for
the matrix values; 'bottomformat' is the format field for the
bottom row, containing matrix letters. Example output for a
3-letter alphabet matrix:
The 'alphabet' optional argument is a string of all characters in
the alphabet. If supplied, the order of letters along the axes is
taken from the string, rather than by alphabetical order.
11433
The following section is laid out in the order by which most people
wish to generate a log-odds matrix. Of course, interim matrices can
be generated and investigated. Most people just want a log-odds
matrix, that's all.
11440
1. Generating an Accepted Replacement Matrix
Initially, you should generate an accepted replacement matrix (ARM)
from your data. The values in the ARM are the counted number of
replacements according to your data. The data could be a set of
pairs or multiple alignments. So for instance if Alanine was
replaced by Cysteine 10 times, and Cysteine by Alanine 12 times,
the corresponding ARM entries would be:
<<('A','C'): 10, ('C','A'): 12
As order doesn't matter, the user can provide only one entry:
<<('A','C'): 22
11454
A SeqMat instance may be initialized with either a full matrix (first
method of counting: 10, 12) or a half matrix (the latter method, 22).
A full protein alphabet matrix would be of the size
20x20 = 400. A half matrix of that alphabet would be 20x20/2 +
20/2 = 210. That is because same-letter entries don't change (the
matrix diagonal). Given an alphabet size of N:
1. Full matrix size: N*N
2. Half matrix size: N(N+1)/2
The SeqMat constructor automatically generates a half-matrix, if a
full matrix is passed. If a half matrix is passed, letters in the
key should be provided in alphabetical order: ('A','C') and not
('C','A').
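The size arithmetic above, and the folding of a full matrix into a half matrix, can be checked in a couple of lines of plain Python (independent of SubsMat):

```python
# For an alphabet of N letters:
N = 20
full_size = N * N                # every ordered pair: 400 for proteins
half_size = N * (N + 1) // 2     # unordered pairs incl. diagonal: 210

# Folding a full matrix into a half matrix: keep each key in
# alphabetical order and add the symmetric counts together.
full = {('A', 'C'): 10, ('C', 'A'): 12, ('A', 'A'): 7}
half = {}
for (i, j), n in full.items():
    key = (min(i, j), max(i, j))
    half[key] = half.get(key, 0) + n
# half now holds ('A','C'): 22 and the untouched diagonal ('A','A'): 7
```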
11470
At this point, if all you wish to do is generate a log-odds matrix,
please go to the section titled Example of Use. The following text
describes the nitty-gritty of internal functions, to be used by
people who wish to investigate their nucleotide/amino-acid
frequency data more thoroughly.
11476
2. Generating the observed frequency matrix (OFM)
<<OFM = SubsMat._build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement
counts, it contains replacement frequencies.
3. Generating an expected frequency matrix (EFM)
<<EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
1. 'exp_freq_table': should be a FreqTable instance. See
section 16.2.2 for detailed information on FreqTable. Briefly,
the expected frequency table has the frequencies of appearance
for each member of the alphabet. It is implemented as a
dictionary with the alphabet letters as keys, and each letter's
frequency as a value. Values sum to 1.
The expected frequency table can (and generally should) be generated
from the observed frequency matrix. So in most cases you will
generate 'exp_freq_table' using:
<<>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
>>> EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
But you can supply your own 'exp_freq_table', if you wish.
11507
4. Generating a substitution frequency matrix (SFM)
<<SFM = SubsMat._build_subs_mat(OFM,EFM)
Accepts an OFM and an EFM. Provides the quotient of the
corresponding values.
5. Generating a log-odds matrix (LOM)
<<LOM = SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=0])
Accepts an SFM, and the following optional arguments:
1. 'logbase': base of the logarithm used to generate the
log-odds values.
2. 'factor': factor used to multiply the log-odds values. Each
entry is generated by log(SFM[key])*factor, and rounded to the
'round_digit' place after the decimal point, if required.
11535
As most people would want to generate a log-odds matrix, with minimum
hassle, SubsMat provides one function which does it all:
<<make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
factor=10.0,round_digit=0):
1. 'acc_rep_mat': user provided accepted replacements matrix
2. 'exp_freq_table': expected frequencies table. Used if provided,
if not, generated from the 'acc_rep_mat'.
3. 'logbase': base of logarithm for the log-odds matrix. Default
base 10.
4. 'round_digit': number after decimal digit to which result
should be rounded. Default zero.
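The arithmetic behind the OFM/EFM/SFM/LOM pipeline above can be sketched in plain Python with a toy two-letter alphabet (the expected-frequency convention shown, doubling off-diagonal pairs, is one common choice and may differ in detail from Bio.SubsMat's internals; the counts are made up):

```python
import math

# Toy accepted replacement counts (half matrix, keys in alphabetical order):
arm = {('A', 'A'): 10, ('A', 'C'): 4, ('C', 'C'): 6}
total = sum(arm.values())

# Step 2: observed frequency matrix -- counts turned into frequencies.
ofm = dict((k, float(n) / total) for k, n in arm.items())

# Expected single-letter frequencies from the observed matrix: a letter
# gets its diagonal entry plus half of each off-diagonal entry.
freq = {}
for (i, j), f in ofm.items():
    if i == j:
        freq[i] = freq.get(i, 0.0) + f
    else:
        freq[i] = freq.get(i, 0.0) + f / 2
        freq[j] = freq.get(j, 0.0) + f / 2

# Step 3: expected frequency matrix. Off-diagonal entries are doubled
# because (i,j) and (j,i) are folded into one half-matrix entry.
efm = {}
for (i, j) in ofm:
    if i == j:
        efm[(i, j)] = freq[i] * freq[j]
    else:
        efm[(i, j)] = 2 * freq[i] * freq[j]

# Step 4: substitution frequency matrix -- observed over expected.
sfm = dict((k, ofm[k] / efm[k]) for k in ofm)

# Step 5: log-odds matrix, log base 10, scaled by a factor of 10 and
# rounded to whole numbers.
lom = dict((k, int(round(10 * math.log10(v)))) for k, v in sfm.items())
```

With these counts, pairs seen more often than chance (the diagonal) get positive scores and the rarer A/C replacement gets a negative one.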
11557
16.2.2 FreqTable
=================
<<FreqTable.FreqTable(UserDict.UserDict)
Attributes:
1. 'alphabet': A Bio.Alphabet instance.
2. 'data': frequency dictionary.
3. 'count': count dictionary (in case counts are provided).
Functions:
1. 'read_count(f)': read a count file from stream f, then convert
the counts to frequencies.
2. 'read_freq(f)': read a frequency data file from stream f. Of
course, we then don't have the counts, but it is usually the
letter frequencies which are interesting.
11580
3. Example of use: The expected count of the residues in the database
is sitting in a file, whitespace delimited, in the following format
(example given for a 3-letter alphabet):
<<A   35
B   65
C   100
And will be read using the 'FreqTable.read_count(file_handle)'
function.
An equivalent frequency file:
<<A   0.175
B   0.325
C   0.5
11596
Conversely, the residue frequencies or counts can be passed as a
dictionary. Example of a count dictionary (3-letter alphabet):
<<{'A': 35, 'B': 65, 'C': 100}
Which means that the expected counts would give a 0.5 frequency for
'C', a 0.325 probability of 'B' and a 0.175 probability of 'A' (out of
a total of 200, the sum of A, B and C).
A frequency dictionary for the same data would be:
<<{'A': 0.175, 'B': 0.325, 'C': 0.5}
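The counts-to-frequencies arithmetic above can be verified directly (plain Python, no FreqTable needed):

```python
counts = {'A': 35, 'B': 65, 'C': 100}
total = sum(counts.values())  # 200
# Divide each count by the total to get the frequency dictionary:
freqs = dict((letter, float(n) / total) for letter, n in counts.items())
# freqs is now {'A': 0.175, 'B': 0.325, 'C': 0.5}
```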
11609
When passing a dictionary as an argument, you should indicate whether
it is a count or a frequency dictionary. Therefore the FreqTable
class constructor requires two arguments: the dictionary itself, and
FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies,
respectively.
Read expected counts with 'read_count', which will also generate the
frequencies. Any one of the following may be done to generate the
frequency table:
<<>>> from SubsMat import *
>>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
>>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
>>> ftab = FreqTable.read_count(open('myCountFile'))
>>> ftab = FreqTable.read_frequency(open('myFrequencyFile'))
11628
Chapter 17 Where to go from here -- contributing to Biopython
****************************************************************
17.1 Bug Reports + Feature Requests
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Getting feedback on the Biopython modules is very important to us.
Open-source projects like this benefit greatly from feedback,
bug-reports (and patches!) from a wide variety of contributors.
The main forums for discussing feature requests and potential bugs are
the Biopython mailing lists (1):
- biopython@biopython.org -- An unmoderated list for discussion of
anything to do with Biopython.
- biopython-dev@biopython.org -- A more development oriented list
that is mainly used by developers (but anyone is free to
contribute!).
Additionally, if you think you've found a bug, you can submit it to
our bug-tracking page at http://bugzilla.open-bio.org/. This way, it
won't get buried in anyone's Inbox and forgotten about.
11656
17.2 Mailing lists and helping newcomers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We encourage all our users to sign up to the main Biopython mailing
list. Once you've got the hang of an area of Biopython, we'd encourage
you to help answer questions from beginners. After all, you were a
beginner once.
17.3 Contributing Documentation
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
We're happy to take feedback or contributions - either via a
bug-report or on the Mailing List. While reading this tutorial, perhaps
you noticed some topics you were interested in which were missing, or
not clearly explained. There is also Biopython's built-in documentation
(the docstrings, these are also online (2)), where again, you may be
able to help fill in any blanks.
11678
17.4 Contributing cookbook examples
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
As explained in Chapter 14, Biopython now has a wiki collection of
user contributed "cookbook" examples,
http://biopython.org/wiki/Category:Cookbook -- maybe you can add to
this?
11687
17.5 Maintaining a distribution for a platform
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We currently provide source code archives (suitable for any OS, if you
have the right build tools installed), and Windows Installers which are
just click and run. This covers all the major operating systems.
Most major Linux distributions have volunteers who take these source
code releases, and compile them into packages for Linux users to easily
install (taking care of dependencies etc). This is really great and we
are of course very grateful. If you would like to contribute to this
work, please find out more about how your Linux distribution handles
this.
Below are some tips for certain platforms to maybe get people started
with helping out:
11705
Windows -- Windows products typically have a nice graphical installer
that installs all of the essential components in the right place. We
use Distutils to create an installer of this type fairly easily.
You must first make sure you have a C compiler on your Windows
computer, and that you can compile and install things (this is the
hard bit - see the Biopython installation instructions for info on
how to do this).
Once you are set up with a C compiler, making the installer just
requires doing:
<<python setup.py bdist_wininst
Now you've got a Windows installer. Congrats! At the moment we have no
trouble shipping installers built on 32 bit Windows. If anyone would
like to look into supporting 64 bit Windows that would be great.
11721
RPMs -- RPMs are pretty popular package systems on some Linux
platforms. There is lots of documentation on RPMs available at
http://www.rpm.org to help you get started with them. To create an
RPM for your platform is really easy. You just need to be able to
build the package from source (having a C compiler that works is thus
essential) -- see the Biopython installation instructions for more
info.
To make the RPM, you just need to do:
<<python setup.py bdist_rpm
This will create an RPM for your specific platform and a source RPM in
the directory 'dist'. This RPM should be good and ready to go, so
this is all you need to do! Nice and easy.
11736
Macintosh -- Since Apple moved to Mac OS X, things have become much
easier on the Mac. We generally treat it as just another Unix
variant, and installing Biopython from source is just as easy as on
Linux. The easiest way to get all the GCC compilers etc installed is
to install Apple's X-Code. We might be able to provide click and run
installers for Mac OS X, but to date there hasn't been any demand.
Once you've got a package, please test it on your system to make sure
it installs everything in a good way and seems to work properly. Once
you feel good about it, send it off to one of the Biopython developers
(write to our main mailing list at biopython@biopython.org if you're not
sure who to send it to) and you've done it. Thanks!
11750
17.6 Contributing Unit Tests
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Even if you don't have any new functionality to add to Biopython, but
you want to write some code, please consider extending our unit test
coverage. We've devoted all of Chapter 15 to this topic.
11759
17.7 Contributing Code
*=*=*=*=*=*=*=*=*=*=*=*
There are no barriers to joining Biopython code development other than
an interest in creating biology-related code in Python. The best place
to express an interest is on the Biopython mailing lists -- just let us
know you are interested in coding and what kind of stuff you want to
work on. Normally, we try to have some discussion on modules before
coding them, since that helps generate good ideas -- then just feel free
to jump right in and start coding!
The main Biopython release tries to be fairly uniform and
interworkable, to make it easier for users. You can read about some of
the (fairly informal) coding style guidelines we try to use in Biopython
in the contributing documentation at
http://biopython.org/wiki/Contributing. We also try to add code to the
distribution along with tests (see Chapter 15 for more info on the
regression testing framework) and documentation, so that everything can
stay as workable and well documented as possible (including docstrings).
This is, of course, the most ideal situation; in many situations
you'll be able to find other people on the list who will be willing to
help add documentation or more tests for your code once you make it
available. So, to end this paragraph like the last, feel free to start
working!
11783
Please note that to make a code contribution you must have the legal
11784
right to contribute it and license it under the Biopython license. If
11785
you wrote it all yourself, and it is not based on any other code, this
11786
shouldn't be a problem. However, there are issues if you want to
11787
contribute a derivative work - for example something based on GPL or
11788
LPGL licenced code would not be compatible with our license. If you have
11789
any queries on this, please discuss the issue on the biopython-dev
11791
Another point of concern for any addition to Biopython is any
build-time or run-time dependency it introduces. Generally speaking,
writing code to interact with a standalone tool (like BLAST, EMBOSS or
ClustalW) doesn't present a big problem. However, any dependency on
another library - even a Python library (especially one needed in order
to compile and install Biopython, like NumPy) - would need further
discussion.
Additionally, if you have code that you don't think fits in the
distribution, but that you want to make available, we maintain Script
Central (http://biopython.org/wiki/Scriptcentral) which has pointers to
freely available code in Python for bioinformatics.
Hopefully this documentation has got you excited enough about
Biopython to try it out (and most importantly, contribute!). Thanks for
reading all the way through!
-----------------------------------
(1) http://biopython.org/wiki/Mailing_lists
(2) http://biopython.org/DIST/docs/api
Chapter 18 Appendix: Useful stuff about Python
**********************************************
If you haven't spent a lot of time programming in Python, many
questions and problems that come up in using Biopython are often
related to Python itself. This section tries to present some ideas and
code that come up often (at least for us!) while using the Biopython
libraries. If you have any suggestions for useful pointers that could
go here, please contribute!
18.1 What the heck is a handle?
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Handles are mentioned quite frequently throughout this documentation,
and are also fairly confusing (at least to me!). Basically, you can
think of a handle as being a "wrapper" around text information.

Handles provide (at least) two benefits over plain text information:
1. They provide a standard way to deal with information stored in
different ways. The text information can be in a file, or in a string
stored in memory, or the output from a command line program, or at
some remote website, but the handle provides a common way of dealing
with information in all of these formats.

2. They allow text information to be read incrementally, instead of
all at once. This is really important when you are dealing with huge
text files which would use up all of your memory if you had to load
it all at once.
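The first benefit can be sketched with a small example: the same
function works whether its handle wraps a file on disk or a string in
memory. This sketch uses io.StringIO (the modern standard library
equivalent of the cStringIO module used later in this chapter), and the
count_fasta_records helper is purely illustrative, not part of
Biopython:

```python
import io

def count_fasta_records(handle):
    # Count FASTA records by counting header lines starting with ">".
    # The function neither knows nor cares where the handle's text
    # comes from.
    return sum(1 for line in handle if line.startswith(">"))

fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n"

# A handle wrapping an in-memory string:
print(count_fasta_records(io.StringIO(fasta_text)))  # prints 2

# A handle wrapping a file on disk:
with open("two_records.fasta", "w") as out:
    out.write(fasta_text)
with open("two_records.fasta") as handle:
    print(count_fasta_records(handle))  # prints 2
```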
Handles can deal with text information that is being read
(e.g. reading from a file) or written (e.g. writing information to a
file). In the case of a "read" handle, commonly used functions are
'read()', which reads the entire text information from the handle, and
'readline()', which reads information one line at a time. For "write"
handles, the function 'write()' is regularly used.
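For completeness, here is a minimal sketch of a "write" handle in
action (the file name example_output.txt is made up for this
illustration):

```python
# Open a write handle, write two lines, then close it so the data
# is flushed to disk.
handle = open("example_output.txt", "w")
handle.write("first line\n")
handle.write("second line\n")
handle.close()

# Read the same file back through a read handle.
handle = open("example_output.txt", "r")
print(handle.read())
handle.close()
```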
The most common usage for handles is reading information from a file,
which is done using the built-in Python function 'open'. Here, we open
a handle to the file m_cold.fasta (1) (also available online here (2)):

>>> handle = open("m_cold.fasta", "r")
>>> handle.readline()
">gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum ...\n"
Handles are regularly used in Biopython for passing information to
parsers.
18.1.1 Creating a handle from a string
=======================================
One useful thing is to be able to turn information contained in a
string into a handle. The following example shows how to do this using
'cStringIO' from the Python standard library:

>>> my_info = 'A string\n with multiple lines.'
>>> print my_info
A string
 with multiple lines.
>>> import cStringIO
>>> my_info_handle = cStringIO.StringIO(my_info)
>>> first_line = my_info_handle.readline()
>>> print first_line
A string

>>> second_line = my_info_handle.readline()
>>> print second_line
 with multiple lines.
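Note that the 'cStringIO' module belongs to the Python 2 series that
was current when this tutorial was written; on Python 3 the equivalent
handle class is io.StringIO, which behaves the same way for this
example:

```python
import io

my_info = 'A string\n with multiple lines.'
my_info_handle = io.StringIO(my_info)

# readline() returns one line at a time, just as with cStringIO.
print(my_info_handle.readline())  # "A string" plus its newline
print(my_info_handle.readline())  # " with multiple lines."
```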
-----------------------------------------------------------------------
This document was translated from LaTeX by HeVeA (3).
-----------------------------------
(1) examples/m_cold.fasta
(2) http://biopython.org/DIST/docs/tutorial/examples/m_cold.fasta
(3) http://hevea.inria.fr/index.html