Biopython Tutorial and Cookbook
*******************************

Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck,
===========================================================
Michiel de Hoon, Peter Cock, Tiago Antao
========================================
Last Update -- 16 August 2009 (Biopython 1.51)
==============================================
- Chapter 1 Introduction
  - 1.1 What is Biopython?
  - 1.2 What can I find in the Biopython package
  - 1.3 Installing Biopython
- Chapter 2 Quick Start -- What can you do with Biopython?
  - 2.1 General overview of what Biopython provides
  - 2.2 Working with sequences
  - 2.4 Parsing sequence file formats
    - 2.4.1 Simple FASTA parsing example
    - 2.4.2 Simple GenBank parsing example
    - 2.4.3 I love parsing -- please don't stop talking about it!
  - 2.5 Connecting with biological databases
- Chapter 3 Sequence objects
  - 3.1 Sequences and Alphabets
  - 3.2 Sequences act like strings
  - 3.3 Slicing a sequence
  - 3.4 Turning Seq objects into strings
  - 3.5 Concatenating or adding sequences
  - 3.6 Nucleotide sequences and (reverse) complements
  - 3.9 Translation Tables
  - 3.10 Comparing Seq objects
  - 3.11 MutableSeq objects
  - 3.12 UnknownSeq objects
  - 3.13 Working with strings directly
- Chapter 4 Sequence Record objects
  - 4.1 The SeqRecord object
  - 4.2 Creating a SeqRecord
    - 4.2.1 SeqRecord objects from scratch
    - 4.2.2 SeqRecord objects from FASTA files
    - 4.2.3 SeqRecord objects from GenBank files
  - 4.3 SeqFeature objects
    - 4.3.1 SeqFeatures themselves
  - 4.5 The format method
  - 4.6 Slicing a SeqRecord
- Chapter 5 Sequence Input/Output
  - 5.1 Parsing or Reading Sequences
    - 5.1.1 Reading Sequence Files
    - 5.1.2 Iterating over the records in a sequence file
    - 5.1.3 Getting a list of the records in a sequence file
    - 5.1.4 Extracting data
  - 5.2 Parsing sequences from the net
    - 5.2.1 Parsing GenBank records from the net
    - 5.2.2 Parsing SwissProt sequences from the net
  - 5.3 Sequence files as Dictionaries
    - 5.3.1 Specifying the dictionary keys
    - 5.3.2 Indexing a dictionary using the SEGUID checksum
  - 5.4 Writing Sequence Files
    - 5.4.1 Converting between sequence file formats
    - 5.4.2 Converting a file of sequences to their reverse complements
    - 5.4.3 Getting your SeqRecord objects as formatted strings
- Chapter 6 Sequence Alignment Input/Output, and Alignment Tools
  - 6.1 Parsing or Reading Sequence Alignments
    - 6.1.1 Single Alignments
    - 6.1.2 Multiple Alignments
    - 6.1.3 Ambiguous Alignments
  - 6.2 Writing Alignments
    - 6.2.1 Converting between sequence alignment file formats
    - 6.2.2 Getting your Alignment objects as formatted strings
  - 6.3 Alignment Tools
    - 6.3.3 MUSCLE using stdout
    - 6.3.4 MUSCLE using stdin and stdout
    - 6.3.5 EMBOSS needle and water
- Chapter 7 BLAST
  - 7.1 Running BLAST locally
  - 7.2 Running BLAST over the Internet
  - 7.3 Saving BLAST output
  - 7.4 Parsing BLAST output
  - 7.5 The BLAST record class
  - 7.6 Deprecated BLAST parsers
    - 7.6.1 Parsing plain-text BLAST output
    - 7.6.2 Parsing a plain-text BLAST file full of BLAST runs
    - 7.6.3 Finding a bad record somewhere in a huge plain-text BLAST file
  - 7.7 Dealing with PSI-BLAST
  - 7.8 Dealing with RPS-BLAST
- Chapter 8 Accessing NCBI's Entrez databases
  - 8.1 Entrez Guidelines
  - 8.2 EInfo: Obtaining information about the Entrez databases
  - 8.3 ESearch: Searching the Entrez databases
  - 8.4 EPost: Uploading a list of identifiers
  - 8.5 ESummary: Retrieving summaries from primary IDs
  - 8.6 EFetch: Downloading full records from Entrez
  - 8.7 ELink: Searching for related items in NCBI Entrez
  - 8.8 EGQuery: Obtaining counts for search terms
  - 8.9 ESpell: Obtaining spelling suggestions
  - 8.10 Specialized parsers
    - 8.10.1 Parsing Medline records
    - 8.10.2 Parsing GEO records
    - 8.10.3 Parsing UniGene records
    - 8.12.1 PubMed and Medline
    - 8.12.2 Searching, downloading, and parsing Entrez Nucleotide records
    - 8.12.3 Searching, downloading, and parsing GenBank records
    - 8.12.4 Finding the lineage of an organism
  - 8.13 Using the history and WebEnv
    - 8.13.1 Searching for and downloading sequences using the history
    - 8.13.2 Searching for and downloading abstracts using the history
- Chapter 9 Swiss-Prot and ExPASy
  - 9.1 Parsing Swiss-Prot files
    - 9.1.1 Parsing Swiss-Prot records
    - 9.1.2 Parsing the Swiss-Prot keyword and category list
  - 9.2 Parsing Prosite records
  - 9.3 Parsing Prosite documentation records
  - 9.4 Parsing Enzyme records
  - 9.5 Accessing the ExPASy server
    - 9.5.1 Retrieving a Swiss-Prot record
    - 9.5.2 Searching Swiss-Prot
    - 9.5.3 Retrieving Prosite and Prosite documentation records
  - 9.6 Scanning the Prosite database
- Chapter 10 Going 3D: The PDB module
  - 10.1 Structure representation
    - 10.2.1 General approach
    - 10.2.2 Disordered atoms
    - 10.2.3 Disordered residues
  - 10.3 Hetero residues
    - 10.3.1 Associated problems
    - 10.3.2 Water residues
    - 10.3.3 Other hetero residues
  - 10.4 Some random usage examples
  - 10.5 Common problems in PDB files
    - 10.5.2 Automatic correction
    - 10.5.3 Fatal errors
  - 10.6 Other features
- Chapter 11 Bio.PopGen: Population genetics
  - 11.2 Coalescent simulation
    - 11.2.1 Creating scenarios
    - 11.2.2 Running SIMCOAL2
  - 11.3 Other applications
    - 11.3.1 FDist: Detecting selection and molecular adaptation
  - 11.4 Future Developments
- Chapter 12 Supervised learning methods
  - 12.1 The Logistic Regression Model
    - 12.1.1 Background and Purpose
    - 12.1.2 Training the logistic regression model
    - 12.1.3 Using the logistic regression model for classification
    - 12.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
  - 12.2 k-Nearest Neighbors
    - 12.2.1 Background and purpose
    - 12.2.2 Initializing a k-nearest neighbors model
    - 12.2.3 Using a k-nearest neighbors model for classification
  - 12.4 Maximum Entropy
- Chapter 13 Graphics including GenomeDiagram
    - 13.1.1 Introduction
    - 13.1.2 Diagrams, tracks, feature-sets and features
    - 13.1.3 A top down example
    - 13.1.4 A bottom up example
    - 13.1.5 Features without a SeqFeature
    - 13.1.6 Feature captions
    - 13.1.7 Feature sigils
    - 13.1.8 A nice example
    - 13.1.9 Further options
    - 13.1.10 Converting old code
- Chapter 14 Cookbook -- Cool things to do with it
  - 14.1 Working with sequence files
    - 14.1.1 Producing randomised genomes
    - 14.1.2 Translating a FASTA file of CDS entries
    - 14.1.3 Simple quality filtering for FASTQ files
    - 14.1.4 Trimming off primer sequences
    - 14.1.5 Trimming off adaptor sequences
    - 14.1.6 Converting FASTQ files
    - 14.1.7 Identifying open reading frames
  - 14.2 Sequence parsing plus simple plots
    - 14.2.1 Histogram of sequence lengths
    - 14.2.2 Plot of sequence GC%
    - 14.2.3 Nucleotide dot plots
    - 14.2.4 Plotting the quality scores of sequencing read data
  - 14.3 Dealing with alignments
    - 14.3.1 Calculating summary information
    - 14.3.2 Calculating a quick consensus sequence
    - 14.3.3 Position Specific Score Matrices
    - 14.3.4 Information Content
  - 14.4 Substitution Matrices
    - 14.4.1 Using common substitution matrices
    - 14.4.2 Creating your own substitution matrix from an alignment
  - 14.5 BioSQL -- storing sequences in a relational database
- Chapter 15 The Biopython testing framework
  - 15.1 Running the tests
    - 15.2.1 Writing a print-and-compare test
    - 15.2.2 Writing a unittest-based test
  - 15.3 Writing doctests
- Chapter 16 Advanced
  - 16.2 Substitution Matrices
- Chapter 17 Where to go from here -- contributing to Biopython
  - 17.1 Bug Reports + Feature Requests
  - 17.2 Mailing lists and helping newcomers
  - 17.3 Contributing Documentation
  - 17.4 Contributing cookbook examples
  - 17.5 Maintaining a distribution for a platform
  - 17.6 Contributing Unit Tests
  - 17.7 Contributing Code
- Chapter 18 Appendix: Useful stuff about Python
  - 18.1 What the heck is a handle?
    - 18.1.1 Creating a handle from a string
Chapter 1 Introduction
*************************

1.1 What is Biopython?
*=*=*=*=*=*=*=*=*=*=*=*

The Biopython Project is an international association of developers of
freely available Python (http://www.python.org) tools for computational
molecular biology. The web site http://www.biopython.org provides an
online resource for modules, scripts, and web links for developers of
Python-based software for life science research.

Basically, we just like to program in Python and want to make it as
easy as possible to use Python for bioinformatics by creating
high-quality, reusable modules and scripts.
1.2 What can I find in the Biopython package
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

The main Biopython releases have lots of functionality, including:

- The ability to parse bioinformatics files into Python utilizable
  data structures, including support for the following formats:

  - Blast output -- both from standalone and WWW Blast
  - ExPASy files, like Enzyme and Prosite
  - SCOP, including `dom' and `lin' files

- Files in the supported formats can be iterated over record by
  record or indexed and accessed via a Dictionary interface.
- Code to deal with popular on-line bioinformatics destinations such
  as:

  - NCBI -- Blast, Entrez and PubMed services
  - ExPASy -- Swiss-Prot and Prosite entries, as well as Prosite
    searches

- Interfaces to common bioinformatics programs such as:

  - Standalone Blast from NCBI
  - Clustalw alignment program
  - EMBOSS command line tools

- A standard sequence class that deals with sequences, ids on
  sequences, and sequence features.
- Tools for performing common operations on sequences, such as
  translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors,
  Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to
  create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate
  processes.
- GUI-based programs to do basic sequence manipulations,
  translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including
  this file, on-line wiki documentation, the web site, and the mailing
  lists.
- Integration with BioSQL, a sequence database schema also supported
  by the BioPerl and BioJava projects.

We hope this gives you plenty of reasons to download and start using
Biopython!
1.3 Installing Biopython
*=*=*=*=*=*=*=*=*=*=*=*=*

All of the installation information for Biopython was separated from
this document to make it easier to keep updated.

The short version is go to our downloads page
(http://biopython.org/wiki/Download), download and install the listed
dependencies, then download and install Biopython. For Windows we
provide pre-compiled click-and-run installers, while for Unix and other
operating systems you must install from source as described in the
included README file. This is usually as simple as the standard
commands:

<<python setup.py build
python setup.py test
sudo python setup.py install
>>

(You can in fact skip the build and test, and go straight to the
install -- but it's better to make sure everything seems to be working.)

The longer version of our installation instructions covers
installation of Python, Biopython dependencies and Biopython itself. It
is available in PDF
(http://biopython.org/DIST/docs/install/Installation.pdf) and HTML
formats (http://biopython.org/DIST/docs/install/Installation.html).
1.4 Frequently Asked Questions (FAQ)
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

1. How do I cite Biopython in a scientific publication?
   Please cite our application note, Cock et al. 2009,
   doi:10.1093/bioinformatics/btp163 (1), and/or one of the publications
   listed on our website describing specific modules within Biopython.

2. How should I capitalize "Biopython"? Is "BioPython" OK?
   The correct capitalization is "Biopython", not "BioPython" (even
   though that would have matched BioPerl, BioJava and BioRuby).

3. Where is the latest version of this document?
   If you download a Biopython source code archive, it will include the
   relevant version in both HTML and PDF formats. The latest published
   version of this document is online at:

   - http://biopython.org/DIST/docs/tutorial/Tutorial.html
   - http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

4. Which "Numerical Python" do I need?
   For Biopython 1.48 or earlier, you need the old Numeric module. For
   Biopython 1.49 onwards, you need the newer NumPy instead. Both
   Numeric and NumPy can be installed on the same machine fine. See
   also: http://numpy.scipy.org/

5. Why is the 'Seq' object missing the (back) transcription &
   translation methods described in this Tutorial?
   You need Biopython 1.49 or later. Alternatively, use the 'Bio.Seq'
   module functions described in Section 3.13.

6. Why doesn't the 'Seq' object translation method support the 'cds'
   option described in this Tutorial?
   You need Biopython 1.51 or later.

7. Why doesn't 'Bio.SeqIO' work? It imports fine but there is no
   parse function!
   You need Biopython 1.43 or later. Older versions did contain some
   related code under the 'Bio.SeqIO' name which has since been removed
   - and this is why the import "works".

8. Why doesn't 'Bio.SeqIO.read()' work? The module imports fine but
   there is no read function!
   You need Biopython 1.45 or later. Or, use Bio.SeqIO.parse(...).next()
   instead.

9. Why isn't 'Bio.AlignIO' present? The module import fails!
   You need Biopython 1.46 or later.

10. What file formats do 'Bio.SeqIO' and 'Bio.AlignIO' read and
    write?
    Check the built in docstrings (from Bio import SeqIO, then
    help(SeqIO)), or see http://biopython.org/wiki/SeqIO and
    http://biopython.org/wiki/AlignIO on the wiki for the latest listing.

11. Why don't the 'Bio.SeqIO' and 'Bio.AlignIO' input functions let
    me provide a sequence alphabet?
    You need Biopython 1.49 or later.

12. Why doesn't 'str(...)' give me the full sequence of a 'Seq'
    object?
    You need Biopython 1.45 or later. Alternatively, rather than
    'str(my_seq)', use 'my_seq.tostring()' (which will also work on
    recent versions of Biopython).

13. Why doesn't 'Bio.Blast' work with the latest plain text NCBI
    blast output?
    The NCBI keep tweaking the plain text output from the BLAST tools,
    and keeping our parser up to date was an ongoing struggle. We
    recommend you use the XML output instead, which is designed to be
    read by a computer program.

14. Why doesn't 'Bio.Entrez.read()' work? The module imports fine but
    there is no read function!
    You need Biopython 1.46 or later.

15. Why doesn't 'Bio.PDB.MMCIFParser' work? I see an import error!
    Since Biopython 1.42, the underlying 'Bio.PDB.mmCIF.MMCIFlex' module
    has not been installed by default. It requires a third party tool
    called flex (fast lexical analyzer generator). At the time of
    writing, you'll have to install flex, then tweak your Biopython
    'setup.py' file and reinstall from source.

16. Why doesn't 'Bio.Blast.NCBIXML.read()' work? The module imports
    fine but there is no read function!
    You need Biopython 1.50 or later. Or, use
    Bio.Blast.NCBIXML.parse(...).next() instead.

17. Why doesn't my 'SeqRecord' object have a 'letter_annotations'
    attribute?
    Per-letter-annotation support was added in Biopython 1.50.

18. Why can't I slice my 'SeqRecord' to get a sub-record?
    You need Biopython 1.50 or later.

19. I looked in a directory for code, but I couldn't find the code
    that does something. Where's it hidden?
    One thing to know is that we put code in '__init__.py' files. If you
    are not used to looking for code in this file this can be confusing.
    The reason we do this is to make the imports easier for users. For
    instance, instead of having to do a "repetitive" import like 'from
    Bio.GenBank import GenBank', you can just use 'from Bio import
    GenBank'.

-----------------------------------

(1) http://dx.doi.org/10.1093/bioinformatics/btp163
Chapter 2 Quick Start -- What can you do with Biopython?
***********************************************************

This section is designed to get you started quickly with Biopython,
and to give a general overview of what is available and how to use it.
All of the examples in this section assume that you have some general
working knowledge of Python, and that you have successfully installed
Biopython on your system. If you think you need to brush up on your
Python, the main Python web site provides quite a bit of free
documentation to get started with (http://www.python.org/doc/).

Since much biological work on the computer involves connecting with
databases on the internet, some of the examples will also require a
working internet connection in order to run.

Now that that is all out of the way, let's get into what we can do
with Biopython.
2.1 General overview of what Biopython provides
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

As mentioned in the introduction, Biopython is a set of libraries to
provide the ability to deal with "things" of interest to biologists
working on the computer. In general this means that you will need to
have at least some programming experience (in Python, of course!) or at
least an interest in learning to program. Biopython's job is to make
your job easier as a programmer by supplying reusable libraries so that
you can focus on answering your specific question of interest, instead
of focusing on the internals of parsing a particular file format (of
course, if you want to help by writing a parser that doesn't exist and
contributing it to Biopython, please go ahead!). So Biopython's job is
to make you happy!

One thing to note about Biopython is that it often provides multiple
ways of "doing the same thing." Things have improved in recent releases,
but this can still be frustrating as in Python there should ideally be
one right way to do something. However, this can also be a real benefit
because it gives you lots of flexibility and control over the libraries.
The tutorial helps to show you the common or easy ways to do things so
that you can just make things work. To learn more about the alternative
possibilities, look in the Cookbook (Chapter 14, this has some cool
tricks and tips), the Advanced section (Chapter 16), the built in
"docstrings" (via the Python help command, or the API documentation (1))
or ultimately the code itself.
2.2 Working with sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*

Disputably (of course!), the central object in bioinformatics is the
sequence. Thus, we'll start with a quick introduction to the Biopython
mechanisms for dealing with sequences, the 'Seq' object, which we'll
discuss in more detail in Chapter 3.

Most of the time when we think about sequences we have in mind a
string of letters like 'AGTACACTGGT'. You can create such a 'Seq'
object with this sequence as follows - the ">>>" represents the Python
prompt followed by what you would type in:

<<>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>

What we have here is a sequence object with a generic alphabet -
reflecting the fact we have not specified if this is a DNA or protein
sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines
and Threonines!). We'll talk more about alphabets in Chapter 3.

In addition to having an alphabet, the 'Seq' object differs from the
Python string in the methods it supports. You can't do this with a plain
string:

<<>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
>>
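If you are curious what complement() and reverse_complement() are doing, the idea can be sketched in a few lines of plain Python (a toy illustration only - Biopython's real methods also handle alphabets and IUPAC ambiguity codes):

```python
# Toy sketch of complementing a DNA string -- not Biopython's code.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    # Swap each base for its Watson-Crick partner.
    return "".join(PAIRS[base] for base in seq)

def reverse_complement(seq):
    # Complement each base, then read the result backwards.
    return complement(seq)[::-1]

print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

Note the results match the Seq example above; in real code you would of course just call the Seq object's methods.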
The next most important class is the 'SeqRecord' or Sequence Record.
This holds a sequence (as a 'Seq' object) with additional annotation
including an identifier, name and description. The 'Bio.SeqIO' module
for reading and writing sequence file formats works with 'SeqRecord'
objects, which will be introduced below and covered in more detail by
Chapter 5.

This covers the basic features and uses of the Biopython sequence
class. Now that you've got some idea of what it is like to interact with
the Biopython libraries, it's time to delve into the fun, fun world of
dealing with biological file formats!
2.3 A usage example
*=*=*=*=*=*=*=*=*=*=

Before we jump right into parsers and everything else to do with
Biopython, let's set up an example to motivate everything we do and make
life more interesting. After all, if there wasn't any biology in this
tutorial, why would you want to read it?

Since I love plants, I think we're just going to have to have a plant
based example (sorry to all the fans of other organisms out there!).
Having just completed a recent trip to our local greenhouse, we've
suddenly developed an incredible obsession with Lady Slipper Orchids (if
you wonder why, have a look at some Lady Slipper Orchids photos on
Flickr (2), or try a Google Image Search (3)).

Of course, orchids are not only beautiful to look at, they are also
extremely interesting for people studying evolution and systematics. So
let's suppose we're thinking about writing a funding proposal to do a
molecular study of Lady Slipper evolution, and would like to see what
kind of research has already been done and how we can add to that.

After a little bit of reading up we discover that the Lady Slipper
Orchids are in the Orchidaceae family and the Cypripedioideae sub-family
and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium,
Selenipedium and Mexipedium.

That gives us enough to get started delving for more information. So,
let's look at how the Biopython tools can help us. We'll start with
sequence parsing in Section 2.4, but the orchids will be back later on
as well - for example we'll search PubMed for papers about orchids and
extract sequence data from GenBank in Chapter 8, extract data from
Swiss-Prot from certain orchid proteins in Chapter 9, and work with
ClustalW multiple sequence alignments of orchid proteins in Chapter 6.
2.4 Parsing sequence file formats
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

A large part of bioinformatics work involves dealing with the many
types of file formats designed to hold biological data. These files
are loaded with interesting biological data, and a special challenge is
parsing these files into a format so that you can manipulate them with
some kind of programming language. However the task of parsing these
files can be frustrated by the fact that the formats can change quite
regularly, and that formats may contain small subtleties which can break
even the most well designed parsers.

We are now going to briefly introduce the 'Bio.SeqIO' module -- you
can find out more in Chapter 5. We'll start with an online search for
our friends, the lady slipper orchids. To keep this introduction simple,
we're just using the NCBI website by hand. Let's just take a look
through the nucleotide databases at NCBI, using an Entrez online search
(http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for
everything mentioning the text Cypripedioideae (this is the subfamily of
lady slipper orchids).

When this tutorial was originally written, this search gave us only 94
hits, which we saved as a FASTA formatted text file and as a GenBank
formatted text file (files ls_orchid.fasta (4) and ls_orchid.gbk (5),
also included with the Biopython source code under
docs/tutorial/examples/).

If you run the search today, you'll get hundreds of results! When
following the tutorial, if you want to see the same list of genes, just
download the two files above or copy them from 'docs/examples/' in the
Biopython source code. In Section 2.5 we will look at how to do a search
like this from within Python.
2.4.1 Simple FASTA parsing example
===================================

If you open the lady slipper orchids FASTA file ls_orchid.fasta (6) in
your favourite text editor, you'll see that the file starts like this:

<<>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
>>

It contains 94 records, each has a line starting with ">"
(greater-than symbol) followed by the sequence on one or more lines. Now
try this in Python:

<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()
>>

You should get something like this on your screen:

<<gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
SingleLetterAlphabet())
592
>>

Notice that the FASTA format does not specify the alphabet, so
'Bio.SeqIO' has defaulted to the rather generic 'SingleLetterAlphabet()'
rather than something DNA specific.
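Under the hood, a FASTA parser is doing something like the following plain Python sketch (greatly simplified - Bio.SeqIO also builds SeqRecord objects, handles other formats, and copes with edge cases this ignores):

```python
import io  # used only for the in-memory demonstration below

def simple_fasta_parse(handle):
    """Yield (title, sequence) tuples from a FASTA file handle.
    Bare-bones sketch: no alphabets, no error checking."""
    title, seq_lines = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(seq_lines)
            # Start a new record; drop the ">" from the title line.
            title, seq_lines = line[1:], []
        elif line:
            seq_lines.append(line)
    if title is not None:
        yield title, "".join(seq_lines)

# Demonstration on a tiny two record FASTA held in memory:
demo = io.StringIO(">alpha example\nACGTACGT\nACGT\n>beta\nTTTT\n")
for title, seq in simple_fasta_parse(demo):
    print(title, len(seq))
```

In real code you would of course use Bio.SeqIO as shown above, rather than rolling your own parser.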
2.4.2 Simple GenBank parsing example
=====================================

Now let's load the GenBank file ls_orchid.gbk (7) instead - notice
that the code to do this is almost identical to the snippet used above
for the FASTA file - the only difference is we change the filename and
the format string:

<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()
>>

This should give:

<<Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
IUPACAmbiguousDNA())
592
>>

This time 'Bio.SeqIO' has been able to choose a sensible alphabet,
IUPAC Ambiguous DNA. You'll also notice that a shorter string has been
used as the 'seq_record.id' in this case.
2.4.3 I love parsing -- please don't stop talking about it!
============================================================

Biopython has a lot of parsers, and each has its own little special
niches based on the sequence format it is parsing and all of that.
Chapter 5 covers 'Bio.SeqIO' in more detail, while Chapter 6 introduces
'Bio.AlignIO' for sequence alignments.

While the most popular file formats have parsers integrated into
'Bio.SeqIO' and/or 'Bio.AlignIO', for some of the rarer and unloved file
formats there is either no parser at all, or an old parser which has not
been linked in yet. Please also check the wiki pages
http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO
for the latest information, or ask on the mailing list. The wiki pages
should include an up to date list of supported file types, and some
additional examples.

The next place to look for information about specific parsers and how
to do cool things with them is in the Cookbook (Chapter 14 of this
Tutorial). If you don't find the information you are looking for, please
consider helping out your poor overworked documentors and submitting a
cookbook entry about it! (once you figure out how to do it, that is!)
2.5 Connecting with biological databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

One of the very common things that you need to do in bioinformatics is
extract information from biological databases. It can be quite tedious
to access these databases manually, especially if you have a lot of
repetitive work to do. Biopython attempts to save you time and energy by
making some on-line databases available from Python scripts. Currently,
Biopython has code to extract information from the following databases:

- Entrez (8) (and PubMed (9)) from the NCBI -- See Chapter 8.
- ExPASy (10) -- See Chapter 9.
- SCOP (11) -- See the 'Bio.SCOP.search()' function.

The code in these modules basically makes it easy to write Python code
that interacts with the CGI scripts on these pages, so that you can get
results in an easy to deal with format. In some cases, the results can
be tightly integrated with the Biopython parsers to make it even easier
to extract information.
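To give a flavour of what these modules wrap, here is a hand-rolled sketch that just builds (without fetching) an NCBI Entrez "efetch" query URL. This is for illustration only: in practice you should use Bio.Entrez (Chapter 8), which also handles error checking and the NCBI usage guidelines.

```python
# Sketch of the kind of CGI request the database modules construct.
# Uses the modern urllib.parse module; 2009-era code used urllib.
import urllib.parse

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(db, identifier, rettype):
    """Build (but do not fetch) an Entrez efetch query URL."""
    params = {"db": db, "id": identifier,
              "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urllib.parse.urlencode(params)

# e.g. fetch one of our orchid sequences in FASTA format:
print(efetch_url("nucleotide", "Z78533.1", "fasta"))
```

Opening that URL in a browser (or with urllib) returns the record as plain text, which you could then feed straight into Bio.SeqIO.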
2.6 What to do next
*=*=*=*=*=*=*=*=*=*=

Now that you've made it this far, you hopefully have a good
understanding of the basics of Biopython and are ready to start using it
for doing useful work. The best thing to do now is finish reading this
tutorial, and then if you want start snooping around in the source code,
and looking at the automatically generated documentation.

Once you get a picture of what you want to do, and what libraries in
Biopython will do it, you should take a peek at the Cookbook
(Chapter 14), which may have example code to do something similar to
what you want to do.

If you know what you want to do, but can't figure out how to do it,
please feel free to post questions to the main Biopython list (see
http://biopython.org/wiki/Mailing_lists). This will not only help us
answer your question, it will also allow us to improve the documentation
so it can help the next person do what you want to do.
-----------------------------------

(1) http://biopython.org/DIST/docs/api/
(2) http://www.flickr.com/search/?q=lady+slipper+orchid&s=int&z=t
(3) http://images.google.com/images?q=lady%20slipper%20orchid
(4) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(8) http://www.ncbi.nlm.nih.gov/Entrez/
(9) http://www.ncbi.nlm.nih.gov/PubMed/
(10) http://www.expasy.org/
(11) http://scop.mrc-lmb.cam.ac.uk/scop/
Chapter 3 Sequence objects
*****************************

Biological sequences are arguably the central object in
bioinformatics, and in this chapter we'll introduce the Biopython
mechanism for dealing with sequences, the 'Seq' object. Chapter 4 will
introduce the related 'SeqRecord' object, which combines the sequence
information with any annotation, used again in Chapter 5 for Sequence
Input/Output.

Sequences are essentially strings of letters like 'AGTACACTGGT', which
seems very natural since this is the most common way that sequences are
seen in biological file formats.

There are two important differences between 'Seq' objects and standard
Python strings. First of all, they have different methods. Although the
'Seq' object supports many of the same methods as a plain string, its
'translate()' method differs by doing biological translation, and there
are also additional biologically relevant methods like
'reverse_complement()'. Secondly, the 'Seq' object has an important
attribute, 'alphabet', which is an object describing what the individual
characters making up the sequence string "mean", and how they should be
interpreted. For example, is 'AGTACACTGGT' a DNA sequence, or just a
protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
3.1 Sequences and Alphabets
*=*=*=*=*=*=*=*=*=*=*=*=*=*=

The alphabet object is perhaps the most important thing that makes the
'Seq' object more than just a string. The currently available alphabets
for Biopython are defined in the 'Bio.Alphabet' module. We'll use the
IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with
some of our favorite objects: DNA, RNA and Proteins.

'Bio.Alphabet.IUPAC' provides basic definitions for proteins, DNA and
RNA, but additionally provides the ability to extend and customize the
basic definitions. For instance, for proteins, there is a basic
IUPACProtein class, but there is an additional ExtendedIUPACProtein
class providing for the additional elements "U" (or "Sec" for
selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous
symbols "B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx"
for glutamine or glutamic acid), "J" (or "Xle" for leucine or
isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've
got choices of IUPACUnambiguousDNA, which provides for just the basic
letters, IUPACAmbiguousDNA (which provides for ambiguity letters for
every possible situation) and ExtendedIUPACDNA, which allows letters for
modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA
or IUPACUnambiguousRNA.

The advantages of having an alphabet class are two fold. First, this
gives an idea of the type of information the Seq object contains.
Secondly, this provides a means of constraining the information, as a
means of type checking.
958
Now that we know what we are dealing with, let's look at how to
959
utilize this class to do interesting work. You can create an ambiguous
960
sequence with the default generic alphabet like this:
961
<<>>> from Bio.Seq import Seq
962
>>> my_seq = Seq("AGTACACTGGT")
964
Seq('AGTACACTGGT', Alphabet())
969
However, where possible you should specify the alphabet explicitly
970
when creating your sequence objects - in this case an unambiguous DNA
972
<<>>> from Bio.Seq import Seq
973
>>> from Bio.Alphabet import IUPAC
974
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
976
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
978
IUPACUnambiguousDNA()
981
Unless of course, this really is an amino acid sequence:
982
<<>>> from Bio.Seq import Seq
983
>>> from Bio.Alphabet import IUPAC
984
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
986
Seq('AGTACACTGGT', IUPACProtein())
3.2 Sequences act like strings
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
In many ways, we can deal with Seq objects as if they were normal
Python strings, for example getting the length, or iterating over the
elements:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
for index, letter in enumerate(my_seq):
    print index, letter
print len(my_seq)

You can access elements of the sequence in the same way as for strings
(but remember, Python counts from zero!):

>>> print my_seq[0]  # first letter
G
>>> print my_seq[2]  # third letter
T
>>> print my_seq[-1]  # last letter
C

The 'Seq' object has a '.count()' method, just like a string. Note
that this means that, like a Python string, this gives a
non-overlapping count:

>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

For some biological uses, you may actually want an overlapping count
(i.e. 3 in this trivial example). When searching for single letters,
this makes no difference:

>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
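Returning to the overlapping count mentioned above, here is a
plain-Python sketch (the helper name overlapping_count is ours, for
illustration only; it is not a 'Seq' method):

```python
def overlapping_count(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps."""
    count = 0
    start = 0
    while True:
        # find() returns -1 once no further match exists
        start = seq.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count

print(overlapping_count("AAAA", "AA"))  # 3, versus "AAAA".count("AA") == 2
```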
While you could use the above snippet of code to calculate a GC%, note
that the 'Bio.SeqUtils' module has several GC functions already built:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

Note that using the 'Bio.SeqUtils.GC()' function should automatically
cope with mixed case sequences and the ambiguous nucleotide S (which
means G or C).
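For illustration, a plain-Python helper with the same behavior for
mixed case and the ambiguous letter S (gc_percent is our own sketch,
not the actual Bio.SeqUtils implementation):

```python
def gc_percent(sequence):
    """GC percentage, counting the ambiguous nucleotide S (G or C)
    and ignoring case."""
    s = sequence.upper()
    gc = sum(s.count(letter) for letter in "GCS")
    return 100.0 * gc / len(s)

print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))  # 46.875
```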
Also note that just like a normal Python string, the 'Seq' object is
in some ways "read-only". If you need to edit your sequence, for
example simulating a point mutation, look at Section 3.11 below, which
talks about the 'MutableSeq' object.
3.3 Slicing a sequence
*=*=*=*=*=*=*=*=*=*=*=
For a more complicated example, let's get a slice of the sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())

Two things are interesting to note. First, this follows the normal
conventions for Python strings. So the first element of the sequence
is 0 (which is normal for computer science, but not so normal for
biology). When you do a slice the first item is included (i.e. 4 in
this case) and the last is excluded (12 in this case), which is the
way things work in Python, but of course not necessarily the way
everyone in the world would expect. The main goal is to stay
consistent with what Python does.

The second thing to notice is that the slice is performed on the
sequence data string, but the new object produced is another 'Seq'
object which retains the alphabet information from the original 'Seq'
object.

Also like a Python string, you can do slices with a start, stop and
stride (the step size, which defaults to one). For example, we can get
the first, second and third codon positions of this DNA sequence:

>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Another stride trick you might have seen with a Python string is the
use of a -1 stride to reverse the string. You can do this with a 'Seq'
object too:

>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
3.4 Turning Seq objects into strings
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
If you really do just need a plain string, for example to write to a
file, or insert into a database, then this is very easy to get:

>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'

Since calling 'str()' on a 'Seq' object returns the full sequence as a
string, you often don't actually have to do this conversion
explicitly. Python does this automatically with a print statement:

>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC

You can also use the 'Seq' object directly with a '%s' placeholder
when using the Python string formatting or interpolation operator
('%'):

>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print fasta_format_string
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC

This line of code constructs a simple FASTA format record (without
worrying about line wrapping). Section 4.5 describes a neat way to get
a FASTA formatted string from a 'SeqRecord' object, while the more
general topic of reading and writing FASTA format sequence files is
covered in Chapter 5.
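If you do care about line wrapping, a simple formatter is easy to
sketch in plain Python ('Bio.SeqIO' handles this properly, see
Chapter 5; fasta_format here is our own illustrative helper):

```python
def fasta_format(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at width columns."""
    seq = str(sequence)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))

print(fasta_format("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10))
```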
NOTE: If you are using Biopython 1.44 or older, using 'str(my_seq)'
will give just a truncated representation. Instead use
'my_seq.tostring()' (which is still available in the current Biopython
releases for backwards compatibility):

>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
3.5 Concatenating or adding sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Naturally, you can in principle add any two Seq objects together -
just like you can with Python strings to concatenate them. However,
you can't add sequences with incompatible alphabets, such as a protein
sequence and a DNA sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: ('incompatable alphabets', 'IUPACProtein()',
'IUPACUnambiguousDNA()')

If you really wanted to do this, you'd have to first give both
sequences generic alphabets:

>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
Here is an example of adding a generic nucleotide sequence to an
unambiguous IUPAC DNA sequence, resulting in a generic nucleotide
sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
3.6 Nucleotide sequences and (reverse) complements
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
For nucleotide sequences, you can easily obtain the complement or
reverse complement of a 'Seq' object using its built-in methods:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

In all of these operations, the alphabet property is maintained. This
is very useful in case you accidentally end up trying to do something
weird like take the (reverse) complement of a protein sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!
The example in Section 5.4.2 combines the 'Seq' object's reverse
complement method with 'Bio.SeqIO' for sequence input/output.
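For unambiguous DNA, the same operation can be sketched with plain
Python string tools (a simplification: the real 'Seq' method also
understands IUPAC ambiguity codes):

```python
# Reverse complement for plain strings of unambiguous DNA.
# str.maketrans is Python 3; on Python 2 use string.maketrans instead.
def reverse_complement(dna):
    complement = str.maketrans("ACGT", "TGCA")
    return dna.translate(complement)[::-1]

print(reverse_complement("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
# GCGATTTTCGATCCTATATAGGCCCATCGATC -- matching the Seq example above
```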
3.7 Transcription
*=*=*=*=*=*=*=*=*

Before talking about transcription, I want to try and clarify the
strand issue. Consider the following (made up) stretch of double
stranded DNA which encodes a short peptide:

   DNA coding strand (aka Crick strand, strand +1)
5' ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3'
   |||||||||||||||||||||||||||||||||||||||
3' TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5'
   DNA template strand (aka Watson strand, strand -1)

   (transcription)

5' AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3'
   Single stranded messenger RNA
The actual biological transcription process works from the template
strand, doing a reverse complement (TCAG -> CUGA) to give the mRNA.
However, in Biopython and bioinformatics in general, we typically work
directly with the coding strand because this means we can get the mRNA
sequence just by switching T -> U.
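As a plain-string illustration of this shortcut (the 'Seq' object's
transcribe method, shown below, additionally switches the alphabet to
RNA):

```python
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
# Transcription from the coding strand is just the T -> U substitution
messenger_rna = coding_dna.replace("T", "U")
print(messenger_rna)  # AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
# ...and back-transcription is the reverse substitution
assert messenger_rna.replace("U", "T") == coding_dna
```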
Now let's actually get down to doing a transcription in Biopython.
First, let's create 'Seq' objects for the coding and template DNA
strands:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
These should match the figure above - remember by convention
nucleotide sequences are normally read from the 5' to 3' direction,
while in the figure the template strand is shown reversed.

Now let's transcribe the coding strand into the corresponding mRNA,
using the 'Seq' object's built-in 'transcribe' method:

>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T -> U, and adjust the
alphabet.

If you do want to do a true biological transcription starting with the
template strand, then this becomes a two-step process:

>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
The 'Seq' object also includes a back-transcription method for going
from the mRNA to the coding strand of the DNA. Again, this is a simple
U -> T substitution and associated change of alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

Note: The 'Seq' object's 'transcribe' and 'back_transcribe' methods
are new in Biopython 1.49. For older releases you would have to use
the 'Bio.Seq' module's functions instead, see Section 3.13.
3.8 Translation
*=*=*=*=*=*=*=*

Sticking with the same example discussed in the transcription section
above, now let's translate this mRNA into the corresponding protein
sequence - again taking advantage of one of the 'Seq' object's
biological methods:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also translate directly from the coding strand DNA sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You should notice in the above protein sequences that in addition to
the end stop character, there is an internal stop as well. This was a
deliberate choice of example, as it gives an excuse to talk about some
optional arguments, including different translation tables (Genetic
Codes).

The translation tables available in Biopython are based on those from
the NCBI (1) (see the next section of this tutorial). By default,
translation will use the standard genetic code (NCBI table id 1).
Suppose we are dealing with a mitochondrial sequence. We need to tell
the translation function to use the relevant genetic code instead:

>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also specify the table using the NCBI table number, which is
shorter, and often included in the feature annotation of GenBank
files:

>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
Now, you may want to translate the nucleotides up to the first
in-frame stop codon, and then stop (as happens in nature):

>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())

Notice that when you use the 'to_stop' argument, the stop codon itself
is not translated - and the stop symbol is not included at the end of
your protein sequence.
You can even specify the stop symbol if you don't like the default
asterisk:

>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
Now, suppose you have a complete coding sequence CDS, which is to say
a nucleotide sequence (e.g. mRNA -- after any splicing) which is a
whole number of codons (i.e. the length is a multiple of three),
commences with a start codon, ends with a stop codon, and has no
internal in-frame stop codons. In general, given a complete CDS, the
default translate method will do what you want (perhaps with the
'to_stop' option). However, what if your sequence uses a non-standard
start codon? This happens a lot in bacteria -- for example the gene
yaaX in E. coli K12:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
...            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...            generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In the bacterial genetic code GTG is a valid start codon, and while it
does normally encode valine, if used as a start codon it should be
translated as methionine. This happens if you tell Biopython your
sequence is a complete CDS:

>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In addition to telling Biopython to translate an alternative start
codon as methionine, using this option also makes sure your sequence
really is a valid CDS (you'll get an exception if not).

The example in Section 14.1.2 combines the 'Seq' object's translate
method with 'Bio.SeqIO' for sequence input/output.

Note: The 'Seq' object's 'translate' method is new in Biopython 1.49.
For older releases you would have to use the 'Bio.Seq' module's
'translate' function instead, see Section 3.13. The cds option is new
in Biopython 1.51, and there is no simple way to do this with older
versions of Biopython.
3.9 Translation Tables
*=*=*=*=*=*=*=*=*=*=*=
In the previous sections we talked about the 'Seq' object translation
method (and mentioned the equivalent function in the 'Bio.Seq' module
-- see Section 3.13). Internally these use codon table objects derived
from the NCBI information at
ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more
readable layout.

As before, let's just focus on two choices: the Standard translation
table, and the translation table for Vertebrate Mitochondrial DNA.

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

Alternatively, these tables are labeled with ID numbers 1 and 2,
respectively:

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You can compare the actual tables visually by printing them:

>>> print standard_table
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
>>> print mito_table
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
You may find the following properties useful -- for example if you
are trying to do your own gene finding:

>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
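As a sketch of how these lists might feed into gene finding, here is a
toy complete-CDS check in plain Python (looks_like_orf is our own
illustrative helper; the codon lists are copied from the output
above):

```python
# Start/stop codons for the vertebrate mitochondrial code, as printed above.
start_codons = ['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
stop_codons = ['TAA', 'TAG', 'AGA', 'AGG']

def looks_like_orf(dna):
    """True if dna is a whole number of codons, begins with a start
    codon, ends with a stop codon, and has no internal in-frame stop."""
    if len(dna) % 3 != 0 or len(dna) < 6:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return (codons[0] in start_codons
            and codons[-1] in stop_codons
            and not any(c in stop_codons for c in codons[1:-1]))

print(looks_like_orf("ATGACAACATAA"))  # True
```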
3.10 Comparing Seq objects
*=*=*=*=*=*=*=*=*=*=*=*=*=
Sequence comparison is actually a very complicated topic, and there is
no easy way to decide if two sequences are equal. The basic problem is
that the meaning of the letters in a sequence is context dependent -
the letter "A" could be part of a DNA, RNA or protein sequence.
Biopython uses alphabet objects as part of each 'Seq' object to try
and capture this information - so comparing two 'Seq' objects means
considering both the sequence strings and the alphabets.

For example, you might argue that the two DNA 'Seq' objects
Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna)
should be equal, even though they do have different alphabets.
Depending on the context this could be important.

This gets worse -- suppose you think Seq("ACGT",
IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic
alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein)
and Seq("ACGT") should also be equal. Now, in logic, if A=B and B=C,
by transitivity we expect A=C. So for logical consistency we'd require
Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to
be equal -- which most people would agree is just not right. This
transitivity problem would also have implications for using 'Seq'
objects as Python dictionary keys.

So, what does Biopython do? Well, the equality test is the default for
Python objects -- it tests to see if they are the same object in
memory. This is a very strict test:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq1 == seq2
False
>>> seq1 == seq1
True

If you actually want to do this, you can be more explicit by using the
Python 'id' function:

>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
True
Now, in everyday use, your sequences will probably all have the same
alphabet, or at least all be the same type of sequence (all DNA, all
RNA, or all protein). What you probably want is to just compare the
sequences as strings -- so do this explicitly:

>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True

As an extension to this, while you can use a Python dictionary with
'Seq' objects as keys, it is generally more useful to use the sequence
as a string for the key. See also Section 3.4.
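For instance, tallying duplicate sequences with their string form as
the dictionary key (a sketch; the literal strings below stand in for
str(my_seq) values):

```python
# Tally sequences using their string form as dictionary keys.
counts = {}
for seq_string in ["ACGT", "ACGT", "TTTT"]:  # stand-ins for str(my_seq)
    counts[seq_string] = counts.get(seq_string, 0) + 1
print(counts["ACGT"])  # 2
```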
3.11 MutableSeq objects
*=*=*=*=*=*=*=*=*=*=*=*
Just like the normal Python string, the 'Seq' object is "read only",
or in Python terminology, immutable. Apart from wanting the 'Seq'
object to act like a string, this is also a useful default since in
many biological applications you want to ensure you are not changing
your sequence data:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Observe what happens if you try to edit the sequence:

>>> my_seq[5] = "G"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'Seq' instance has no attribute '__setitem__'

However, you can convert it into a mutable sequence (a 'MutableSeq'
object) and do pretty much anything you want with it:

>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Alternatively, you can create a 'MutableSeq' object directly from a
string:

>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Either way will give you a sequence object which can be changed:

>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "T"
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCATGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())
Do note that unlike the 'Seq' object, the 'MutableSeq' object's
methods like 'reverse_complement()' and 'reverse()' act in place!

An important technical difference between mutable and immutable
objects in Python means that you can't use a 'MutableSeq' object as a
dictionary key, but you can use a Python string or a 'Seq' object in
this way.
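The Seq/MutableSeq split is loosely analogous to Python's own str
versus bytearray (an analogy for illustration, not how 'MutableSeq' is
actually implemented):

```python
# bytearray is mutable and unhashable, like MutableSeq; str is
# immutable and usable as a dictionary key, like Seq.
seq = bytearray(b"GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
seq[6] = ord("A")  # simulate a point mutation G -> A, edited in place
print(seq.decode())  # GCCATTATAATGGGCCGCTGAAAGGGTGCCCGA
try:
    {seq: "value"}
except TypeError:
    print("bytearray (like MutableSeq) cannot be a dictionary key")
```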
Once you have finished editing your 'MutableSeq' object, it's easy to
get back to a read-only 'Seq' object should you need to:

>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGTACCG', IUPACUnambiguousDNA())

You can also get a string from a 'MutableSeq' object just like from a
'Seq' object (Section 3.4).
3.12 UnknownSeq objects
*=*=*=*=*=*=*=*=*=*=*=*
Biopython 1.50 introduced another basic sequence object, the
'UnknownSeq' object. This is a subclass of the basic 'Seq' object and
its purpose is to represent a sequence where we know the length, but
not the actual letters making it up. You could of course use a normal
'Seq' object in this situation, but it wastes rather a lot of memory
to hold a string of a million "N" characters when you could just store
a single letter "N" and the desired length as an integer.

>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print unk
????????????????????
>>> len(unk)
20
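The memory argument can be illustrated with plain Python (rough
numbers; exact sizes vary by Python version and build):

```python
import sys

# A million explicit "N" characters cost about a megabyte of memory,
# while "a length plus a character" is just two small objects.
explicit = "N" * 1000000
compact = (1000000, "N")
print(sys.getsizeof(explicit))  # roughly a million bytes
print(sys.getsizeof(compact))   # a few dozen bytes
```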
You can of course specify an alphabet, meaning for nucleotide
sequences the letter defaults to "N" and for proteins "X", rather
than just "?":

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print unk_dna
NNNNNNNNNNNNNNNNNNNN

You can use all the usual 'Seq' object methods too, note these give
back memory saving 'UnknownSeq' objects where appropriate as you might
expect:

>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print unk_protein
XXXXXX
>>> len(unk_protein)
6
You may be able to find a use for the 'UnknownSeq' object in your own
code, but it is more likely that you will first come across them in a
'SeqRecord' object created by 'Bio.SeqIO' (see Chapter 5). Some
sequence file formats don't always include the actual sequence, for
example GenBank and EMBL files may include a list of features but for
the sequence just present the contig information. Alternatively, the
QUAL files used in sequencing work hold quality scores but they never
contain a sequence -- instead there is a partner FASTA file which does
have the sequence.
3.13 Working with strings directly
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
To close this chapter, for those of you who really don't want to use
the sequence objects (or who prefer a functional programming style to
an object orientated one), there are module level functions in
'Bio.Seq' which will accept plain Python strings, 'Seq' objects
(including 'UnknownSeq' objects) or 'MutableSeq' objects:

>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'

You are, however, encouraged to work with 'Seq' objects by default.
-----------------------------------
(1) http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Chapter 4 Sequence Record objects
*********************************
Chapter 3 introduced the sequence classes. Immediately "above" the
'Seq' class is the Sequence Record or 'SeqRecord' class, defined in
the 'Bio.SeqRecord' module. This class allows higher level features
such as identifiers and features to be associated with the sequence,
and is used throughout the sequence input/output interface 'Bio.SeqIO'
described fully in Chapter 5.

If you are only going to be working with simple data like FASTA files,
you can probably skip this chapter for now. If on the other hand you
are going to be using richly annotated sequence data, say from GenBank
or EMBL files, this information is quite important.

While this chapter should cover most things to do with the 'SeqRecord'
object, you may also want to read the 'SeqRecord' wiki page
(http://biopython.org/wiki/SeqRecord), and the built in documentation
(also online (1)):

>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...
4.1 The SeqRecord object
*=*=*=*=*=*=*=*=*=*=*=*=
The 'SeqRecord' (Sequence Record) class is defined in the
'Bio.SeqRecord' module. This class allows higher level features such
as identifiers and features to be associated with a sequence (see
Chapter 3), and is the basic data type for the 'Bio.SeqIO' sequence
input/output interface (see Chapter 5).

The 'SeqRecord' class itself is quite simple, and offers the following
information as attributes:

seq -- The sequence itself, typically a 'Seq' object.

id -- The primary ID used to identify the sequence -- a string. In
most cases this is something like an accession number.

name -- A "common" name/id for the sequence -- a string. In some cases
this will be the same as the accession number, but it could also be a
clone name. I think of this as being analogous to the LOCUS id in a
GenBank record.

description -- A human readable description or expressive name for the
sequence -- a string.

letter_annotations -- Holds per-letter-annotations using a
(restricted) dictionary of additional information about the letters
in the sequence. The keys are the name of the information, and the
information is contained in the value as a Python sequence (i.e. a
list, tuple or string) with the same length as the sequence itself.
This is often used for quality scores (e.g. Section 14.1.3) or
secondary structure information (e.g. from Stockholm/PFAM alignment
files).

annotations -- A dictionary of additional information about the
sequence. The keys are the name of the information, and the
information is contained in the value. This allows the addition of
more "unstructured" information to the sequence.

features -- A list of 'SeqFeature' objects with more structured
information about the features on a sequence (e.g. position of genes
on a genome, or domains on a protein sequence). The structure of
sequence features is described below in Section 4.3.

dbxrefs -- A list of database cross-references as strings.
1790
4.2 Creating a SeqRecord
*=*=*=*=*=*=*=*=*=*=*=*=*

Using a 'SeqRecord' object is not very complicated, since all of the
information is presented as attributes of the class. Usually you won't
create a 'SeqRecord' "by hand", but instead use 'Bio.SeqIO' to read in a
sequence file for you (see Chapter 5 and the examples below). However,
creating a 'SeqRecord' can be quite simple.

4.2.1 SeqRecord objects from scratch
=====================================

To create a 'SeqRecord' at a minimum you just need a 'Seq' object:
<<>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
1811
Additionally, you can also pass the id, name and description to the
initialization function, but if not they will be set as strings
indicating they are unknown, and can be modified subsequently:
<<>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print simple_seq_r.description
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
1825
Including an identifier is very important if you want to output your
'SeqRecord' to a file. You would normally include this when creating the
object:
<<>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
1834
As mentioned above, the 'SeqRecord' has a dictionary attribute
'annotations'. This is used for any miscellaneous annotations that
don't fit under one of the other more specific attributes. Adding
annotations is easy, and just involves dealing directly with the
annotation dictionary:
<<>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
>>> print simple_seq_r.annotations["evidence"]
None. I just made it up.
1846
Working with per-letter-annotations is similar; 'letter_annotations'
is a dictionary like attribute which will let you assign any Python
sequence (i.e. a string, list or tuple) which has the same length as the
sequence itself:
<<>>> simple_seq_r.letter_annotations["phred_quality"] = [40,40,38,30]
>>> print simple_seq_r.letter_annotations
{'phred_quality': [40, 40, 38, 30]}
>>> print simple_seq_r.letter_annotations["phred_quality"]
[40, 40, 38, 30]

The 'dbxrefs' and 'features' attributes are just Python lists, and
should be used to store strings and 'SeqFeature' objects (discussed
later in this chapter) respectively.
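The length restriction enforced by 'letter_annotations' can be sketched
in plain Python. This is just an illustration of the idea (the helper
name is made up); it is not Biopython's actual implementation:

```python
def set_letter_annotation(sequence, letter_annotations, key, value):
    # A per-letter annotation must be a Python sequence (list, tuple
    # or string) of exactly the same length as the sequence itself.
    if len(value) != len(sequence):
        raise ValueError("Annotation length %i does not match sequence "
                         "length %i" % (len(value), len(sequence)))
    letter_annotations[key] = value

annotations = {}
set_letter_annotation("GATC", annotations, "phred_quality", [40, 40, 38, 30])
```

Trying to assign a value of the wrong length (say a three-entry quality
list for a four-letter sequence) raises a 'ValueError', which is
essentially the behaviour the restricted dictionary gives you.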
1862
4.2.2 SeqRecord objects from FASTA files
=========================================

This example uses a fairly large FASTA file containing the whole
sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
originally downloaded from the NCBI. This file is included with the
Biopython unit tests under the GenBank folder, or online as
NC_005816.fna (2) from our website.
The file starts like this -- and you can check there is only one record
present (i.e. only one line starting with a greater than symbol):
<<>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ...
pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
1878
Back in Chapter 2 you will have seen the function
'Bio.SeqIO.parse(...)' used to loop over all the records in a file as
'SeqRecord' objects. The 'Bio.SeqIO' module has a sister function for
use on files which contain just one record which we'll use here (see
Chapter 5 for details):
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.fna"), "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|',
name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar
Microtus ... sequence',
dbxrefs=[])
1895
Now, let's have a look at the key attributes of this 'SeqRecord'
individually -- starting with the 'seq' attribute which gives you a
'Seq' object:
<<>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet())

Here 'Bio.SeqIO' has defaulted to a generic alphabet, rather than
guessing that this is DNA. If you know in advance what kind of sequence
your FASTA file contains, you can tell 'Bio.SeqIO' which alphabet to use
(see Chapter 5).
1907
Next, the identifiers and description:
<<>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ...
pPCP1, complete sequence'

As you can see above, the first word of the FASTA record's title line
(after removing the greater than symbol) is used for both the 'id' and
'name' attributes. The whole title line (after removing the greater than
symbol) is used for the record description. This is deliberate, partly
for backwards compatibility reasons, but it also makes sense if you have
a FASTA file like this:
<<>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...
1928
Note that none of the other annotation attributes get populated when
reading a FASTA file:
<<>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]

In this case our example FASTA file was from the NCBI, and they have a
fairly well defined set of conventions for formatting their FASTA lines.
This means it would be possible to parse this information and extract
the GI number and accession for example. However, FASTA files from other
sources vary, so this isn't possible in general.
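For instance, the pipe-delimited identifier shown above follows the NCBI
convention, and could be picked apart with plain string handling. This
is just a sketch of that convention (the helper name is made up, and
Biopython does not do this parsing for you):

```python
def parse_ncbi_fasta_id(first_word):
    # NCBI FASTA title lines start with pipe-separated fields, e.g.
    # "gi|45478711|ref|NC_005816.1|" holds a GI number and an
    # accession.version identifier.
    fields = first_word.strip("|").split("|")
    if fields[0] != "gi":
        raise ValueError("Not a GI-style NCBI identifier")
    gi_number = fields[1]
    accession = fields[3]
    return gi_number, accession
```

As the text says, this only works for sources that follow the NCBI
convention; a FASTA file from elsewhere may use any format at all for
its title lines.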
1947
4.2.3 SeqRecord objects from GenBank files
===========================================

As in the previous example, we're going to look at the whole sequence
for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally
downloaded from the NCBI, but this time as a GenBank file. Again, this
file is included with the Biopython unit tests under the GenBank folder,
or online as NC_005816.gb (3) from our website.
This file contains a single record (i.e. only one LOCUS line) and
starts:
<<LOCUS       NC_005816     9609 bp    DNA     circular BCT
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
            complete sequence.
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...
1968
Again, we'll use 'Bio.SeqIO' to read this file in, and the code is
almost identical to that used above for the FASTA file (see
Chapter 5 for details):
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"), "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=['Project:10638'])
1982
You should be able to spot some differences already! But taking the
attributes individually, the sequence string is the same as before, but
this time 'Bio.SeqIO' has been able to automatically assign a more
specific alphabet (see Chapter 5 for details):
<<>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA())
1991
The 'name' comes from the LOCUS line, while the 'id' includes the
version suffix. The description comes from the DEFINITION line:
<<>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.'
2002
GenBank files don't have any per-letter annotations:
<<>>> record.letter_annotations
{}

Most of the annotations information gets recorded in the 'annotations'
dictionary, for example:
<<>>> len(record.annotations)
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

The 'dbxrefs' list gets populated from any PROJECT or DBLINK lines:
<<>>> record.dbxrefs
['Project:10638']

Finally, and perhaps most interestingly, all the entries in the
features table (e.g. the genes or CDS features) get recorded as
'SeqFeature' objects in the 'features' list.
<<>>> len(record.features)

We'll talk about 'SeqFeature' objects next, in Section 4.3.
2030
4.3 SeqFeature objects
*=*=*=*=*=*=*=*=*=*=*=*

Sequence features are an essential part of describing a sequence. Once
you get beyond the sequence itself, you need some way to organize and
easily get at the more "abstract" information that is known about the
sequence. While it is probably impossible to develop a general sequence
feature class that will cover everything, the Biopython 'SeqFeature'
class attempts to encapsulate as much of the information about the
sequence as possible. The design is heavily based on the GenBank/EMBL
feature tables, so if you understand how they look, you'll probably have
an easier time grasping the structure of the Biopython classes.

4.3.1 SeqFeatures themselves
=============================

The first level of dealing with sequence features is the 'SeqFeature'
class itself. This class has a number of attributes, so first we'll list
them and their general features, and then work through an example to
show how this applies to a real life example, a GenBank feature table.
The attributes of a 'SeqFeature' are:
2052
The attributes of a SeqFeature are:
2055
location -- The location of the 'SeqFeature' on the sequence that you
2056
are dealing with. The locations end-points may be fuzzy --
2057
section 4.3.2 has a lot more description on how to deal with
2060
type -- This is a textual description of the type of feature (for
2061
instance, this will be something like `CDS' or `gene').
2063
ref -- A reference to a different sequence. Some times features may be
2064
"on" a particular sequence, but may need to refer to a different
2065
sequence, and this provides the reference (normally an accession
2066
number). A good example of this is a genomic sequence that has most
2067
of a coding sequence, but one of the exons is on a different
2068
accession. In this case, the feature would need to refer to this
2069
different accession for this missing exon. You are most likely to see
2070
this in contig GenBank files.
2072
ref_db -- This works along with 'ref' to provide a cross sequence
2073
reference. If there is a reference, 'ref_db' will be set as None if
2074
the reference is in the same database, and will be set to the name of
2075
the database otherwise.
2077
strand -- The strand on the sequence that the feature is located on.
2078
This may either be 1 for the top strand, -1 for the bottom strand, or
2079
0 or None for both strands (or if it doesn't matter). Keep in mind
2080
that this only really makes sense for double stranded DNA, and not
2081
for proteins or RNA.
2083
qualifiers -- This is a Python dictionary of additional information
2084
about the feature. The key is some kind of terse one-word description
2085
of what the information contained in the value is about, and the
2086
value is the actual information. For example, a common key for a
2087
qualifier might be "evidence" and the value might be "computational
2088
(non-experimental)." This is just a way to let the person who is
2089
looking at the feature know that it has not be experimentally
2090
(i. e. in a wet lab) confirmed. Note that other the value will be a
2091
list of strings (even when there is only one string). This is a
2092
reflection of the feature tables in GenBank/EMBL files.
2094
sub_features -- A very important feature of a feature is that it can
2095
have additional 'sub_features' underneath it. This allows nesting of
2096
features, and helps us to deal with things such as the GenBank/EMBL
2097
feature lines in a (we hope) intuitive way.
2099
To show an example of SeqFeatures in action, let's take a look at the
following feature from a GenBank feature table:
<<     mRNA            complement(join(<49223..49300,49780..>50208))
                     /gene="F28B23.12"

To look at the easiest attributes of the 'SeqFeature' first, if you
got a 'SeqFeature' object for this it would have a 'type' of 'mRNA', a
'strand' of -1 (due to the `complement'), and would have None for the
'ref' and 'ref_db' since there are no references to external databases.
The 'qualifiers' for this SeqFeature would be a Python dictionary that
looked like '{'gene' : ['F28B23.12']}'.
2111
Now let's look at the more tricky part, how the `join' in the location
line is handled. First, the location for the top level 'SeqFeature' (the
one we are dealing with right now) is set as going from `<49223' to
`>50208' (see section 4.3.2 for the nitty gritty on how fuzzy locations
like this are handled). So the location of the top level object is the
entire span of the feature. So, how do you get at the information in the
`join'? Well, that's where the 'sub_features' come in.
The 'sub_features' attribute will have a list with two 'SeqFeature'
objects in it, and these contain the information in the join. Let's look
at 'top_level_feature.sub_features[0]' (the first 'sub_feature'). This
object is a 'SeqFeature' object with a 'type' of 'mRNA', a 'strand' of
-1 (inherited from the parent 'SeqFeature') and a location going from
`<49223' to `49300'.
So, the 'sub_features' allow you to get at the internal information if
you want it (i.e. if you were trying to get only the exons out of a
genomic sequence), or just to deal with the broad picture (i.e. you
just want to know that the coding sequence for a gene lies in a region).
Hopefully this structuring makes it easy and intuitive to get at the
sometimes complex information that can be contained in a 'SeqFeature'.

4.3.2 Positions and locations
==============================
In the section on SeqFeatures above, we skipped over one of the more
difficult parts of features, dealing with the locations. The reason this
can be difficult is because of fuzziness of the positions in locations.
Before we get into all of this, let's just define the vocabulary we'll
use to talk about this. Basically there are two terms we'll use:

position -- This refers to a single position on a sequence, which may
be fuzzy or not. For instance, 5, 20, '<100' and '3^5' are all
positions.

location -- A location is two positions that defines a region of a
sequence. For instance 5..20 (i.e. 5 to 20) is a location.

I just mention this because sometimes I get confused between the two.
The complication in dealing with locations comes in the positions
themselves. In biology many times things aren't entirely certain (as
much as us wet lab biologists try to make them certain!). For instance,
you might do a dinucleotide priming experiment and discover that the
start of mRNA transcript starts at one of two sites. This is very useful
information, but the complication comes in how to represent this as a
position. To help us deal with this, we have the concept of fuzzy
positions. Basically there are five types of fuzzy positions, so we have
five classes to deal with them:
2161
ExactPosition -- As its name suggests, this class represents a
position which is specified as exact along the sequence. This is
represented as just a number, and you can get the position by
looking at the 'position' attribute of the object.

BeforePosition -- This class represents a fuzzy position that occurs
prior to some specified site. In GenBank/EMBL notation, this is
represented as something like `<13', signifying that the real
position is located somewhere less than 13. To get the specified
upper boundary, look at the 'position' attribute of the object.

AfterPosition -- Contrary to 'BeforePosition', this class represents a
position that occurs after some specified site. This is represented
in GenBank as `>13', and like 'BeforePosition', you get the
boundary number by looking at the 'position' attribute of the object.

WithinPosition -- This class models a position which occurs somewhere
between two specified nucleotides. In GenBank/EMBL notation, this
would be represented as `(1.5)', to represent that the position is
somewhere within the range 1 to 5. To get the information in this
class you have to look at two attributes. The 'position' attribute
specifies the lower boundary of the range we are looking at, so in
our example case this would be one. The 'extension' attribute
specifies the range to the higher boundary, so in this case it would
be 4. So 'object.position' is the lower boundary and 'object.position
+ object.extension' is the upper boundary.

BetweenPosition -- This class deals with a position that occurs
between two coordinates. For instance, you might have a protein
binding site that occurs between two nucleotides on a sequence. This
is represented as `2^3', which indicates that the real position
happens between position 2 and 3. Getting this information from the
object is very similar to 'WithinPosition': the 'position' attribute
specifies the lower boundary (2, in this case) and the 'extension'
indicates the range to the higher boundary (1 in this case).
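The position/extension arithmetic shared by 'WithinPosition' and
'BetweenPosition' can be sketched in plain Python. This toy class is
purely illustrative; it is not how Biopython defines these classes:

```python
class FuzzyRange(object):
    """Toy stand-in for WithinPosition/BetweenPosition.

    Stores the lower boundary ('position') plus the offset up to the
    higher boundary ('extension'), as described in the text above.
    """

    def __init__(self, position, extension):
        self.position = position    # lower boundary
        self.extension = extension  # range up to the higher boundary

    @property
    def upper(self):
        # object.position + object.extension is the upper boundary
        return self.position + self.extension

# (1.5) in GenBank notation: somewhere within the range 1 to 5
within = FuzzyRange(1, 4)
# 2^3 in GenBank notation: between positions 2 and 3
between = FuzzyRange(2, 1)
```

Here 'within.upper' is 5 and 'between.upper' is 3, matching the
boundaries given in the two descriptions above.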
2197
Now that we've got all of the types of fuzzy positions we can have
taken care of, we are ready to actually specify a location on a
sequence. This is handled by the 'FeatureLocation' class. An object of
this type basically just holds the potentially fuzzy start and end
positions of a feature. You can create a 'FeatureLocation' object by
creating the positions and passing them in:
<<>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(8, 1)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
2209
If you print out a 'FeatureLocation' object, you can get a nice
representation of the information:
<<>>> print my_location
[>5:(8^9)]

We can access the fuzzy start and end positions using the start and
end attributes of the location:
<<>>> my_location.start
Bio.SeqFeature.AfterPosition(5)
>>> print my_location.start
>5
>>> my_location.end
Bio.SeqFeature.BetweenPosition(8,1)
>>> print my_location.end
(8^9)

If you don't want to deal with fuzzy positions and just want numbers,
you just need to ask for the 'nofuzzy_start' and 'nofuzzy_end'
attributes of the location:
<<>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
8
2236
Notice that this just gives you back the position attributes of the
fuzzy positions.
Similarly, to make it easy to create a position without worrying about
fuzzy positions, you can just pass in numbers to the 'FeatureLocation'
constructor, and you'll get back out 'ExactPosition' objects:
<<>>> exact_location = SeqFeature.FeatureLocation(5, 8)
>>> print exact_location
[5:8]
>>> exact_location.start
Bio.SeqFeature.ExactPosition(5)
>>> exact_location.nofuzzy_start
5

That is all of the nitty gritty about dealing with fuzzy positions in
Biopython. It has been designed so that dealing with fuzziness is not
that much more complicated than dealing with exact positions, and
hopefully you find that true!

4.4 References
*=*=*=*=*=*=*=*
Another common annotation related to a sequence is a reference to a
journal or other published work dealing with the sequence. We have a
fairly simple way of representing a Reference in Biopython -- we have a
'Bio.SeqFeature.Reference' class that stores the relevant information
about a reference as attributes of an object.
The attributes include things that you would expect to see in a
reference like 'journal', 'title' and 'authors'. Additionally, it also
can hold the 'medline_id' and 'pubmed_id' and a 'comment' about the
reference. These are all accessed simply as attributes of the object.
A reference also has a 'location' object so that it can specify a
particular location on the sequence that the reference refers to. For
instance, you might have a journal article dealing with a particular
gene located on a BAC, and want to specify that it only refers to this
position exactly. The 'location' is a potentially fuzzy location, as
described in section 4.3.2.
Any reference objects are stored as a list in the 'SeqRecord' object's
'annotations' dictionary under the key "references". That's all there is
to it. References are meant to be easy to deal with, and hopefully
general enough to cover lots of usage cases.
2281
4.5 The format method
*=*=*=*=*=*=*=*=*=*=*=

Biopython 1.48 added a new 'format()' method to the 'SeqRecord' class
which gives a string containing your record formatted using one of the
output file formats supported by 'Bio.SeqIO', such as FASTA:
<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                      +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                      +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")

print record.format("fasta")

This should give:
<<>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC

This 'format' method takes a single mandatory argument, a lower case
string which is supported by 'Bio.SeqIO' as an output format (see
Chapter 5). However, some of the file formats 'Bio.SeqIO' can write to
require more than one record (typically the case for multiple sequence
alignment formats), and thus won't work via this 'format()' method.
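The FASTA output above can be approximated in plain Python. This is just
a sketch of the line-wrapping convention (the function name is made up),
not how 'Bio.SeqIO' actually implements its FASTA writer:

```python
def to_fasta(identifier, description, sequence, width=60):
    # Title line: ">" + id + space + description, then the sequence
    # wrapped at a fixed line width (60 is the conventional choice,
    # matching the output shown above).
    lines = [">%s %s" % (identifier, description)]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"
```

For example, 'to_fasta("AC12345", "made up sequence", "GATC" * 20)'
gives a title line followed by a 60-letter line and a 20-letter line.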
2322
4.6 Slicing a SeqRecord
*=*=*=*=*=*=*=*=*=*=*=*=

One of the new features in Biopython 1.50 was the ability to slice a
'SeqRecord', to give you a new 'SeqRecord' covering just part of the
sequence. What is important here is that any per-letter annotations are
also sliced, and any features which fall completely within the new
sequence are preserved (with their locations adjusted).
For example, taking the same GenBank file used earlier:
<<>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"), "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCC
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=['Project:10638'])
>>> len(record)
9609
>>> len(record.features)

For this example we're going to focus in on the 'pim' gene,
'YP_pPCP05'. If you have a look at the GenBank file directly you'll find
this gene/CDS has location string 4343..4780, or in Python counting
4342:4780. From looking at the file you can work out that these are the
twelfth and thirteenth entries in the file, so in Python zero-based
counting they are entries 11 and 12 in the features list:
2353
<<>>> print record.features[11]
type: gene
location: [4342:4780]
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']

>>> print record.features[12]
type: CDS
location: [4342:4780]
strand: 1
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value:
    ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
2382
Let's slice this parent record from 4300 to 4800 (enough to include
the 'pim' gene/CDS), and see how many features we get:
<<>>> sub_record = record[4300:4800]
>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGG
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1,
complete sequence.',
dbxrefs=[])
>>> len(sub_record)
500
>>> len(sub_record.features)
2
2398
Our sub-record just has two features, the gene and CDS entries for
'YP_pPCP05':
<<>>> print sub_record.features[0]
type: gene
location: [42:480]
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']

>>> print sub_record.features[1]
type: CDS
location: [42:480]
strand: 1
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value:
    ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
2429
Notice that their locations have been adjusted to reflect the new
parent sequence!
While Biopython has done something sensible and hopefully intuitive
with the features (and any per-letter annotation), for the other
annotation it is impossible to know if this still applies to the
sub-sequence or not. To avoid guessing, the annotations and dbxrefs are
omitted from the sub-record, and it is up to you to transfer any
relevant information as appropriate.
<<>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]

The same point could be made about the record id, name and
description, but for practicality these are preserved:
<<>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.'
2454
This illustrates the problem nicely though: our new sub-record is not
the complete sequence of the plasmid, so the description is wrong! Let's
fix this and then view the sub-record as a reduced GenBank file using
the format method described above in Section 4.5:
<<>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print sub_record.format("genbank")
...

See Sections 14.1.4 and 14.1.5 for some FASTQ examples where the
per-letter annotations (the read quality scores) are also sliced.
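The feature-slicing rule described above (keep only features that fall
completely within the slice, shifting their coordinates) can be sketched
in plain Python. The tuples here are toy stand-ins for 'SeqFeature'
objects, and the function name is made up; this is not Biopython's
implementation:

```python
def slice_features(features, start, end):
    # Keep only features that fall completely inside [start:end),
    # shifting their coordinates onto the new parent sequence.
    # Each feature is just a (feat_start, feat_end, feat_type) tuple.
    kept = []
    for feat_start, feat_end, feat_type in features:
        if start <= feat_start and feat_end <= end:
            kept.append((feat_start - start, feat_end - start, feat_type))
    return kept

features = [(100, 400, "gene"), (4342, 4780, "gene"), (4342, 4780, "CDS")]
sub_features = slice_features(features, 4300, 4800)
```

Slicing from 4300 to 4800 keeps just the two 'pim' entries, moved to
[42:480], matching the adjusted locations shown in the output above.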
2466
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.ht
(2) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna
(3) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb
2477
Chapter 5 Sequence Input/Output
*******************************

In this chapter we'll discuss in more detail the 'Bio.SeqIO' module,
which was briefly introduced in Chapter 2 and also used in Chapter 4.
This aims to provide a simple interface for working with assorted
sequence file formats in a uniform way. See also the 'Bio.SeqIO' wiki
page (http://biopython.org/wiki/SeqIO), and the built in documentation:
<<>>> from Bio import SeqIO
>>> help(SeqIO)
...

The "catch" is that you have to work with 'SeqRecord' objects (see
Chapter 4), which contain a 'Seq' object (see Chapter 3) plus annotation
like an identifier and description.
2496
5.1 Parsing or Reading Sequences
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

The workhorse function 'Bio.SeqIO.parse()' is used to read in sequence
data as 'SeqRecord' objects. This function expects two arguments:

1. The first argument is a handle to read the data from. A handle is
typically a file opened for reading, but could be the output from a
command line program, or data downloaded from the internet (see
Section 5.2). See Section 18.1 for more about handles.

2. The second argument is a lower case string specifying sequence
format -- we don't try and guess the file format for you! See
http://biopython.org/wiki/SeqIO for a full listing of supported
formats.

As of Biopython 1.49 there is an optional argument 'alphabet' to
specify the alphabet to be used. This is useful for file formats like
FASTA where otherwise 'Bio.SeqIO' will default to a generic alphabet.
The 'Bio.SeqIO.parse()' function returns an iterator which gives
'SeqRecord' objects. Iterators are typically used in a for loop as shown
below.
Sometimes you'll find yourself dealing with files which contain only a
single record. For this situation Biopython 1.45 introduced the function
'Bio.SeqIO.read()' which takes the same arguments. Provided there is one
and only one record in the file, this is returned as a 'SeqRecord'
object. Otherwise an exception is raised.
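The contract of 'Bio.SeqIO.read()' (exactly one record, otherwise an
error) can be sketched in plain Python over any iterator. This is
illustrative only (the function name is made up); the real function
also handles the file parsing itself:

```python
def read_one(records):
    # Return the single record from an iterator of records, raising
    # an error if there are zero records or more than one -- the same
    # contract the text describes for Bio.SeqIO.read().
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(iterator)
    except StopIteration:
        return first  # exactly one record, as required
    raise ValueError("More than one record found in handle")
```

This is why 'read()' suits single-record files like our NC_005816
examples, while multi-record files such as ls_orchid.gbk need 'parse()'.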
2526
5.1.1 Reading Sequence Files
=============================

In general 'Bio.SeqIO.parse()' is used to read in sequence files as
'SeqRecord' objects, and is typically used with a for loop like this:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
for seq_record in SeqIO.parse(handle, "fasta") :
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)
handle.close()

The above example is repeated from the introduction in Section 2.4,
and will load the orchid DNA sequences in the FASTA format file
ls_orchid.fasta (2). If instead you wanted to load a GenBank format file
like ls_orchid.gbk (3) then all you need to do is change the filename
and the format string:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
for seq_record in SeqIO.parse(handle, "genbank") :
    print seq_record.id
    print seq_record.seq
    print len(seq_record)
handle.close()
2554
Similarly, if you wanted to read in a file in another file format,
2555
then assuming 'Bio.SeqIO.parse()' supports it you would just need to
2556
change the format string as appropriate, for example "swiss" for
2557
SwissProt files or "embl" for EMBL text files. There is a full listing
2558
on the wiki page (http://biopython.org/wiki/SeqIO) and in the built in
2559
documentation (also online (4)).
2560
Another very common way to use a Python iterator is within a list
comprehension (or a generator expression). For example, if all you
wanted to extract from the file was a list of the record identifiers we
can easily do this with the following list comprehension:
<<>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse(handle, "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1',
'Z78527.1', ..., 'Z78439.1']
There are more examples using 'SeqIO.parse()' in a list comprehension
like this in Section 14.2 (e.g. for plotting sequence lengths or GC%).
5.1.2 Iterating over the records in a sequence file
====================================================
In the above examples, we have usually used a for loop to iterate over
all the records one by one. You can use the for loop with all sorts of
Python objects (including lists, tuples and strings) which support the
iteration interface.
The object returned by 'Bio.SeqIO' is actually an iterator which returns
'SeqRecord' objects. You get to see each record in turn, but once and
only once. The plus point is that an iterator can save you memory when
dealing with large files.
Instead of using a for loop, you can also use the '.next()' method of an
iterator to step through the entries, like this:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")

first_record = record_iterator.next()
print first_record.id
print first_record.description

second_record = record_iterator.next()
print second_record.id
print second_record.description
Note that if you try and use '.next()' and there are no more results,
you'll either get back the special Python object 'None' or a
'StopIteration' exception.
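This behaviour is just standard Python iteration, which you can see
without Biopython at all. A minimal sketch (the list of strings here is
a stand-in for 'SeqRecord' objects; the built in 'next()' function used
below is the modern equivalent of calling the iterator's '.next()'
method):

```python
# Plain Python illustration of the iterator behaviour described above.
record_ids = ["Z78533.1", "Z78532.1"]  # stand-ins for SeqRecord objects
record_iterator = iter(record_ids)

first_record = next(record_iterator)   # "Z78533.1"
second_record = next(record_iterator)  # "Z78532.1"

# Stepping past the last entry raises StopIteration:
try:
    next(record_iterator)
except StopIteration:
    print("No more records")
```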
One special case to consider is when your sequence files have multiple
records, but you only want the first one. In this situation the
following code is very concise:
<<from Bio import SeqIO
first_record = SeqIO.parse(open("ls_orchid.gbk"), "genbank").next()

A word of warning here -- using the '.next()' method like this will
silently ignore any additional records in the file. If your files have
one and only one record, like some of the online examples later in this
chapter, or a GenBank file for a single chromosome, then use the new
'Bio.SeqIO.read()' function instead. This will check there are no extra
unexpected records present.
5.1.3 Getting a list of the records in a sequence file
=======================================================
In the previous section we talked about the fact that
'Bio.SeqIO.parse()' gives you a 'SeqRecord' iterator, and that you get
the records one by one. Very often you need to be able to access the
records in any order. The Python 'list' data type is perfect for this,
and we can turn the record iterator into a list of 'SeqRecord' objects
using the built-in Python function 'list()' like so:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()

print "Found %i records" % len(records)

print "The last record"
last_record = records[-1] #using Python's list tricks
print last_record.id
print repr(last_record.seq)
print len(last_record)

print "The first record"
first_record = records[0] #remember, Python counts from zero
print first_record.id
print repr(first_record.seq)
print len(first_record)
<<Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
IUPACAmbiguousDNA())
...
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
You can of course still use a for loop with a list of 'SeqRecord'
objects. Using a list is much more flexible than an iterator (for
example, you can determine the number of records from the length of the
list), but does need more memory because it will hold all the records in
memory at once.
5.1.4 Extracting data
======================
The 'SeqRecord' object and its annotation structures are described more
fully in Chapter 4. As an example of how annotations are stored, we'll
look at the output from parsing the first record in the GenBank file
ls_orchid.gbk (5).
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
first_record = record_iterator.next()
print first_record
That should give something like this:
<<Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ...]
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', ...]
/accessions=['Z78533']
/data_file_division=PLN
/organism=Cypripedium irapeanum
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
This gives a human readable summary of most of the annotation data for
the 'SeqRecord'. For this example we're going to use the '.annotations'
attribute which is just a Python dictionary. The contents of this
annotations dictionary were shown when we printed the record above. You
can also print them out directly:
<<print first_record.annotations

Like any Python dictionary, you can easily get a list of the keys:
<<print first_record.annotations.keys()

And the values:
<<print first_record.annotations.values()
In general, the annotation values are strings, or lists of strings. One
special case is any references in the file get stored as reference
objects.
Suppose you wanted to extract a list of the species from the
ls_orchid.gbk (6) GenBank file. The information we want, Cypripedium
irapeanum, is held in the annotations dictionary under `source' and
`organism', which we can access like this:
<<>>> print first_record.annotations["source"]
Cypripedium irapeanum

<<>>> print first_record.annotations["organism"]
Cypripedium irapeanum

In general, `organism' is used for the scientific name (in Latin, e.g.
Arabidopsis thaliana), while `source' will often be the common name
(e.g. thale cress). In this example, as is often the case, the two
fields are identical.
Now let's go through all the records, building up a list of the species
each orchid sequence is from:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
all_species = []
for seq_record in SeqIO.parse(handle, "genbank") :
    all_species.append(seq_record.annotations["organism"])
print all_species
Another way of writing this code is to use a list comprehension:
<<from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in \
               SeqIO.parse(open("ls_orchid.gbk"), "genbank")]
print all_species

In either case, the result is:
<<['Cypripedium irapeanum', 'Cypripedium californicum', ...,
'Paphiopedilum barbatum']
Great. That was pretty easy because GenBank files are annotated in a
standardised way.
Now, let's suppose you wanted to extract a list of the species from a
FASTA file, rather than the GenBank file. The bad news is you will have
to write some code to extract the data you want from the record's
description line - if the information is in the file in the first place!
Our example FASTA format file ls_orchid.fasta (7) starts like this:
<<>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
You can check by hand, but for every record the species name is in the
description line as the second word. This means if we break up each
record's '.description' at the spaces, then the species is there as
field number one (field zero is the record identifier). That means we
can do the following:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
all_species = []
for seq_record in SeqIO.parse(handle, "fasta") :
    all_species.append(seq_record.description.split()[1])
print all_species
<<['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ...]
The concise alternative using list comprehensions would be:
<<from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in \
               SeqIO.parse(open("ls_orchid.fasta"), "fasta")]
In general, extracting information from the FASTA description line is
not very nice. If you can get your sequences in a well annotated file
format like GenBank or EMBL, then this sort of annotation information is
much easier to deal with.
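The string handling involved is plain Python. A minimal sketch of the
same idea without Biopython, using the FASTA title line shown above
(grab the second whitespace-separated word of each line starting with
">"):

```python
# Extract species names from FASTA title lines with plain Python.
fasta_text = """\
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
"""
all_species = [line.split()[1]
               for line in fasta_text.splitlines()
               if line.startswith(">")]
print(all_species)  # ['C.irapeanum']
```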
5.2 Parsing sequences from the net
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
In the previous section, we looked at parsing sequence data from a file
handle. We hinted that handles are not always from files, and in this
section we'll use handles to internet connections to download sequence
data.
Note that just because you can download sequence data and parse it into
a 'SeqRecord' object in one go doesn't mean this is a good idea. In
general, you should probably download sequences once and save them to a
file for reuse.
5.2.1 Parsing GenBank records from the net
===========================================
Section 8.6 talks about the Entrez EFetch interface in more detail, but
for now let's just connect to the NCBI and get a few Opuntia
(prickly-pear) sequences from GenBank using their GI numbers.
First of all, let's fetch just one record. If you don't care about the
annotations and features downloading a FASTA file is a good choice as
these are compact. Now remember, when you expect the handle to contain
one and only one record, use the 'Bio.SeqIO.read()' function:
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="fasta", id="6273291")
seq_record = SeqIO.read(handle, "fasta")
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))

The expected output of this example is:
<<gi|6273291|gb|AF191665.1|AF191665 with 0 features
The NCBI will also let you ask for the file in other formats, in
particular as a GenBank file. Until Easter 2009, the Entrez EFetch API
let you use "genbank" as the return type, however the NCBI now insist on
using the official return types of "gb" (or "gp" for proteins) as
described on EFetch for Sequence and other Molecular Biology
Databases (8). As a result, in Biopython 1.50 onwards, we support "gb"
as an alias for "genbank" in 'Bio.SeqIO'.
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="gb", id="6273291")
seq_record = SeqIO.read(handle, "gb") #using "gb" as an alias for "genbank"
handle.close()
print "%s with %i features" % (seq_record.id, len(seq_record.features))
The expected output of this example is:
<<AF191665.1 with 3 features

Notice this time we have three features.
Now let's fetch several records. This time the handle contains multiple
records, so we must use the 'Bio.SeqIO.parse()' function:
<<from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db="nucleotide", rettype="gb", \
                       id="6273291,6273290,6273289")
for seq_record in SeqIO.parse(handle, "gb") :
    print seq_record.id, seq_record.description[:50] + "..."
    print "Sequence length %i," % len(seq_record),
    print "%i features," % len(seq_record.features),
    print "from: %s" % seq_record.annotations["source"]
handle.close()
That should give the following output:
<<AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Opuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Opuntia bradtiana
See Chapter 8 for more about the 'Bio.Entrez' module, and make sure to
read about the NCBI guidelines for using Entrez (Section 8.1).
5.2.2 Parsing SwissProt sequences from the net
===============================================
Now let's use a handle to download a SwissProt file from ExPASy,
something covered in more depth in Chapter 9. As mentioned above, when
you expect the handle to contain one and only one record, use the
'Bio.SeqIO.read()' function:
<<from Bio import ExPASy
from Bio import SeqIO
handle = ExPASy.get_sprot_raw("O23729")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.name
print seq_record.description
print repr(seq_record.seq)
print "Length %i" % len(seq_record)
print seq_record.annotations["keywords"]
2914
Assuming your network connection is OK, you should get back:
2917
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName:
2918
Full=Naringenin-chalcone synthase 3;
2919
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE',
2922
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
5.3 Sequence files as Dictionaries
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
The next thing that we'll do with our ubiquitous orchid files is to show
how to index them and access them like a database using the Python
'dictionary' data type (like a hash in Perl). This is very useful for
moderately large files where you only need to access certain elements of
the file, and makes for a nice quick 'n dirty database.
You can use the function 'Bio.SeqIO.to_dict()' to make a SeqRecord
dictionary (in memory). By default this will use each record's
identifier (i.e. the '.id' attribute) as the key. Let's try this using
our GenBank file:
<<from Bio import SeqIO
handle = open("ls_orchid.gbk")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "genbank"))
handle.close()
Since this variable 'orchid_dict' is an ordinary Python dictionary, we
can look at all of the keys we have available:
<<>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1',
'Z78453.1', ..., 'Z78471.1']
We can access a single 'SeqRecord' object via the keys and manipulate
the object as normal:
<<>>> seq_record = orchid_dict["Z78475.1"]
>>> print seq_record.description
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print repr(seq_record.seq)
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT',
IUPACAmbiguousDNA())
So, it is very easy to create an in memory "database" of our GenBank
records. Next we'll try this for the FASTA file instead.
Note that those of you with prior Python experience should all be able
to construct a dictionary like this "by hand". However, typical
dictionary construction methods will not deal with the case of repeated
keys very nicely. Using the 'Bio.SeqIO.to_dict()' will explicitly check
for duplicate keys, and raise an exception if any are found.
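The duplicate-key check is easy to sketch in plain Python. Here
'records_to_dict' and the little dictionaries standing in for
'SeqRecord' objects are hypothetical, just an illustration of the idea,
not the Biopython implementation:

```python
def records_to_dict(records, key_function):
    """Build a dict of records, raising ValueError on duplicate keys."""
    d = {}
    for rec in records:
        key = key_function(rec)
        if key in d:
            raise ValueError("Duplicate key %r" % key)
        d[key] = rec
    return d

records = [{"id": "Z78533.1"}, {"id": "Z78532.1"}]
d = records_to_dict(records, key_function=lambda rec: rec["id"])
print(sorted(d))  # ['Z78532.1', 'Z78533.1']
```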
5.3.1 Specifying the dictionary keys
=====================================
Using the same code as above, but for the FASTA file instead:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
print orchid_dict.keys()
This time the keys are:
<<['gi|2765596|emb|Z78471.1|PDZ78471',
'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488',
'gi|2765583|emb|Z78458.1|PHZ78458']
You should recognise these strings from when we parsed the FASTA file
earlier in Section 2.4.1. Suppose you would rather have something else
as the keys - like the accession numbers. This brings us nicely to
'SeqIO.to_dict()''s optional argument 'key_function', which lets you
define what to use as the dictionary key for your records.
First you must write your own function to return the key you want (as a
string) when given a 'SeqRecord' object. In general, the details of this
function will depend on the sort of input records you are dealing with.
But for our orchids, we can just split up the record's identifier using
the "pipe" character (the vertical line) and return the fourth entry
(field three):
<<def get_accession(record) :
    """Given a SeqRecord, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = record.id.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]
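You can check the splitting logic with plain Python, using the
identifier shown in the docstring:

```python
# Check the pipe-splitting logic on one of the identifiers shown above.
record_id = "gi|2765613|emb|Z78488.1|PTZ78488"
parts = record_id.split("|")
assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
print(parts[3])  # Z78488.1
```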
Then we can give this function to the 'SeqIO.to_dict()' function to use
in building the dictionary:
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"),
                            key_function=get_accession)
handle.close()
print orchid_dict.keys()
Finally, as desired, the new dictionary keys:
<<>>> print orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1',
'Z78453.1', ..., 'Z78471.1']

Not too complicated, I hope!
5.3.2 Indexing a dictionary using the SEGUID checksum
======================================================
To give another example of working with dictionaries of 'SeqRecord'
objects, we'll use the SEGUID checksum function. This is a relatively
recent checksum, and collisions should be very rare (i.e. two different
sequences giving the same checksum), an improvement on the older CRC64
checksum. Once again, working with the orchids GenBank file:
<<from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse(open("ls_orchid.gbk"), "genbank") :
    print record.id, seguid(record.seq)
This should give:
<<Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo
Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY
...
Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ
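For reference, the SEGUID checksum is defined as the base64-encoded
SHA-1 digest of the upper-case sequence, with the trailing '=' padding
removed. A standard-library sketch of this definition (an illustration,
not the 'Bio.SeqUtils.CheckSum.seguid' implementation itself):

```python
import base64
import hashlib

def seguid_sketch(sequence):
    """Base64 of the SHA-1 digest of the upper-case sequence,
    with the trailing '=' padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")

# A 20 byte SHA-1 digest always gives a 27 character checksum,
# and the checksum is case insensitive:
checksum = seguid_sketch("cgtaacaaggtttccgtaggtgaacctgcgg")
print(len(checksum))  # 27
print(checksum == seguid_sketch("CGTAACAAGGTTTCCGTAGGTGAACCTGCGG"))  # True
```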
Now, recall the 'Bio.SeqIO.to_dict()' function's 'key_function' argument
expects a function which turns a 'SeqRecord' into a string. We can't use
the 'seguid()' function directly because it expects to be given a 'Seq'
object (or a string). However, we can use Python's 'lambda' feature to
create a "one off" function to give to 'Bio.SeqIO.to_dict()' instead:
<<from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
seguid_dict = SeqIO.to_dict(SeqIO.parse(open("ls_orchid.gbk"), "genbank"),
                            lambda rec : seguid(rec.seq))
record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
print record.id
print record.description

That should have retrieved the record Z78532.1, the second entry in the
file.
5.4 Writing Sequence Files
*=*=*=*=*=*=*=*=*=*=*=*=*=*
We've talked about using 'Bio.SeqIO.parse()' for sequence input (reading
files), and now we'll look at 'Bio.SeqIO.write()' which is for sequence
output (writing files). This is a function taking three arguments: some
'SeqRecord' objects, a handle to write to, and a sequence format string.
Here is an example, where we start by creating a few 'SeqRecord' objects
the hard way (by hand, rather than by loading them from a file):
<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                    +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                    +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                    +"SSAC", generic_protein),
                 id="gi|14150838|gb|AAK54648.1|AF376133_1",
                 description="chalcone synthase [Cucumis sativus]")

rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" \
                    +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
                 id="gi|13919613|gb|AAK33142.1|",
                 description="chalcone synthase [Fragaria vesca subsp. bracteata]")

rec3 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC" \
                    +"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP" \
                    +"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN" \
                    +"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV" \
                    +"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW" \
                    +"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT" \
                    +"TGEGLEWGVLFGFGPGLTVETVVLHSVAT", generic_protein),
                 id="gi|13925890|gb|AAK49457.1|",
                 description="chalcone synthase [Nicotiana tabacum]")

my_records = [rec1, rec2, rec3]
Now we have a list of 'SeqRecord' objects, we'll write them to a FASTA
format file:
<<from Bio import SeqIO
handle = open("my_example.faa", "w")
SeqIO.write(my_records, handle, "fasta")
handle.close()
And if you open this file in your favourite text editor it should look
like this:
<<>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT
Suppose you wanted to know how many records the 'Bio.SeqIO.write()'
function wrote to the handle? If your records were in a list you could
just use 'len(my_records)', however you can't do that when your records
come from a generator/iterator. Therefore as of Biopython 1.49, the
'Bio.SeqIO.write()' function returns the number of 'SeqRecord' objects
written to the file.
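The FASTA layout shown above (a ">" title line, then the sequence
wrapped at 60 characters per line) is simple enough to sketch by hand.
'format_fasta' below is a hypothetical helper for illustration, not the
'Bio.SeqIO' writer:

```python
def format_fasta(title, sequence, width=60):
    """Format one record FASTA-style: '>' plus the title line,
    then the sequence wrapped at 'width' characters per line."""
    lines = [">" + title]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"

print(format_fasta("example_id example description", "ACGT" * 20))
```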
5.4.1 Converting between sequence file formats
===============================================
In the previous example we used a list of 'SeqRecord' objects as input
to the 'Bio.SeqIO.write()' function, but it will also accept a
'SeqRecord' iterator like we get from 'Bio.SeqIO.parse()' -- this lets
us do file conversion very succinctly. For this example we'll read in
the GenBank format file ls_orchid.gbk (9) and write it out in FASTA
format:
<<from Bio import SeqIO
in_handle = open("ls_orchid.gbk", "r")
out_handle = open("my_example.fasta", "w")
records = SeqIO.parse(in_handle, "genbank")
SeqIO.write(records, out_handle, "fasta")
in_handle.close()
out_handle.close()
In principle, just by changing the filenames and the format names, this
code could be used to convert between any file formats available in
Biopython. However, writing some formats requires information (e.g.
quality scores) which other file formats don't contain. For example,
while you can turn a FASTQ file into a FASTA file, you can't do the
reverse. See also Section 14.1.6 in the cookbook chapter which looks at
inter-converting between different FASTQ formats.
You can simplify this by being lazy about closing the input file
handles. This is arguably bad style, but it is more concise. Note that
you should always close your output file handles as if you don't, your
file may not get flushed to disk immediately.
Alternatively, Python 2.6 includes 'with' as a new keyword (which can
also be enabled on Python 2.5):
<<from __future__ import with_statement #Needed on Python 2.5
from Bio import SeqIO

with open("ls_orchid.gbk") as in_handle :
    with open("my_example.fasta", "w") as out_handle :
        SeqIO.write(SeqIO.parse(in_handle, "genbank"), out_handle, "fasta")

Behind the scenes this will automatically close the file handles
(because the file objects are aware of the 'with' statement).
5.4.2 Converting a file of sequences to their reverse complements
==================================================================
Suppose you had a file of nucleotide sequences, and you wanted to turn
it into a file containing their reverse complement sequences. This time
a little bit of work is required to transform the 'SeqRecord' objects we
get from our input file into something suitable for saving to our
output file.
To start with, we'll use 'Bio.SeqIO.parse()' to load some nucleotide
sequences from a file, then print out their reverse complements using
the 'Seq' object's built in '.reverse_complement()' method (see
Section 3.6):
<<from Bio import SeqIO
in_handle = open("ls_orchid.gbk")
for record in SeqIO.parse(in_handle, "genbank") :
    print record.id
    print record.seq.reverse_complement()
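For plain strings, reverse complementing can be sketched with the
standard library (an illustration of the operation, not the 'Seq'
object's method; 'str.maketrans' is the modern spelling, Python 2 used
'string.maketrans'):

```python
# Map each base to its complement, then reverse the whole string.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Reverse complement of a plain DNA string."""
    return sequence.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```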
Now, if we want to save these reverse complements to a file, we'll need
to make 'SeqRecord' objects. For this I think it's most elegant to write
our own function, where we can decide how to name our new records:
<<from Bio.SeqRecord import SeqRecord

def make_rc_record(record) :
    """Returns a new SeqRecord with the reverse complement sequence."""
    return SeqRecord(seq = record.seq.reverse_complement(), \
                     id = "rc_" + record.id, \
                     description = "reverse complement")
We can then use this to turn the input records into reverse complement
records ready for output. If you don't mind about having all the records
in memory at once, then the Python 'map()' function is a very intuitive
way to do this:
<<from Bio import SeqIO

in_handle = open("ls_orchid.fasta")
records = map(make_rc_record, SeqIO.parse(in_handle, "fasta"))
in_handle.close()

out_handle = open("rev_comp.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
This is an excellent place to demonstrate the power of list
comprehensions, which in their simplest form are a long-winded
equivalent to using 'map()', like this:
<<records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")]
Now list comprehensions have a nice trick up their sleeves, you can add
a conditional statement:
<<records = [make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")
            if len(rec)<700]
That would create an in memory list of reverse complement records where
the sequence length was under 700 base pairs. However, we can do exactly
the same with a generator expression - but with the advantage that this
does not create a list of all the records in memory at once:
<<records = (make_rc_record(rec) for rec in SeqIO.parse(in_handle, "fasta")
            if len(rec)<700)
If you don't mind being lax about closing input file handles, we have:
<<from Bio import SeqIO

records = (make_rc_record(rec) for rec in \
           SeqIO.parse(open("ls_orchid.fasta"), "fasta") \
           if len(rec)<700)
out_handle = open("rev_comp.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()

There is a related example in Section 14.1.2, translating each record in
a FASTA file from nucleotides to amino acids.
5.4.3 Getting your SeqRecord objects as formatted strings
==========================================================
Suppose that you don't really want to write your records to a file or
handle -- instead you want a string containing the records in a
particular file format. The 'Bio.SeqIO' interface is based on handles,
but Python has a useful built in module which provides a string based
handle, 'StringIO'.
For an example of how you might use this, let's load in a bunch of
'SeqRecord' objects from our orchids GenBank file, and create a string
containing the records in FASTA format:
<<from Bio import SeqIO
from StringIO import StringIO

records = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
fasta_data = out_handle.getvalue()
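The string-based handle itself is plain Python. A minimal sketch, using
'io.StringIO' (the modern home of this class; on Python 2 it lived in
the 'StringIO' module as used above):

```python
from io import StringIO

# Anything written to the handle can be read back as one string.
out_handle = StringIO()
out_handle.write(">Z78533.1\n")
out_handle.write("CGTAACAAGG\n")
data = out_handle.getvalue()
print(data)
```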
This isn't entirely straightforward the first time you see it! On the
bright side, for the special case where you would like a string
containing a single record in a particular file format, Biopython 1.48
added a new 'format()' method to the 'SeqRecord' class (see
Section 4.5).
Note that although we don't encourage it, you can use the 'format()'
method to write to a file, like this:
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
for record in record_iterator :
    out_handle.write(record.format("tab"))
out_handle.close()
While this style of code will work for a simple sequential file format
like FASTA or the simple tab separated format used in this example, it
will not work for more complex or interlaced file formats. This is why
we still recommend using 'Bio.SeqIO.write()', as in the following
example:
<<from Bio import SeqIO
record_iterator = SeqIO.parse(open("ls_orchid.gbk"), "genbank")
out_handle = open("ls_orchid.tab", "w")
SeqIO.write(record_iterator, out_handle, "tab")
out_handle.close()
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
(2) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(3) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(4) http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html
(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta
(8) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
(9) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk
Chapter 6 Sequence Alignment Input/Output, and Alignment Tools
*****************************************************************
In this chapter we'll discuss the 'Bio.AlignIO' module, which is very
similar to the 'Bio.SeqIO' module from the previous chapter, but deals
with 'Alignment' objects rather than 'SeqRecord' objects. This aims to
provide a simple interface for working with assorted sequence alignment
file formats in a uniform way.
Note that both 'Bio.SeqIO' and 'Bio.AlignIO' can read and write sequence
alignment files. The appropriate choice will depend largely on what you
want to do with the data.
The final part of this chapter is about our command line wrappers for
common multiple sequence alignment tools like ClustalW and MUSCLE.
6.1 Parsing or Reading Sequence Alignments
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We have two functions for reading in sequence alignments,
'Bio.AlignIO.read()' and 'Bio.AlignIO.parse()' which, following the
convention introduced in 'Bio.SeqIO', are for files containing one or
multiple alignments respectively.
Using 'Bio.AlignIO.parse()' will return an iterator which gives
'Alignment' objects. Iterators are typically used in a for loop.
Examples of situations where you will have multiple different alignments
include resampled alignments from the PHYLIP tool 'seqboot', or multiple
pairwise alignments from the EMBOSS tools 'water' or 'needle', or Bill
Pearson's FASTA tools.
However, in many situations you will be dealing with files which contain
only a single alignment. In this case, you should use the
'Bio.AlignIO.read()' function which returns a single 'Alignment' object.
Both functions expect two mandatory arguments:
1. The first argument is a handle to read the data from, typically an
open file (see Section 18.1).
2. The second argument is a lower case string specifying the alignment
format. As in 'Bio.SeqIO' we don't try and guess the file format for
you! See http://biopython.org/wiki/AlignIO for a full listing of
supported formats.

There is also an optional 'seq_count' argument which is discussed in
Section 6.1.3 below for dealing with ambiguous file formats which may
contain more than one alignment.
Biopython 1.49 introduced a further optional 'alphabet' argument
allowing you to specify the expected alphabet. This can be useful as
many alignment file formats do not explicitly label the sequences as
RNA, DNA or protein -- which means 'Bio.AlignIO' will default to using a
generic alphabet.

6.1.1 Single Alignments
========================

As an example, consider the following annotation rich protein
alignment in the PFAM or Stockholm file format:
<<# STOCKHOLM 1.0
#=GS COATB_BPIKE/30-81  AC P03620.1
#=GS COATB_BPIKE/30-81  DR PDB; 1ifl ; 1-52;
#=GS Q9T0Q8_BPIKE/1-52  AC Q9T0Q8.1
#=GS COATB_BPI22/32-83  AC P15416.1
#=GS COATB_BPM13/24-72  AC P69541.1
#=GS COATB_BPM13/24-72  DR PDB; 2cpb ; 1-49;
#=GS COATB_BPM13/24-72  DR PDB; 2cps ; 1-49;
#=GS COATB_BPZJ2/1-49   AC P03618.1
#=GS Q9T0Q9_BPFD/1-49   AC Q9T0Q9.1
#=GS Q9T0Q9_BPFD/1-49   DR PDB; 1nh4 A; 1-49;
#=GS COATB_BPIF1/22-73  AC P03619.2
#=GS COATB_BPIF1/22-73  DR PDB; 1ifk ; 1-50;
COATB_BPIKE/30-81            AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
#=GR COATB_BPIKE/30-81  SS   -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH----
Q9T0Q8_BPIKE/1-52            AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
COATB_BPI22/32-83            DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
COATB_BPM13/24-72            AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR COATB_BPM13/24-72  SS   ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT--
COATB_BPZJ2/1-49             AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
Q9T0Q9_BPFD/1-49             AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR Q9T0Q9_BPFD/1-49   SS   ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH--
COATB_BPIF1/22-73            FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#=GR COATB_BPIF1/22-73  SS   XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH---
#=GC SS_cons                 XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC--
#=GC seq_cons                AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA
//
This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM
entry, downloaded as a compressed archive from
http://pfam.sanger.ac.uk/family/alignment/download/gzipped?acc=PF05371&alnType=seed.
We can load this file as follows (assuming it has been saved to disk as
"PF05371_seed.sth" in the current working directory):
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print alignment

This code will print out a summary of the alignment:
<<SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
You'll notice in the above output the sequences have been truncated.
We could instead write our own code to format this as we please by
iterating over the rows as 'SeqRecord' objects:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print "Alignment length %i" % alignment.get_alignment_length()
for record in alignment :
    print "%s - %s" % (record.seq, record.id)

This will give the following output:
<<Alignment length 52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
You could also use the alignment object's 'format' method to show it
in a particular file format -- see Section 6.2.2 for details.
Did you notice in the raw file above that several of the sequences
include database cross-references to the PDB and the associated known
secondary structure? Try this:
<<for record in alignment :
    if record.dbxrefs :
        print record.id, record.dbxrefs

This gives:
<<COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

To have a look at all the sequence annotation, try this:
<<for record in alignment :
    print record.id
    print record.annotations

Sanger provide a nice web interface at
http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you
download this alignment in several other formats. This is what the file
looks like in the FASTA file format:
<<>COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
>Q9T0Q8_BPIKE/1-52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
>COATB_BPI22/32-83
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
>COATB_BPM13/24-72
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
>Q9T0Q9_BPFD/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPIF1/22-73
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA

Note the website should have an option for showing gaps as periods
(dots) or dashes; we've shown dashes above. Assuming you download and
save this as file "PF05371_seed.faa" then you can load it with almost
exactly the same code:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.faa"), "fasta")
print alignment
All that has changed in this code is the filename and the format
string. You'll get the same output as before; the sequences and record
identifiers are the same. However, as you should expect, if you check
each 'SeqRecord' there is no annotation nor database cross-references,
because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used
'Bio.AlignIO' to convert the original Stockholm format file into a FASTA
file yourself (see below).
With any supported file format, you can load an alignment in exactly
the same way just by changing the format string. For example, use
"phylip" for PHYLIP files, "nexus" for NEXUS files or "emboss" for the
alignments output by the EMBOSS tools. There is a full listing on the
wiki page (http://biopython.org/wiki/AlignIO) and in the built in
documentation (also online (1)):
<<>>> from Bio import AlignIO
>>> help(AlignIO)

6.1.2 Multiple Alignments
==========================

The previous section focused on reading files containing a single
alignment. In general however, files can contain more than one
alignment, and to read these files we must use the 'Bio.AlignIO.parse()'
function.
Suppose you have a small alignment in PHYLIP format.
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools,
one of the steps would be to create a set of many resampled alignments
using the tool 'seqboot'. This would give output which has been
abbreviated here for conciseness.
If you wanted to read this in using 'Bio.AlignIO' you could use:
<<from Bio import AlignIO
alignments = AlignIO.parse(open("resampled.phy"), "phylip")
for alignment in alignments :
    print alignment

This would give the following output, again abbreviated for display:
<<SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
SingleLetterAlphabet() alignment with 5 rows and 6 columns
As with the function 'Bio.SeqIO.parse()', using 'Bio.AlignIO.parse()'
returns an iterator. If you want to keep all the alignments in memory at
once, which will allow you to access them in any order, then turn the
iterator into a list:
<<from Bio import AlignIO
alignments = list(AlignIO.parse(open("resampled.phy"), "phylip"))
last_align = alignments[-1]
first_align = alignments[0]

6.1.3 Ambiguous Alignments
===========================

Many alignment file formats can explicitly store more than one
alignment, and the division between each alignment is clear. However,
when a general sequence file format has been used there is no such block
structure. The most common such situation is when alignments have been
saved in the FASTA file format. For example consider the following:
This could be a single alignment containing six sequences (with
repeated identifiers). Or, judging from the identifiers, this is
probably two different alignments each with three sequences, which
happen to all have the same length.
What about this next example?
Again, this could be a single alignment with six sequences. However,
this time based on the identifiers we might guess this is three pairwise
alignments which by chance have all got the same lengths.
This final example is similar:
<<>Alpha
ACTACGACTAGCTCAG--G
>XXX
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>YYY
ACTACGGCAAGCACAGG
>Alpha
--ACTACGAC--TAGCTCAGG
>ZZZ
GGACTACGACAATAGCTCAGG

In this third example, because of the differing lengths, this cannot
be treated as a single alignment containing all six records. However, it
could be three pairwise alignments.
Clearly trying to store more than one alignment in a FASTA file is not
ideal. However, if you are forced to deal with these as input files
'Bio.AlignIO' can cope with the most common situation where all the
alignments have the same number of records. One example of this is a
collection of pairwise alignments, which can be produced by the EMBOSS
tools 'needle' and 'water' -- although in this situation, 'Bio.AlignIO'
should be able to understand their native output using "emboss" as the
format string.
To interpret these FASTA examples as several separate alignments, we
can use 'Bio.AlignIO.parse()' with the optional 'seq_count' argument
which specifies how many sequences are expected in each alignment (in
these examples, 3, 2 and 2 respectively). For example, using the third
example as the input data:
<<for alignment in AlignIO.parse(handle, "fasta", seq_count=2) :
    print "Alignment length %i" % alignment.get_alignment_length()
    for record in alignment :
        print "%s - %s" % (record.seq, record.id)
This gives:
<<Alignment length 19
ACTACGACTAGCTCAG--G - Alpha
ACTACCGCTAGCTCAGAAG - XXX
Alignment length 17
ACTACGACTAGCTCAGG - Alpha
ACTACGGCAAGCACAGG - YYY
Alignment length 21
--ACTACGAC--TAGCTCAGG - Alpha
GGACTACGACAATAGCTCAGG - ZZZ

Using 'Bio.AlignIO.read()' or 'Bio.AlignIO.parse()' without the
'seq_count' argument would give a single alignment containing all six
records for the first two examples. For the third example, an exception
would be raised because the lengths differ, preventing them being turned
into a single alignment.
If the file format itself has a block structure allowing 'Bio.AlignIO'
to determine the number of sequences in each alignment directly, then
the 'seq_count' argument is not needed. If it is supplied, and doesn't
agree with the file contents, an error is raised.
Note that this optional 'seq_count' argument assumes each alignment in
the file has the same number of sequences. Hypothetically you may come
across stranger situations, for example a FASTA file containing several
alignments each with a different number of sequences -- although I would
love to hear of a real world example of this. Assuming you cannot get
the data in a nicer file format, there is no straightforward way to
deal with this using 'Bio.AlignIO'. In this case, you could consider
reading in the sequences themselves using 'Bio.SeqIO' and batching them
together to create the alignments as appropriate.
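One way to batch records read with 'Bio.SeqIO' into fixed-size groups
is a small helper based on 'itertools.islice'. This is a sketch under
the assumption that every alignment in the file has the same known
number of sequences (the batch size); each batch could then be turned
into an alignment object:

```python
from itertools import islice

def batch_iterator(records, batch_size):
    # Yield successive lists of up to batch_size items from any
    # iterator, e.g. the record iterator from Bio.SeqIO.parse().
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch
```

Note the final batch may be short if the total is not an exact
multiple of the batch size, which for this use case would indicate a
malformed input file.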
6.2 Writing Alignments
*=*=*=*=*=*=*=*=*=*=*=*

We've talked about using 'Bio.AlignIO.read()' and
'Bio.AlignIO.parse()' for alignment input (reading files), and now we'll
look at 'Bio.AlignIO.write()' which is for alignment output (writing
files). This is a function taking three arguments: some 'Alignment'
objects, a handle to write to, and a sequence format.
Here is an example, where we start by creating a few 'Alignment'
objects the hard way (by hand, rather than by loading them from a file):
<<from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped
alphabet = Gapped(IUPAC.unambiguous_dna)

align1 = Alignment(alphabet)
align1.add_sequence("Alpha", "ACTGCTAGCTAG")
align1.add_sequence("Beta",  "ACT-CTAGCTAG")
align1.add_sequence("Gamma", "ACTGCTAGDTAG")

align2 = Alignment(alphabet)
align2.add_sequence("Delta",   "GTCAGC-AG")
align2.add_sequence("Epsilon", "GACAGCTAG")
align2.add_sequence("Zeta",    "GTCAGCTAG")

align3 = Alignment(alphabet)
align3.add_sequence("Eta",   "ACTAGTACAGCTG")
align3.add_sequence("Theta", "ACTAGTACAGCT-")
align3.add_sequence("Iota",  "-CTACTACAGGTG")

my_alignments = [align1, align2, align3]
Now we have a list of 'Alignment' objects, we'll write them to a
PHYLIP format file:
<<from Bio import AlignIO
handle = open("my_example.phy", "w")
AlignIO.write(my_alignments, handle, "phylip")
handle.close()

And if you open this file in your favourite text editor it should look
something like this (abbreviated here):
<<Theta      ACTAGTACAG CT-

It's more common to want to load an existing alignment, and save that,
perhaps after some simple manipulation like removing certain rows or
columns.
Suppose you wanted to know how many alignments the
'Bio.AlignIO.write()' function wrote to the handle? If your alignments
were in a list like the example above, you could just use
'len(my_alignments)', however you can't do that when your records come
from a generator/iterator. Therefore as of Biopython 1.49, the
'Bio.AlignIO.write()' function returns the number of alignments written
to the file.
6.2.1 Converting between sequence alignment file formats
=========================================================

Converting between sequence alignment file formats with 'Bio.AlignIO'
works in the same way as converting between sequence file formats with
'Bio.SeqIO' -- we generally load the alignment(s) using
'Bio.AlignIO.parse()' and then save them using 'Bio.AlignIO.write()'.
For this example, we'll load the PFAM/Stockholm format file used
earlier and save it as a Clustal W format file:
<<from Bio import AlignIO
alignments = AlignIO.parse(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.aln", "w")
AlignIO.write(alignments, handle, "clustal")
handle.close()

The 'Bio.AlignIO.write()' function expects to be given multiple
alignment objects. In the example above we gave it the alignment
iterator returned by 'Bio.AlignIO.parse()'.
In this case, we know there is only one alignment in the file so we
could have used 'Bio.AlignIO.read()' instead, but notice we have to pass
this alignment to 'Bio.AlignIO.write()' as a single element list:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.aln", "w")
AlignIO.write([alignment], handle, "clustal")
handle.close()
Either way, you should end up with the same new Clustal W format file
"PF05371_seed.aln" with the following content:
<<CLUSTAL X (1.81) multiple sequence alignment


COATB_BPIKE/30-81             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/1-52             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/32-83             DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/24-72             AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/1-49              AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/1-49              AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/22-73             FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS

COATB_BPIKE/30-81             KA
Q9T0Q8_BPIKE/1-52             RA
COATB_BPI22/32-83             KA
COATB_BPM13/24-72             KA
COATB_BPZJ2/1-49              KA
Q9T0Q9_BPFD/1-49              KA
COATB_BPIF1/22-73             RA
Alternatively, you could make a PHYLIP format file which we'll name
"PF05371_seed.phy":
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
handle = open("PF05371_seed.phy", "w")
AlignIO.write([alignment], handle, "phylip")
handle.close()

This time the output looks like this:
<< 7 52
COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
One of the big handicaps of the PHYLIP alignment file format is that
the sequence identifiers are strictly truncated at ten characters. In
this example, as you can see the resulting names are still unique -- but
they are not very readable. In this particular case, there is no clear
way to compress the identifiers, but for the sake of argument you may
want to assign your own names or numbering system. The following bit of
code manipulates the record identifiers before saving the output:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
name_mapping = {}
for i, record in enumerate(alignment) :
    name_mapping[i] = record.id
    record.id = "seq%i" % i
print name_mapping

handle = open("PF05371_seed.phy", "w")
AlignIO.write([alignment], handle, "phylip")
handle.close()
This code used a Python dictionary to record a simple mapping from the
new sequence names back to the original identifiers:
<<{0: 'COATB_BPIKE/30-81', 1: 'Q9T0Q8_BPIKE/1-52', 2:
'COATB_BPI22/32-83', ...}
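That mapping also lets you restore the original identifiers once the
file has been written. Here is a plain-Python sketch of the round trip,
using a hypothetical minimal Record class in place of 'SeqRecord':

```python
class Record:
    # Hypothetical stand-in for SeqRecord: all we need is an .id attribute.
    def __init__(self, id):
        self.id = id

records = [Record("COATB_BPIKE/30-81"), Record("Q9T0Q8_BPIKE/1-52")]

name_mapping = {}
for i, record in enumerate(records):
    name_mapping[i] = record.id   # remember the original name
    record.id = "seq%i" % i       # short, PHYLIP-safe name

# ...write the PHYLIP file here, while the short names are in place...

for i, record in enumerate(records):
    record.id = name_mapping[i]   # restore the original names
```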
Here is the new PHYLIP format output:
<< 7 52
seq0       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
seq1       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
seq2       DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
seq3       AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq4       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
seq5       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq6       FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS

In general, because of the identifier limitation, working with PHYLIP
file formats shouldn't be your first choice. Using the PFAM/Stockholm
format on the other hand allows you to record a lot of additional
annotation.
6.2.2 Getting your Alignment objects as formatted strings
==========================================================

The 'Bio.AlignIO' interface is based on handles, which means if you
want to get your alignment(s) into a string in a particular file format
you need to do a little bit more work (see below). However, you will
probably prefer to take advantage of the new 'format()' method added to
the 'Alignment' object in Biopython 1.48. This takes a single mandatory
argument, a lower case string which is supported by 'Bio.AlignIO' as an
output format. For example:
<<from Bio import AlignIO
alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm")
print alignment.format("clustal")
As described in Section 4.5, the 'SeqRecord' object has a similar
method using output formats supported by 'Bio.SeqIO'.
Internally the 'format()' method is using the 'StringIO' string based
handle and calling 'Bio.AlignIO.write()'. You can do this in your own
code if for example you are using an older version of Biopython:
<<from Bio import AlignIO
from StringIO import StringIO

alignments = AlignIO.parse(open("PF05371_seed.sth"), "stockholm")

out_handle = StringIO()
AlignIO.write(alignments, out_handle, "clustal")
clustal_data = out_handle.getvalue()
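The same in-memory handle pattern works with anything that writes to a
handle. Note that under Python 3 the 'StringIO' class lives in the 'io'
module instead; a minimal sketch of the idiom on its own:

```python
from io import StringIO  # Python 3 location; Python 2 used the StringIO module

out_handle = StringIO()
# Anything expecting a writable handle can write here instead of a file.
out_handle.write("CLUSTAL X (1.81) multiple sequence alignment\n")
clustal_data = out_handle.getvalue()  # the accumulated text as one string
```

The handle accumulates everything written to it, and 'getvalue()' hands
it back as a single string without any temporary file on disk.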
6.3 Alignment Tools
*=*=*=*=*=*=*=*=*=*=

There are lots of algorithms out there for aligning sequences, both
pairwise alignments and multiple sequence alignments. These calculations
are relatively slow, and you generally wouldn't want to write such an
algorithm in Python. Instead, you can use Biopython to invoke a command
line tool on your behalf. Normally you would:

1. Prepare an input file of your unaligned sequences, typically this
will be a FASTA file which you might create using 'Bio.SeqIO' (see
Chapter 5).
2. Call the command line tool to process this input file, typically
via one of Biopython's command line wrappers (which we'll discuss
here).
3. Read the output from the tool, i.e. your aligned sequences,
typically using 'Bio.AlignIO' (see earlier in this chapter).
All the command line wrappers we're going to talk about in this
chapter follow the same style. You create a command line object
specifying the options (e.g. the input filename and the output
filename), then invoke this command line via a Python operating system
call (e.g. using the subprocess module).
Most of these wrappers are defined in the 'Bio.Align.Applications'
module:
<<>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline',
'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline',
'TCoffeeCommandline' ...]

(Ignore the entries starting with an underscore -- these have special
meaning in Python.) The module 'Bio.Emboss.Applications' has wrappers
for some of the EMBOSS suite (2), including needle and water, which are
described below in Section 6.3.5. We won't explore all these alignment
tools here in this section, just a sample, but the same principles apply.
6.3.1 ClustalW
===============

ClustalW is a popular command line tool for multiple sequence
alignment (there is also a graphical interface called ClustalX).
Biopython's 'Bio.Align.Applications' module has a wrapper for this
alignment tool (and several others).
Before trying to use ClustalW from within Python, you should first try
running the ClustalW tool yourself by hand at the command line, to
familiarise yourself with the other options. You'll find the Biopython
wrapper is very faithful to the actual command line API:
<<>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)

For the most basic usage, all you need is to have a FASTA input file,
such as opuntia.fasta (3) (available online or in the Doc/examples
subdirectory of the Biopython source code). This is a small FASTA file
containing seven prickly-pear DNA sequences (from the cactus family
Opuntia).
By default ClustalW will generate an alignment and guide tree file
with names based on the input FASTA file, in this case opuntia.aln and
opuntia.dnd, but you can override this or make it explicit:
<<>>> from Bio.Align.Applications import ClustalwCommandline
>>> cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
>>> print cline
clustalw2 -infile=opuntia.fasta

Notice here we have given the executable name as clustalw2, indicating
we have version two installed, which has a different filename to version
one (clustalw, the default). Fortunately both versions support the same
set of arguments at the command line (and indeed, should be functionally
identical).
You may find that even though you have ClustalW installed, the above
command doesn't work -- you may get a message about "command not found"
(especially on Windows). This indicates that the ClustalW executable is
not on your PATH (an environment variable, a list of directories to be
searched). You can either update your PATH setting to include the
location of your copy of the ClustalW tools (how you do this will depend
on your OS), or simply type in the full path of the tool. You can also
tell Biopython the full path to the tool, for example:
<<>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
>>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
>>> cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")

Remember, in Python a default string '\n' and/or '\t' translates as a
new line and/or a tab -- which is why we've put a letter "r" at the
start for a raw string that isn't translated in this way. This is
generally good practice when specifying a Windows style file name.
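The difference is easy to check interactively; in a normal string '\n'
is a single newline character, while in a raw string the backslash
survives:

```python
# "\n" in a normal string is one character: a newline.
assert len("\n") == 1
# In a raw string the backslash is kept, giving two characters.
assert len(r"\n") == 2
# This is why r"C:\new clustal" keeps its backslash and "n" intact,
# rather than putting a newline in the middle of the path.
path = r"C:\new clustal"
assert "\n" not in path
```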
You can now run this command line, and in Python the recommended way
to run another program is to use the 'subprocess' module. This replaces
older options like the 'os.system()' and the 'os.popen*' functions.
Now, at this point it helps to know how command line tools "work".
When you run a tool at the command line, it will often print text output
directly to screen. This text can be captured or redirected, via two
"pipes", called standard output (the normal results) and standard error
(for error messages and debug messages). There is also standard input,
which is any text fed into the tool. These names get shortened to stdin,
stdout and stderr. When the tool finishes, it has a return code (an
integer), which by convention is zero for success.
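This pattern can be tried safely with any command; the sketch below
uses the Python interpreter itself as a stand-in child process (so it
runs anywhere, without ClustalW installed), capturing stdout, stderr
and the return code:

```python
import subprocess
import sys

# Launch a child process; here the "tool" is just Python printing a line.
child = subprocess.Popen([sys.executable, "-c", "print('hello')"],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
stdout_data, stderr_data = child.communicate()  # wait and collect the pipes
return_code = child.returncode  # by convention, zero means success
```

Passing the command as a list of arguments avoids any shell quoting
issues; with a real tool like ClustalW the list would start with the
executable name followed by its options.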
Now let's try an example. Unfortunately there are some subtle
differences depending on your Operating System (e.g. Windows versus
Unix) and how you are running Python (e.g. at the command line, or in
IDLE). You may see a second terminal window appear while ClustalW works.
You may see the ClustalW output appear at your Python prompt.
<<>>> import sys
>>> import subprocess
>>> return_code = subprocess.call(str(cline),
...                               shell=(sys.platform!="win32"))
>>> print return_code

In the case of ClustalW, when run at the command line all the important
output is written directly to the output files. Everything printed to
screen while you wait (via stdout or stderr) is boring and can be
ignored (assuming it worked). You can explicitly tell ClustalW to send
this output to "dev null", a kind of black hole for command line tool
output:
<<>>> import os
>>> import sys
>>> import subprocess
>>> return_code = subprocess.call(str(cline),
...                               stdout=open(os.devnull, "w"),
...                               stderr=open(os.devnull, "w"),
...                               shell=(sys.platform!="win32"))
>>> print return_code

This time ClustalW should be "quiet", because all its terminal output
has been discarded. What we care about are the two output files, the
alignment and the guide tree. We didn't tell ClustalW what filenames to
use, but it defaults to picking names based on the input file. In this
case the output should be in the file 'opuntia.aln'. You should be able
to work out how to read in the alignment using 'Bio.AlignIO' by now:
<<>>> from Bio import AlignIO
>>> align = AlignIO.read(open("opuntia.aln"), "clustal")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
In case you are interested (and this is an aside from the main thrust
of this chapter), the opuntia.dnd file ClustalW creates is just a
standard Newick tree file, and 'Bio.Nexus' can parse these. You can
either parse the file directly (as though it were a light weight NEXUS
file), or just go directly to a tree object:
<<>>> from Bio.Nexus import Trees
>>> tree_string = open("opuntia.dnd").read()
>>> tree = Trees.Tree(tree_string)
((gi|6273291|gb|AF191665.1|AF191665,(gi|6273290|gb|AF191664.1|AF191664,
gi|6273289|gb|AF191663.1|AF191663)),(gi|6273287|gb|AF191661.1|AF191661,
gi|6273286|gb|AF191660.1|AF191660),(gi|6273285|gb|AF191659.1|AF191659,
gi|6273284|gb|AF191658.1|AF191658));
>>> print tree.display()
  #  taxon                              prev  succ       brlen blen (sum) support comment
  0  -                                  None  [1, 6, 9]  0.0
  2  gi|6273291|gb|AF191665.1|AF191665  1     []         0.00418
  3  -                                  1     [4, 5]     0.00083
  4  gi|6273290|gb|AF191664.1|AF191664  3     []         0.00189
  5  gi|6273289|gb|AF191663.1|AF191663  3     []         0.00145
  6  -                                  0     [7, 8]     0.00014
  7  gi|6273287|gb|AF191661.1|AF191661  6     []         0.00489
  8  gi|6273286|gb|AF191660.1|AF191660  6     []         0.00295
  9  -                                  0     [10, 11]   0.00125
 10  gi|6273285|gb|AF191659.1|AF191659  9     []         0.00094
 11  gi|6273284|gb|AF191658.1|AF191658  9     []         0.00018
The spacing has been adjusted here for display purposes. The tree
object is actually pretty powerful! Have a look at the list of methods
with dir(tree) to get a hint of this...
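As an aside on the format itself: a Newick file is just nested
parentheses with comma-separated leaf names (plus optional ':' branch
lengths). Extracting the leaf names from a simple tree needs only a
regular expression; this toy sketch uses a made-up miniature tree and
ignores branch lengths, so it is not what 'Bio.Nexus' does internally:

```python
import re

# A hypothetical miniature Newick string with five leaves.
newick = "((A,(B,C)),(D,E));"
# Pull out runs of name characters; real Newick handling would also
# need to strip ":0.0123" branch lengths and quoted labels.
leaves = re.findall(r"[A-Za-z0-9_.|]+", newick)
```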
6.3.2 MUSCLE
=============

MUSCLE is a more recent multiple sequence alignment tool than
ClustalW, and Biopython also has a wrapper for it under the
'Bio.Align.Applications' module. As before, we recommend you try using
MUSCLE from the command line before trying it from within Python, as the
Biopython wrapper is very faithful to the actual command line API:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> help(MuscleCommandline)

For the most basic usage, all you need is to have a FASTA input file,
such as opuntia.fasta (4) (available online or in the Doc/examples
subdirectory of the Biopython source code). You can then tell MUSCLE to
read in this FASTA file, and write the alignment to an output file:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
>>> print cline
muscle -in opuntia.fasta -out opuntia.txt

Note that MUSCLE uses "-in" and "-out" but in Biopython we have to use
"input" and "out" as the keyword arguments or property names. This is
because "in" is a reserved word in Python.
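You can confirm which names are reserved with the standard library's
'keyword' module:

```python
import keyword

assert keyword.iskeyword("in")         # "in" is reserved, hence the rename
assert not keyword.iskeyword("input")  # "input" is an ordinary name
```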
By default MUSCLE will output the alignment as a FASTA file (using
gapped sequences). The 'Bio.AlignIO' module should be able to read this
alignment using format="fasta". You can also ask for ClustalW-like
output:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta",
...                           out="opuntia.aln", clw=True)
>>> print cline
muscle -in opuntia.fasta -out opuntia.aln -clw

Or, strict ClustalW output where the original ClustalW header line is
used for maximum compatibility:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta",
...                           out="opuntia.aln", clwstrict=True)
>>> print cline
muscle -in opuntia.fasta -out opuntia.aln -clwstrict
The 'Bio.AlignIO' module should be able to read these alignments using
format="clustal".
MUSCLE can also output in GCG MSF format (using the msf argument), but
Biopython can't currently parse that, or using HTML which would give a
human readable web page (not suitable for parsing).
You can also set the other optional parameters, for example the
maximum number of iterations. See the built in help for details.
You would then run the MUSCLE command line string as described above
for ClustalW, and parse the output using 'Bio.AlignIO' to get an
alignment object.
6.3.3 MUSCLE using stdout
==========================

Using a MUSCLE command line as in the examples above will write the
alignment to a file. This means there will be no important information
written to the standard out (stdout) or standard error (stderr) handles.
However, by default MUSCLE will write the alignment to standard output
(stdout). We can take advantage of this to avoid having a temporary
output file! For example:
<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta")
>>> print cline
muscle -in opuntia.fasta
4322
Now, let's run this and capture the output as handles. Remember that
4323
MUSCLE defaults to using FASTA as the output format:
<<>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(cline),
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          shell=(sys.platform!="win32"))
>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "fasta")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658
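The same capture-stdout pattern works for any command-line tool. As a minimal, Biopython-free sketch, here we use the Python interpreter itself as a stand-in child process (a hypothetical substitute for muscle, so the example runs anywhere):

```python
import subprocess
import sys

# Stand-in for a tool that writes its result to stdout (here: fake FASTA).
cmd = [sys.executable, "-c", "print('>seq1'); print('ACGT')"]
child = subprocess.Popen(cmd,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
out, err = child.communicate()  # read stdout/stderr and wait for exit
print(out.decode().strip())     # the "alignment" the child wrote to stdout
```

With the real MUSCLE command line, 'child.stdout' would be passed straight to 'AlignIO.read' as shown above.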
6.3.4 MUSCLE using stdin and stdout
====================================

We don't actually need to have our FASTA input sequences prepared in a
file, because by default MUSCLE will read in the input sequences from
standard input (stdin).

First, we'll need some unaligned sequences in memory as 'SeqRecord'
objects. For this demonstration I'm going to use a filtered version of
the original FASTA file (using a generator expression), taking just six
of the seven sequences:

<<>>> from Bio import SeqIO
>>> records = (r for r in SeqIO.parse(open("opuntia.fasta"), "fasta")
...            if len(r) < 900)

Then we create the MUSCLE command line, leaving the input and output
to their defaults (stdin and stdout). I'm also going to ask for strict
ClustalW format as the output format:

<<>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(clwstrict=True)
Now comes the clever bit, using the 'subprocess' module, stdin and
stdout:

<<>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(cline),
...                          stdin=subprocess.PIPE,
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          shell=(sys.platform!="win32"))

That should start MUSCLE, but it will be sitting waiting for its FASTA
input sequences, which we must supply via its stdin handle:

<<>>> SeqIO.write(records, child.stdin, "fasta")
>>> child.stdin.close()
After writing the six sequences to the handle, MUSCLE will still be
waiting to see if that is all the FASTA sequences or not -- so we must
signal that this is all the input data by closing the handle. At that
point MUSCLE should start to run, and we can ask for the output:

<<>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "clustal")
>>> print align
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165
Wow! There we are with a new alignment of just the six records,
without having created a temporary FASTA input file, or a temporary
alignment output file. However, a word of caution: dealing with errors
with this style of calling external programs is much more complicated.
It also becomes far harder to diagnose problems, because you can't try
running MUSCLE manually outside of Biopython (because you don't have the
input file to supply). There can also be subtle cross-platform issues
(e.g. Windows versus Linux), and how you run your script can have an
impact (e.g. at the command line, from IDLE or an IDE, or as a GUI
script). These are all generic Python issues though, and not specific to
Biopython.
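The write-to-stdin, read-from-stdout dance generalises to any filter-style program. Here is a minimal sketch with a stand-in child process that upper-cases whatever it receives (a hypothetical substitute for MUSCLE, so the example runs without it installed):

```python
import subprocess
import sys

# Stand-in filter: reads all of stdin, writes the upper-cased text to stdout.
filter_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
child = subprocess.Popen([sys.executable, "-c", filter_code],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE)
child.stdin.write(b">seq1\nacgt\n")
child.stdin.close()            # signal end of input, just as with MUSCLE
result = child.stdout.read()   # then collect what the child produced
child.wait()
print(result.decode())
```

With MUSCLE itself, 'SeqIO.write(records, child.stdin, "fasta")' plays the role of the 'write' call, and 'AlignIO.read(child.stdout, "clustal")' the role of the 'read'.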
6.3.5 EMBOSS needle and water
==============================

The EMBOSS (5) suite includes the water and needle tools for
Smith-Waterman algorithm local alignment, and Needleman-Wunsch global
alignment. The tools share the same style interface, so switching
between the two is trivial -- we'll just use needle here.

Suppose you want to do a global pairwise alignment between two
sequences, prepared in FASTA format as follows:
>HBA_HUMAN
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR

in a file alpha.faa, and secondly in a file beta.faa:

>HBB_HUMAN
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH
Let's start by creating a complete needle command line object in one
go:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline(asequence="alpha.faa",
...                           bsequence="beta.faa",
...                           gapopen=10, gapextend=0.5,
...                           outfile="needle.txt")
>>> print cline
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa
 -gapopen=10 -gapextend=0.5
Why not try running this by hand at the command prompt? You should see
it does a pairwise comparison and records the output in the file
needle.txt (in the default EMBOSS alignment file format).

Even if you have EMBOSS installed, running this command may not work
-- you might get a message about "command not found" (especially on
Windows). This probably means that the EMBOSS tools are not on your PATH
environment variable. You can either update your PATH setting, or simply
tell Biopython the full path to the tool, for example:
<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
...                           asequence="alpha.faa",
...                           bsequence="beta.faa",
...                           gapopen=10, gapextend=0.5,
...                           outfile="needle.txt")

Remember in Python that for a default string '\n' or '\t' means a new
line or a tab -- which is why we've put a letter "r" at the start for a
raw string.
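A quick illustration of the difference:

```python
# In an ordinary string literal "\t" collapses to a single tab character;
# the r prefix keeps the backslash and the letter as two separate characters.
plain = "C:\table"
raw = r"C:\table"
print(len(plain))     # 7 -- "\t" became one tab character
print(len(raw))       # 8 -- backslash preserved
print("\t" in plain)  # True
print("\\" in raw)    # True
```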
At this point it might help to try running the EMBOSS tools yourself
by hand at the command line, to familiarise yourself with the other
options and compare them to the Biopython help text:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> help(NeedleCommandline)
Note that you can also specify (or change or look at) the settings
like this:

<<>>> from Bio.Emboss.Applications import NeedleCommandline
>>> cline = NeedleCommandline()
>>> cline.asequence="alpha.faa"
>>> cline.bsequence="beta.faa"
>>> cline.gapopen=10
>>> cline.gapextend=0.5
>>> cline.outfile="needle.txt"
>>> print cline
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa
 -gapopen=10 -gapextend=0.5
>>> print cline.outfile
needle.txt
Next we want to use Python to run this command for us. As explained
above, for full control, we recommend you use the built-in Python
subprocess module. However, for simple usage Biopython includes a
wrapper function that usually suffices:

<<>>> import subprocess
>>> import sys
>>> return_code = subprocess.call(str(cline),
...                               shell=(sys.platform!="win32"))
Needleman-Wunsch global alignment of two sequences
>>> print return_code
0

In the above, all we really care about is that the return code is zero
(success). If you want to hide the message EMBOSS prints out, then (as
in the ClustalW example above) send the stdout and stderr to dev null,
or just set the EMBOSS auto argument to True.
Next we can load the output file with 'Bio.AlignIO' as discussed
earlier in this chapter, as the emboss format:

<<>>> from Bio import AlignIO
>>> align = AlignIO.read(open("needle.txt"), "emboss")
>>> print align
SingleLetterAlphabet() alignment with 2 rows and 149 columns
MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR HBA_HUMAN
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRF...KYH HBB_HUMAN
In this example, we told EMBOSS to write the output to a file, but you
can tell it to write the output to stdout instead (useful if you don't
want a temporary output file to get rid of), and also read the input
from stdin (just like in the MUSCLE example in the section above).

This has only scratched the surface of what you can do with needle and
water. One useful trick is that the second file can contain multiple
sequences (say five), and then EMBOSS will do five pairwise alignments.

Note - Biopython includes its own pairwise alignment code in the
'Bio.pairwise2' module (written in C for speed, but with a pure Python
fallback available too). This doesn't work with alignment objects, so we
have not covered it within this chapter. See the module's docstring
(built-in help) for details.
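To see what needle's Needleman-Wunsch algorithm is doing under the hood, here is a toy, pure-Python scoring sketch (match +1, mismatch -1, linear gap -1). Real needle uses an affine gap model and a substitution matrix, so this is an illustration of the recurrence only, not a substitute:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score via dynamic programming."""
    # dp[i][j] = best score aligning a[:i] against b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap          # leading gaps in b
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap          # leading gaps in a
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            dp[i][j] = max(diag,           # align a[i-1] with b[j-1]
                           dp[i-1][j] + gap,   # gap in b
                           dp[i][j-1] + gap)   # gap in a
    return dp[len(a)][len(b)]

print(global_alignment_score("ACGT", "ACT"))  # 2 (three matches, one gap)
```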
-----------------------------------

(1) http://biopython.org/DIST/docs/api/Bio.AlignIO-module.html
(2) http://emboss.sourceforge.net/
(3) http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta
(4) http://biopython.org/DIST/docs/tutorial/examples/opuntia.fasta
(5) http://emboss.sourceforge.net/
Chapter 7 BLAST
***************

Hey, everybody loves BLAST right? I mean, geez, how can it get any
easier to do comparisons between one of your sequences and every other
sequence in the known world? But, of course, this section isn't about
how cool BLAST is, since we already know that. It is about the problem
with BLAST -- it can be really difficult to deal with the volume of data
generated by large runs, and to automate BLAST runs in general.

Fortunately, the Biopython folks know this only too well, so they've
developed lots of tools for dealing with BLAST and making things much
easier. This section details how to use these tools and do useful things
with BLAST.

Dealing with BLAST can be split up into two steps, both of which can
be done from within Biopython. Firstly, running BLAST for your query
sequence(s), and getting some output. Secondly, parsing the BLAST output
in Python for further analysis. We'll start by talking about running the
BLAST command line tools locally, and then discuss running BLAST via the
internet.
7.1 Running BLAST locally
*=*=*=*=*=*=*=*=*=*=*=*=*=

Running BLAST locally (as opposed to over the internet, see
Section 7.2) has two advantages:

- Local BLAST may be faster than BLAST over the internet;
- Local BLAST allows you to make your own database to search for
sequences against.

Dealing with proprietary or unpublished sequence data can be another
reason to run BLAST locally. You may not be allowed to redistribute the
sequences, so submitting them to the NCBI as a BLAST query would not be
an option.
Biopython provides lots of nice code to enable you to call local BLAST
executables from your scripts, and have full access to the many command
line options that these executables provide. You can obtain local BLAST
precompiled for a number of platforms at
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/, or can compile it
yourself from the NCBI toolbox (ftp://ftp.ncbi.nlm.nih.gov/toolbox/).

The code for calling local "standalone" BLAST is found in
'Bio.Blast.NCBIStandalone', specifically the functions 'blastall',
'blastpgp' and 'rpsblast', which correspond with the BLAST executables
that their names imply.
Let's use these functions to run 'blastall' against a local database
and return the results. First, we want to set up the paths to everything
that we'll need to do the BLAST. What we need to know is the path to the
database (which should have been prepared using 'formatdb', see
ftp://ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html) to search
against, the path to the file we want to search, and the path to the
'blastall' executable.

On Linux or Mac OS X your paths might look like this:

<<>>> my_blast_db = "/home/mdehoon/Data/Genomes/Databases/bsubtilis"
# I used formatdb to create a BLAST database named bsubtilis
# (for Bacillus subtilis) consisting of the following three files:
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nhr
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nin
# /home/mdehoon/Data/Genomes/Databases/bsubtilis.nsq

>>> my_blast_file = "m_cold.fasta"
# A FASTA file with the sequence I want to BLAST

>>> my_blast_exe = "/usr/local/blast/bin/blastall"
# The name of my BLAST executable
while on Windows you might have something like this:

<<>>> my_blast_db = r"C:\Blast\Data\bsubtilis"
# Assuming you used formatdb to create a BLAST database named bsubtilis
# (for Bacillus subtilis) consisting of the following three files:
# C:\Blast\Data\bsubtilis\bsubtilis.nhr
# C:\Blast\Data\bsubtilis\bsubtilis.nin
# C:\Blast\Data\bsubtilis\bsubtilis.nsq

>>> my_blast_file = "m_cold.fasta"
>>> my_blast_exe = r"C:\Blast\bin\blastall.exe"
The FASTA file used in this example is available here (1), as well as
with the Biopython source code.

Now that we've got that all set, we are ready to run the BLAST and
collect the results. We can do this with two lines:

<<>>> from Bio.Blast import NCBIStandalone
>>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
...     "blastn", my_blast_db, my_blast_file)
Note that the Biopython interfaces to local blast programs return two
values. The first is a handle to the blast output, which is ready to
either be saved or passed to a parser. The second is the possible error
output generated by the blast command. See Section 18.1 for more about
handles.

The error info can be hard to deal with, because if you try to do a
'error_handle.read()' and there was no error info returned, then the
'read()' call will block and not return, locking your script. In my
opinion, the best way to deal with the error is only to print it out if
you are not getting 'result_handle' results to be parsed, but otherwise
to leave it alone.
This command will generate BLAST output in XML format, as that is the
format expected by the XML parser, described in Section 7.4. For plain
text output, use the 'align_view="0"' keyword. To parse text output
instead of XML output, see Section 7.6 below. However, parsing text
output is not recommended, as the BLAST plain text output changes
frequently, breaking our parsers.

If you are interested in saving your results to a file before parsing
them, see Section 7.3. To find out how to parse the BLAST results, go to
Section 7.4.
7.2 Running BLAST over the Internet
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

We use the function 'qblast()' in the 'Bio.Blast.NCBIWWW' module to
call the online version of BLAST. This has three non-optional arguments:

- The first argument is the blast program to use for the search, as a
lower case string. The options and descriptions of the programs are
available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html.
Currently 'qblast' only works with blastn, blastp, blastx, tblastn
and tblastx.
- The second argument specifies the databases to search against.
Again, the options for this are available on the NCBI web pages at
http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.
- The third argument is a string containing your query sequence. This
can either be the sequence itself, the sequence in fasta format, or
an identifier like a GI number.

The 'qblast' function also takes a number of other optional arguments,
which are basically analogous to the different parameters you can set on
the BLAST web page. We'll just highlight a few of them here:
- The 'qblast' function can return the BLAST results in various
formats, which you can choose with the optional 'format_type'
keyword: '"HTML"', '"Text"', '"ASN.1"', or '"XML"'. The default is
'"XML"', as that is the format expected by the parser, described in
Section 7.4 below.
- The argument 'expect' sets the expectation or e-value threshold.

For more about the optional BLAST arguments, we refer you to the
NCBI's own documentation, or that built into Biopython:

<<>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
For example, if you have a nucleotide sequence you want to search
against the non-redundant database using BLASTN, and you know the GI
number of your query sequence, you can use:

<<>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nr", "8332116")
Alternatively, if we have our query sequence already in a FASTA
formatted file, we just need to open the file and read in this record as
a string, and use that as the query argument:

<<>>> from Bio.Blast import NCBIWWW
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string)
We could also have read in the FASTA file as a 'SeqRecord' and then
supplied just the sequence itself:

<<>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nr", record.seq)
Supplying just the sequence means that BLAST will assign an identifier
for your sequence automatically. You might prefer to use the 'SeqRecord'
object's format method to make a fasta string (which will include the
existing identifier):

<<>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nr",
...                                record.format("fasta"))
This approach makes more sense if you have your sequence(s) in a
non-FASTA file format which you can extract using 'Bio.SeqIO' (see
Chapter 5).

Whatever arguments you give the 'qblast()' function, you should get
back your results in a handle object (by default in XML format). The
next step would be to parse the XML output into Python objects
representing the search results (Section 7.4), but you might want to
save a local copy of the output file first.
7.3 Saving BLAST output
*=*=*=*=*=*=*=*=*=*=*=*=

Before parsing the results, it is often useful to save them into a
file so that you can use them later without having to go back and
re-BLAST everything. I find this especially useful when debugging my
code that extracts info from the BLAST files, but it could also be
useful just for making backups of things you've done.

If you don't want to save the BLAST output, you can skip to
Section 7.4. If you do, read on.

We need to be a bit careful, since we can use 'result_handle.read()' to
read the BLAST output only once -- calling 'result_handle.read()' again
returns an empty string. First, we use 'read()' and store all of the
information from the handle into a string:

<<>>> blast_results = result_handle.read()
Next, we save this string in a file:

<<>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(blast_results)
>>> save_file.close()

After doing this, the results are in the file 'my_blast.xml' and the
variable 'blast_results' contains the BLAST results in a string form.
However, the 'parse' function of the BLAST parser (described in
Section 7.4) takes a file-handle-like object, not a plain string. To get
a handle, there are two things you can do:

- Use the Python standard library module 'cStringIO'. The following
code will turn the plain string into a handle, which we can feed
directly into the BLAST parser:

<<>>> import cStringIO
>>> result_handle = cStringIO.StringIO(blast_results)

- Open the saved file for reading. Duh.

<<>>> result_handle = open("my_blast.xml")
Now that we've got the BLAST results back into a handle again, we are
ready to do something with them, so this leads us right into the parsing
section.

7.4 Parsing BLAST output
*=*=*=*=*=*=*=*=*=*=*=*=*
As mentioned above, BLAST can generate output in various formats, such
as XML, HTML, and plain text. Originally, Biopython had a parser for
BLAST plain text and HTML output, as these were the only output formats
supported by BLAST. Unfortunately, the BLAST output in these formats
kept changing, each time breaking the Biopython parsers. As keeping up
with changes in BLAST became a hopeless endeavor, especially with users
running different BLAST versions, we now recommend parsing the output in
XML format, which can be generated by recent versions of BLAST. Not only
is the XML output more stable than the plain text and HTML output, it is
also much easier to parse automatically, making Biopython a whole lot
more stable.

Though deprecated, the parsers for BLAST output in plain text or HTML
output are still available in Biopython (see Section 7.6). Use them at
your own risk: they may or may not work, depending on which BLAST
version you're using.
You can get BLAST output in XML format in various ways. For the
parser, it doesn't matter how the output was generated, as long as it is
in the XML format:

- You can use Biopython to run BLAST locally, as described in
Section 7.1.
- You can use Biopython to run BLAST over the internet, as described
in Section 7.2.
- You can do the BLAST search yourself on the NCBI site through your
web browser, and then save the results. You need to choose XML as the
format in which to receive the results, and save the final BLAST page
you get (you know, the one with all of the interesting results!) to a
file.
- You can also run BLAST locally without using Biopython, and save
the output in a file. Again, you need to choose XML as the format in
which to receive the results.

The important point is that you do not have to use Biopython scripts
to fetch the data in order to be able to parse it.
Doing things in one of these ways, you then need to get a handle to
the results. In Python, a handle is just a nice general way of
describing input to any info source so that the info can be retrieved
using 'read()' and 'readline()' functions. This is the type of input the
BLAST parser (and most other Biopython parsers) takes.
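Because the parsers only call methods like 'read()' and 'readline()', anything that provides them will do. For instance, an in-memory string can be wrapped so that it behaves like an open file (here using the StringIO class, which on recent Pythons lives in the 'io' module):

```python
from io import StringIO

# Wrap a plain string so it can be used anywhere a file handle is expected.
handle = StringIO(">seq1\nACGT\n")
print(handle.readline().strip())  # >seq1
print(handle.read().strip())      # ACGT
```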
If you followed the code above for interacting with BLAST through a
script, then you already have 'result_handle', the handle to the BLAST
results. For example, using a GI number to do an online search:

<<>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nr", "8332116")
If instead you ran BLAST some other way, and have the BLAST output (in
XML format) in the file 'my_blast.xml', all you need to do is to open
the file for reading:

<<>>> result_handle = open("my_blast.xml")

Now that we've got a handle, we are ready to parse the output. The
code to parse it is really quite small. If you expect a single BLAST
result (i.e. you used a single query):

<<>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

or, if you have lots of results (i.e. multiple query sequences):

<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
Just like 'Bio.SeqIO' and 'Bio.AlignIO' (see Chapters 5 and 6), we
have a pair of input functions, 'read' and 'parse', where 'read' is for
when you have exactly one object, and 'parse' is an iterator for when
you can have lots of objects -- but instead of getting 'SeqRecord' or
'Alignment' objects, we get BLAST record objects.

To be able to handle the situation where the BLAST file may be huge,
containing thousands of results, 'NCBIXML.parse()' returns an iterator.
In plain English, an iterator allows you to step through the BLAST
output, retrieving BLAST records one by one for each BLAST search:
<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
# ... do something with blast_record
>>> blast_record = blast_records.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records
Or, you can use a 'for'-loop:

<<>>> for blast_record in blast_records:
...     # Do something with blast_record
Note though that you can step through the BLAST records only once.
Usually, from each BLAST record you would save the information that you
are interested in. If you want to save all returned BLAST records, you
can convert the iterator into a list:

<<>>> blast_records = list(blast_records)

Now you can access each BLAST record in the list with an index as
usual. If your BLAST file is huge though, you may run into memory
problems trying to save them all in a list.
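This one-pass behaviour is simply how Python iterators work; a small stand-in generator (nothing to do with BLAST itself) shows it:

```python
def fake_blast_records():
    # Stand-in for NCBIXML.parse(): yields one "record" per query.
    for name in ["record1", "record2", "record3"]:
        yield name

records = fake_blast_records()
first = next(records)       # take the first record
remaining = list(records)   # consumes everything that is left
print(first)                # record1
print(remaining)            # ['record2', 'record3']
print(list(records))        # [] -- the iterator is now exhausted
```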
Usually, you'll be running one BLAST search at a time. Then, all you
need to do is to pick up the first (and only) BLAST record in
'blast_records':

<<>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = blast_records.next()

or more elegantly:

<<>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)
I guess by now you're wondering what is in a BLAST record.

7.5 The BLAST record class
*=*=*=*=*=*=*=*=*=*=*=*=*=*
A BLAST Record contains everything you might ever want to extract from
the BLAST output. Right now we'll just show an example of how to get
some info out of the BLAST report, but if you want something in
particular that is not described here, look at the info on the record
class in detail, and take a gander into the code or automatically
generated documentation -- the docstrings have lots of good info about
what is stored in each piece of information.

To continue with our example, let's just print out some summary info
about all hits in our blast report with an e-value better than a
particular threshold. The following code does this:
<<>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print '****Alignment****'
...             print 'sequence:', alignment.title
...             print 'length:', alignment.length
...             print 'e value:', hsp.expect
...             print hsp.query[0:75] + '...'
...             print hsp.match[0:75] + '...'
...             print hsp.sbjct[0:75] + '...'
This will print out summary reports like the following:

sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold
acclimation protein WCOR413-like protein alpha form mRNA, complete cds
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatat
||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | ||||||||
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatct
Basically, you can do anything you want to with the info in the BLAST
report once you have parsed it. This will, of course, depend on what you
want to use it for, but hopefully this helps you get started on doing
what you need to do!
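For instance, the threshold loop above boils down to a nested filter. Here it is restated with small stand-in classes (hypothetical data, not the real 'Bio.Blast.Record' API, though the attribute names follow the tutorial's usage) so the shape of the logic is easy to see:

```python
# Stand-in classes mirroring the alignment/hsp attributes used above.
class Hsp:
    def __init__(self, expect):
        self.expect = expect

class Alignment:
    def __init__(self, title, length, hsps):
        self.title, self.length, self.hsps = title, length, hsps

E_VALUE_THRESH = 0.04
alignments = [
    Alignment("gb|XX000001| example hit", 783, [Hsp(1e-80), Hsp(0.5)]),
    Alignment("gb|XX000002| weak hit", 120, [Hsp(2.1)]),
]

# Keep only the HSPs whose e-value beats the threshold.
hits = [(a.title, h.expect)
        for a in alignments
        for h in a.hsps
        if h.expect < E_VALUE_THRESH]
print(hits)  # [('gb|XX000001| example hit', 1e-80)]
```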
An important consideration for extracting information from a BLAST
report is the type of objects that the information is stored in. In
Biopython, the parsers return 'Record' objects, either 'Blast' or
'PSIBlast' depending on what you are parsing. These objects are defined
in 'Bio.Blast.Record' and are quite complete.
Here are my attempts at UML class diagrams for the 'Blast' and
'PSIBlast' record classes. If you are good at UML and see
mistakes/improvements that can be made, please let me know. The Blast
class diagram is shown in Figure 7.5.

*images/BlastRecord.png*

The PSIBlast record object is similar, but has support for the rounds
that are used in the iteration steps of PSIBlast. The class diagram for
PSIBlast is shown in Figure 7.5.

*images/PSIBlastRecord.png*
7.6 Deprecated BLAST parsers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

Older versions of Biopython had parsers for BLAST output in plain text
or HTML format. Over the years, we discovered that it is very hard to
maintain these parsers in working order. Basically, any small change to
the BLAST output in newly released BLAST versions tends to cause the
plain text and HTML parsers to break. We therefore recommend parsing
BLAST output in XML format, as described in Section 7.4.

The HTML parser in 'Bio.Blast.NCBIWWW' has been officially deprecated
and will issue warnings if you try and use it. We plan to remove this
completely in a few releases' time.

Our plain text BLAST parser works a bit better, but use it at your own
risk. It may or may not work, depending on which BLAST versions or
programs you're using.
7.6.1 Parsing plain-text BLAST output
======================================

The plain text BLAST parser is located in 'Bio.Blast.NCBIStandalone'.

As with the XML parser, we need to have a handle object that we can
pass to the parser. The handle must implement the 'readline()' method
and do this properly. The common ways to get such a handle are to either
use the provided 'blastall' or 'blastpgp' functions to run the local
blast, or to run a local blast via the command line, and then do
something like the following:

<<>>> result_handle = open("my_file_of_blast_output.txt")
Well, now that we've got a handle (which we'll call 'result_handle'),
we are ready to parse it. This can be done with the following code:

<<>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)

This will parse the BLAST report into a Blast Record class (either a
Blast or a PSIBlast record, depending on what you are parsing) so that
you can extract the information from it. In our case, let's just print
out a quick summary of all of the alignments with an e-value below some
threshold:

<<>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print '****Alignment****'
...             print 'sequence:', alignment.title
...             print 'length:', alignment.length
...             print 'e value:', hsp.expect
...             print hsp.query[0:75] + '...'
...             print hsp.match[0:75] + '...'
...             print hsp.sbjct[0:75] + '...'
5037
If you also read the section 7.4 on parsing BLAST XML output, you'll
5038
notice that the above code is identical to what is found in that
5039
section. Once you parse something into a record class you can deal with
5040
it independent of the format of the original BLAST info you were
5041
parsing. Pretty snazzy!
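Because the record class hides the original format, the same summary logic works on records from either the plain-text or the XML parser. Here is a minimal sketch of that idea as a reusable function, exercised on hypothetical stand-in objects (a live BLAST report isn't available in this document; real records come from the parsers above):

```python
E_VALUE_THRESH = 0.04

def summarize(blast_record, thresh=E_VALUE_THRESH):
    """Collect (title, length, expect) for each HSP below the threshold."""
    hits = []
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < thresh:
                hits.append((alignment.title, alignment.length, hsp.expect))
    return hits

# Stand-in classes mimicking just the attributes used above; real
# records come from NCBIStandalone.BlastParser or NCBIXML instead.
class HSP(object):
    def __init__(self, expect):
        self.expect = expect

class Alignment(object):
    def __init__(self, title, length, hsps):
        self.title, self.length, self.hsps = title, length, hsps

class Record(object):
    def __init__(self, alignments):
        self.alignments = alignments

record = Record([Alignment("seqA", 120, [HSP(0.001), HSP(0.5)]),
                 Alignment("seqB", 80, [HSP(0.03)])])
assert summarize(record) == [("seqA", 120, 0.001), ("seqB", 80, 0.03)]
```

The function never asks which parser produced the record, only that it has the 'alignments' and 'hsps' attributes -- which is exactly the point of parsing into a record class.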
Sure, parsing one record is great, but I've got a BLAST file with tons
of records -- how can I parse them all? Well, fear not, the answer lies
in the very next section.
7.6.2 Parsing a plain-text BLAST file full of BLAST runs
=========================================================

We can do this using the blast iterator. To set up an iterator, we
first set up a parser, to parse our blast reports into Blast Record
objects:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()

Then we will assume we have a handle to a bunch of blast records,
which we'll call 'result_handle'. Getting a handle is described in full
detail above in the blast parsing sections.

Now that we've got a parser and a handle, we are ready to set up the
iterator with the following command:

>>> blast_iterator = NCBIStandalone.Iterator(result_handle, blast_parser)

The second option, the parser, is optional. If we don't supply a
parser, then the iterator will just return the raw BLAST reports one at
a time.

Now that we've got an iterator, we start retrieving blast records
(generated by our parser) using 'next()':

>>> blast_record = blast_iterator.next()

Each call to next will return a new record that we can deal with. Now
we can iterate through these records and generate our old favorite, a
nice little blast report:

>>> for blast_record in blast_iterator:
...     E_VALUE_THRESH = 0.04
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_VALUE_THRESH:
...                 print '****Alignment****'
...                 print 'sequence:', alignment.title
...                 print 'length:', alignment.length
...                 print 'e value:', hsp.expect
...                 if len(hsp.query) > 75:
...                     dots = '...'
...                 else:
...                     dots = ''
...                 print hsp.query[0:75] + dots
...                 print hsp.match[0:75] + dots
...                 print hsp.sbjct[0:75] + dots

The iterator allows you to deal with huge blast files without any
memory problems, since things are read in one at a time. I have parsed
tremendously huge files without any problems using this.
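The memory point above is just lazy iteration: read one report's worth of lines, hand it over, and only then read the next. A rough sketch of that splitting idea in plain Python (illustrative only -- 'NCBIStandalone.Iterator' does this for real BLAST output; the 'marker' argument is whatever line starts each report):

```python
import io

def reports(handle, marker="BLASTN"):
    """Lazily yield one raw report at a time from a concatenated file."""
    chunk = []
    for line in handle:
        if line.startswith(marker) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)

# Two toy "reports" in one stream; only one is held in memory at a time.
data = "BLASTN 2.2.18\nhit one\nBLASTN 2.2.18\nhit two\n"
parts = list(reports(io.StringIO(data)))
assert len(parts) == 2
assert parts[1] == "BLASTN 2.2.18\nhit two\n"
```

Since `reports` is a generator, a file with thousands of runs costs no more memory than a file with one.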
7.6.3 Finding a bad record somewhere in a huge plain-text BLAST file
=====================================================================

One really ugly problem that happens to me is that I'll be parsing a
huge blast file for a while, and the parser will bomb out with a
ValueError. This is a serious problem, since you can't tell if the
ValueError is due to a parser problem, or a problem with the BLAST. To
make it even worse, you have no idea where the parse failed, so you
can't just ignore the error, since this could be ignoring an important
data point.

We used to have to make a little script to get around this problem,
but the 'Bio.Blast' module now includes a 'BlastErrorParser' which
really helps make this easier. The 'BlastErrorParser' works very
similarly to the regular 'BlastParser', but it adds an extra layer of
work by catching ValueErrors that are generated by the parser, and
attempting to diagnose the errors.

Let's take a look at using this parser -- first we define the file we
are going to parse and the file to write the problem reports to:

>>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out")
>>> error_file = os.path.join(os.getcwd(), "blast_out",
...                           "big_blast.problems")

Now we want to get a 'BlastErrorParser':

>>> from Bio.Blast import NCBIStandalone
>>> error_handle = open(error_file, "w")
>>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)

Notice that the parser takes an optional argument of a handle. If a
handle is passed, then the parser will write any blast records which
generate a ValueError to this handle. Otherwise, these records will not
be recorded.

Now we can use the 'BlastErrorParser' just like a regular blast
parser. Specifically, we might want to make an iterator that goes
through our blast records one at a time and parses them with the error
parser:

>>> result_handle = open(blast_file)
>>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)

We can read these records one at a time, but now we can catch and deal
with errors that are due to problems with Blast (and not with the parser
itself):

>>> try:
...     next_record = iterator.next()
... except NCBIStandalone.LowQualityBlastError, info:
...     print "LowQualityBlastError detected in id %s" % info[1]

The '.next()' method is normally called indirectly via a 'for'-loop.
Right now the 'BlastErrorParser' can generate the following errors:

- 'ValueError' -- This is the same error generated by the regular
  BlastParser, and is due to the parser not being able to parse a
  specific file. This is normally either due to a bug in the parser, or
  some kind of discrepancy between the version of BLAST you are using
  and the versions the parser is able to handle.

- 'LowQualityBlastError' -- When BLASTing a sequence that is of
  really bad quality (for example, a short sequence that is basically a
  stretch of one nucleotide), it seems that Blast ends up masking out
  the entire sequence and ending up with nothing to parse. In this case
  it will produce a truncated report that causes the parser to generate
  a ValueError. 'LowQualityBlastError' is reported in these cases. This
  error returns an info item with the following information:

  - 'item[0]' -- The error message.
  - 'item[1]' -- The id of the input record that caused the error.
    This is really useful if you want to record all of the records
    that are causing problems.

As mentioned, with each error generated, the BlastErrorParser will
write the offending record to the specified 'error_handle'. You can then
go ahead and look at these and deal with them as you see fit. Either
you will be able to debug the parser with a single blast report, or will
find out problems in your blast runs. Either way, it will definitely be
a useful experience!

Hopefully the 'BlastErrorParser' will make it much easier to debug and
deal with large Blast files.
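The pattern the 'BlastErrorParser' enables -- keep going after a failure and record which input caused it -- is worth spelling out on its own. A minimal sketch with a toy parser standing in for the real one (the names 'parse_chunks' and 'toy_parse' are hypothetical, not Biopython API):

```python
def parse_chunks(chunks, parse, errors):
    """Parse each raw report, recording failures instead of aborting.
    'parse' stands in for a parser's parse method (hypothetical here)."""
    good = []
    for index, chunk in enumerate(chunks):
        try:
            good.append(parse(chunk))
        except ValueError as err:
            errors.append((index, str(err)))
    return good

def toy_parse(text):
    # Pretend truncated reports make the parser raise ValueError.
    if "garbled" in text:
        raise ValueError("truncated report")
    return text.upper()

errs = []
out = parse_chunks(["ok one", "garbled", "ok two"], toy_parse, errs)
assert out == ["OK ONE", "OK TWO"]
assert errs == [(1, "truncated report")]
```

Afterwards 'errs' tells you exactly which records failed and why, which is the information a bare ValueError from a thousand-record run doesn't give you.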
7.7 Dealing with PSI-BLAST
*=*=*=*=*=*=*=*=*=*=*=*=*=*

You can run the standalone version of PSI-BLAST (the command line tool
'blastpgp') using the 'blastpgp' function in the
'Bio.Blast.NCBIStandalone' module. At the time of writing, the NCBI do
not appear to support tools running a PSI-BLAST search via the internet.

Note that the 'Bio.Blast.NCBIXML' parser can read the XML output from
current versions of PSI-BLAST, but information like which sequences in
each iteration are new or reused isn't present in the XML file. If you
care about this information you may have more joy with the plain text
output and the 'PSIBlastParser' in 'Bio.Blast.NCBIStandalone'.

7.8 Dealing with RPS-BLAST
*=*=*=*=*=*=*=*=*=*=*=*=*=*

You can run the standalone version of RPS-BLAST (the command line tool
'rpsblast') using the 'rpsblast' function in the
'Bio.Blast.NCBIStandalone' module. At the time of writing, the NCBI do
not appear to support tools running an RPS-BLAST search via the
internet.

You can use the 'Bio.Blast.NCBIXML' parser to read the XML output from
current versions of RPS-BLAST.

-----------------------------------

(1) examples/m_cold.fasta

(2) http://biopython.org/DIST/docs/tutorial/examples/m_cold.fasta
Chapter 8 Accessing NCBI's Entrez databases
**********************************************

Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system
that provides users access to NCBI's databases such as PubMed, GenBank,
GEO, and many others. You can access Entrez from a web browser to
manually enter queries, or you can use Biopython's 'Bio.Entrez' module
for programmatic access to Entrez. The latter allows you for example to
search PubMed or download GenBank records from within a Python script.

The 'Bio.Entrez' module makes use of the Entrez Programming Utilities
(also known as EUtils), consisting of eight tools that are described in
detail on NCBI's page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each
of these tools corresponds to one Python function in the 'Bio.Entrez'
module, as described in the sections below. This module makes sure that
the correct URL is used for the queries, and that no more than three
requests are made every second, as required by NCBI.

The output returned by the Entrez Programming Utilities is typically
in XML format. To parse such output, you have several options:

1. Use 'Bio.Entrez''s parser to parse the XML output into a Python
   object;
2. Use the DOM (Document Object Model) parser in Python's standard
   library;
3. Use the SAX (Simple API for XML) parser in Python's standard
   library;
4. Read the XML output as raw text, and parse it by string searching
   and manipulation.

For the DOM and SAX parsers, see the Python documentation. The parser
in 'Bio.Entrez' is discussed below.

NCBI uses DTD (Document Type Definition) files to describe the
structure of the information contained in XML files. Most of the DTD
files used by NCBI are included in the Biopython distribution. The
'Bio.Entrez' parser makes use of the DTD files when parsing an XML file
returned by NCBI Entrez.

Occasionally, you may find that the DTD file associated with a
specific XML file is missing from the Biopython distribution. In
particular, this may happen when NCBI updates its DTD files. If this
happens, 'Entrez.read' will give an error message showing which DTD file
is missing. You can download the DTD file from NCBI; most can be found
at http://www.ncbi.nlm.nih.gov/dtd/ or
http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/. After downloading, the
DTD file should be stored in the directory
'...site-packages/Bio/Entrez/DTDs', containing the other DTD files.
Alternatively, if you installed Biopython from source, you can add the
DTD file to the source code's 'Bio/Entrez/DTDs' directory, and reinstall
Biopython. This will install the new DTD file in the correct location
together with the other DTD files.

The Entrez Programming Utilities can also generate output in other
formats, such as the Fasta or GenBank file formats for sequence
databases, or the MedLine format for the literature database, discussed
below.
5274
8.1 Entrez Guidelines
5275
*=*=*=*=*=*=*=*=*=*=*=
5277
Before using Biopython to access the NCBI's online resources (via
5278
'Bio.Entrez' or some of the other modules), please read the NCBI's
5279
Entrez User Requirements (1). If the NCBI finds you are abusing their
5280
systems, they can and will ban your access!
5284
- For any series of more than 100 requests, do this at weekends or
5285
outside USA peak times. This is up to you to obey.
5286
- Use the http://eutils.ncbi.nlm.nih.gov address, not the standard
5287
NCBI Web address. Biopython uses this web address.
5288
- Make no more than three requests every seconds (relaxed from at
5289
most one request every three seconds in early 2009). This is
5290
automatically enforced by Biopython.
5291
- Use the optional email parameter so the NCBI can contact you if
5292
there is a problem. You can either explicitly set the email address
5293
as a parameter with each call to Entrez (e.g., include email =
5294
"A.N.Other@example.com" in the argument list), or as of Biopython
5295
1.48, you can set a global email address:
5296
<<>>> from Bio import Entrez
5297
>>> Entrez.email = "A.N.Other@example.com"
5299
Bio.Entrez will then use this email address with each call to Entrez.
5300
The example.com address is a reserved domain name specifically for
5301
documentation (RFC 2606). Please DO NOT use a random email -- it's
5302
better not to give an email at all.
5303
- If you are using Biopython within some larger software suite, use
5304
the tool parameter to specify this. The tool parameter will default
5306
- For large queries, the NCBI also recommend using their session
5307
history feature (the WebEnv session cookie string, see Section 8.13).
5308
This is only slightly more complicated.
5310
In conclusion, be sensible with your usage levels. If you plan to
5311
download lots of data, consider other options. For example, if you want
5312
easy access to all the human genes, consider fetching each chromosome by
5313
FTP as a GenBank file, and importing these into your own BioSQL database
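The "no more than three requests every second" rule above is simple to enforce on the client side: remember when the last request went out, and sleep if the next one comes too soon. A sketch of that idea (Biopython does this for you internally; the class and interval below are illustrative, not Biopython's own API):

```python
import time

class Throttle(object):
    """Client-side rate limiting sketch (illustrative, not Biopython API)."""

    def __init__(self, min_interval=1.0 / 3):
        self.min_interval = min_interval  # three requests per second
        self.last = None

    def wait(self):
        # Sleep just long enough to keep successive calls apart.
        if self.last is not None:
            delay = self.min_interval - (time.time() - self.last)
            if delay > 0:
                time.sleep(delay)
        self.last = time.time()

# A short interval keeps the demonstration fast.
throttle = Throttle(min_interval=0.05)
start = time.time()
for _ in range(3):
    throttle.wait()
elapsed = time.time() - start
assert elapsed >= 0.1  # at least two enforced gaps between three calls
```

Call 'wait()' immediately before each request and the spacing rule is satisfied no matter how fast the surrounding loop runs.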
8.2 EInfo: Obtaining information about the Entrez databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

EInfo provides field index term counts, last update, and available
links for each of NCBI's databases. In addition, you can use EInfo to
obtain a list of all database names accessible through the Entrez
utilities:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.einfo()
>>> result = handle.read()

The variable 'result' now contains a list of databases in XML format:

<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
<DbName>gap</DbName>
<DbName>domains</DbName>
<DbName>gene</DbName>
<DbName>genomeprj</DbName>
<DbName>gensat</DbName>
<DbName>geo</DbName>
<DbName>gds</DbName>
<DbName>homologene</DbName>
<DbName>journals</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omia</DbName>
<DbName>omim</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>snp</DbName>
<DbName>taxonomy</DbName>
<DbName>toolkit</DbName>
<DbName>unigene</DbName>
<DbName>unists</DbName>
</DbList>
</eInfoResult>

Since this is a fairly simple XML file, we could extract the
information it contains simply by string searching. Using 'Bio.Entrez''s
parser instead, we can directly parse this XML file into a Python
object:

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)

Now 'record' is a dictionary with exactly one key:

>>> record.keys()
[u'DbList']

The value stored in this key is the list of database names shown in
the XML above:

>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
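As noted above, this XML is simple enough to handle with Python's standard library as well (option 2 from the chapter introduction). A sketch using 'xml.etree.ElementTree' on a trimmed copy of the output (DOCTYPE removed and the list shortened for brevity):

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the EInfo output above.
xml_text = """<eInfoResult><DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
</DbList></eInfoResult>"""

root = ET.fromstring(xml_text)
# Iterating over the <DbList> element yields its <DbName> children.
names = [element.text for element in root.find("DbList")]
assert names == ["pubmed", "protein", "nucleotide"]
```

This gives the same list of names as 'record["DbList"]' above, but without the DTD-aware conveniences of the 'Bio.Entrez' parser.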
For each of these databases, we can use EInfo again to obtain more
information:

>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
>>> record["DbInfo"]["LastUpdate"]

Try 'record["DbInfo"].keys()' for other information stored in this
record. One of the most useful is a list of possible search fields for
use with ESearch:

>>> for field in record["DbInfo"]["FieldList"]:
...     print "%(Name)s, %(FullName)s, %(Description)s" % field
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
WORD, Text Word, Free text associated with publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
...

That's a long list, but indirectly this tells you that for the PubMed
database, you can do things like Jones[AUTH] to search the author field,
or Sanger[AFFL] to restrict to authors at the Sanger Centre. This can be
very handy - especially if you are not so familiar with a particular
database.
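Fielded terms like Jones[AUTH] are plain strings, so building them programmatically is trivial. A small helper (hypothetical, not part of Biopython) that composes them in the same shape used by the searches in this chapter:

```python
def fielded(term, field):
    """Restrict a search term to one Entrez field, e.g. Jones[AUTH]."""
    return "%s[%s]" % (term, field)

# Combine fielded terms with AND, as in the GenBank search in Section 8.3:
query = " AND ".join([fielded("Cypripedioideae", "Orgn"),
                      fielded("matK", "Gene")])
assert fielded("Jones", "AUTH") == "Jones[AUTH]"
assert query == "Cypripedioideae[Orgn] AND matK[Gene]"
```

The resulting string is what you would pass as the 'term' argument to 'Bio.Entrez.esearch()'.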
8.3 ESearch: Searching the Entrez databases
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

To search any of these databases, we use 'Bio.Entrez.esearch()'. For
example, let's search in PubMed for publications related to Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861',
 '14630660', '12230038']

In this output, you see seven PubMed IDs (including 19304878 which is
the PMID for the Biopython application note), which can be retrieved by
EFetch (see section 8.6).

You can also use ESearch to search GenBank. Here we'll do a quick
search for the matK gene in Cypripedioideae orchids (see Section 8.2
about EInfo for one way to find out which fields you can search in each
Entrez database):

>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']

Each of the IDs (126789333, 37222967, 37222966, ...) is a GenBank
identifier. See section 8.6 for information on how to actually download
these GenBank records.

Note that instead of a species name like Cypripedioideae[Orgn], you
can restrict the search using an NCBI taxon identifier, here this would
be txid158330[Orgn]. This isn't currently documented on the ESearch help
page - the NCBI explained this in reply to an email query. You can often
deduce the search term formatting by playing with the Entrez web
interface. For example, including complete[prop] in a genome search
restricts to just completed genomes.

As a final example, let's get a list of computational journal titles:

>>> handle = Entrez.esearch(db="journals", term="computational")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['30367', '33843', '33823', '32989', '33190', '33009', '31986',
 '34502', '8799', '22857', '32675', '20258', '33859', '32534', ...]

Again, we could use EFetch to obtain more information for each of
these journal IDs.

ESearch has many useful options --- see the ESearch help page (2) for
more information.
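A search can return many more IDs than you want to download in one go, and EFetch takes its IDs as a single comma-separated string. A small helper (hypothetical, not part of Biopython) that splits a long ID list into comma-joined batches, one per EFetch call:

```python
def batches(id_list, size):
    """Split a long ID list into comma-joined batches, one per EFetch
    call. Batching like this is kinder to the NCBI servers; the batch
    size is an arbitrary choice."""
    for start in range(0, len(id_list), size):
        yield ",".join(id_list[start:start + size])

# The seven PubMed IDs from the biopython search above:
ids = ["19304878", "18606172", "16403221", "16377612", "14871861",
       "14630660", "12230038"]
assert list(batches(ids, 3)) == ["19304878,18606172,16403221",
                                 "16377612,14871861,14630660",
                                 "12230038"]
```

Each yielded string is ready to use as the 'id' argument of 'Bio.Entrez.efetch()' (though for genuinely long lists, EPost and the history feature in the next sections are the recommended route).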
8.4 EPost: Uploading a list of identifiers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EPost uploads a list of UIs for use in subsequent search strategies;
see the EPost help page (3) for more information. It is available from
Biopython through the 'Bio.Entrez.epost()' function.

To give an example of when this is useful, suppose you have a long
list of IDs you want to download using EFetch (maybe sequences, maybe
citations -- anything). When you make a request with EFetch your list of
IDs, the database etc, are all turned into a long URL sent to the
server. If your list of IDs is long, this URL gets long, and long URLs
can break (e.g. some proxies don't cope well).

Instead, you can break this up into two steps, first uploading the
list of IDs using EPost (this uses an "HTTP post" internally, rather
than an "HTTP get", getting round the long URL problem). With the
history support, you can then refer to this long list of IDs, and
download the associated data with EFetch.

Let's look at a simple example to see how EPost works -- uploading
some PubMed identifiers:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> id_list = ["19304878", "18606172", "16403221", "16377612",
...            "14871861", "14630660"]
>>> print Entrez.epost("pubmed", id=",".join(id_list)).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
<QueryKey>1</QueryKey>
<WebEnv>NCID_01_206841095_130.14.22.101_9001_1242061629</WebEnv>
</ePostResult>

The returned XML includes two important strings, 'QueryKey' and
'WebEnv' which together define your history session. You would extract
these values for use with another Entrez call such as EFetch:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
id_list = ["19304878", "18606172", "16403221", "16377612", "14871861",
           "14630660"]
search_results = Entrez.read(Entrez.epost("pubmed", id=",".join(id_list)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

Section 8.13 shows how to use the history feature.
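Once 'webenv' and 'query_key' are extracted, later EUtils calls refer to the uploaded list instead of resending every ID. A sketch of how those values plug into a follow-up call (the 'history_args' helper is hypothetical, and no request is made here; Section 8.13 shows the real usage):

```python
def history_args(webenv, query_key, retstart=0, retmax=100):
    """Keyword arguments for a follow-up call such as
    Entrez.efetch(db="pubmed", rettype="medline", retmode="text", **args).
    'retstart' and 'retmax' control paging through the uploaded list."""
    return {"webenv": webenv, "query_key": query_key,
            "retstart": retstart, "retmax": retmax}

# Values as returned by the EPost example above:
args = history_args("NCID_01_206841095_130.14.22.101_9001_1242061629", "1")
assert args["query_key"] == "1"
assert args["retmax"] == 100
```

However many IDs were uploaded, the follow-up request stays the same small size -- which is the whole point of EPost.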
8.5 ESummary: Retrieving summaries from primary IDs
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ESummary retrieves document summaries from a list of primary IDs (see
the ESummary help page (4) for more information). In Biopython, ESummary
is available as 'Bio.Entrez.esummary()'. Using the search result above,
we can for example find out more about the journal with ID 30367:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esummary(db="journals", id="30367")
>>> record = Entrez.read(handle)
>>> record[0]["Title"]
'Computational biology and chemistry'
>>> record[0]["Publisher"]
8.6 EFetch: Downloading full records from Entrez
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EFetch is what you use when you want to retrieve a full record from
Entrez. This covers several possible databases, as described on the main
EFetch Help page (5).

From the Cypripedioideae example above, we can download GenBank record
186972394 using 'Bio.Entrez.efetch' (see the documentation on EFetch for
Sequence and other Molecular Biology Databases (6)):

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
>>> print handle.read()
LOCUS       EU490707     1302 bp    DNA     linear   PLN
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial
            cds; chloroplast.
VERSION     EU490707.1  GI:186972394
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
            Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida;
            Asparagales; Orchidaceae; Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Phylogenetic utility of ycf1 in orchids
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-FEB-2008) Department of Botany, University of
            Florida, 220 Bartram Hall, Gainesville, FL 32611-8526, USA
FEATURES             Location/Qualifiers
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
                     /product="maturase K"
                     /protein_id="ACC99456.1"
                     /db_xref="GI:186972395"
                     /translation="IFYEPVEIFGYDNKSSLVLVKRLITRMYQQNFLISSVNDSNQKG
                     FWGHKHFFSSHFSSQMVSEGFGVILEIPFSSQLVSSLEEKKIPKYQNLRSIHSIFPFL
                     EDKFLHLNYVSDLLIPHPIHLEILVQILQCRIKDVPSLHLLRLLFHEYHNLNSLITSK
                     KFIYAFSKRKKRFLWLLYNSYVYECEYLFQFLRKQSSYLRSTSSGVFLERTHLYVKIE
                     HLLVVCCNSFQRILCFLKDPFMHYVRYQGKAILASKGTLILMKKWKFHLVNFWQSYFH
                     FWSQPYRIHIKQLSNYSFSFLGYFSSVLENHLVVRNQMLENSFIINLLTKKFDTIAPV
                     ISLIGSLSKAQFCTVLGHPISKPIWTDFSDSDILDRFCRICRNLCRYHSGSSKKQVLY
                     RIKYILRLSCARTLARKHKSTVRTFMRRLGSGLLEEFFMEEE"
ORIGIN
        1 attttttacg aacctgtgga aatttttggt tatgacaata aatctagttt
       61 aaacgtttaa ttactcgaat gtatcaacag aattttttga tttcttcggt
      121 aaccaaaaag gattttgggg gcacaagcat tttttttctt ctcatttttc
      181 gtatcagaag gttttggagt cattctggaa attccattct cgtcgcaatt
      241 cttgaagaaa aaaaaatacc aaaatatcag aatttacgat ctattcattc
      301 tttttagaag acaaattttt acatttgaat tatgtgtcag atctactaat
      361 atccatctgg aaatcttggt tcaaatcctt caatgccgga tcaaggatgt
      421 catttattgc gattgctttt ccacgaatat cataatttga atagtctcat
      481 aaattcattt acgccttttc aaaaagaaag aaaagattcc tttggttact
      541 tatgtatatg aatgcgaata tctattccag tttcttcgta aacagtcttc
      601 tcaacatctt ctggagtctt tcttgagcga acacatttat atgtaaaaat
      661 ctagtagtgt gttgtaattc ttttcagagg atcctatgct ttctcaagga
      721 cattatgttc gatatcaagg aaaagcaatt ctggcttcaa agggaactct
      781 aagaaatgga aatttcatct tgtgaatttt tggcaatctt attttcactt
      841 ccgtatagga ttcatataaa gcaattatcc aactattcct tctcttttct
      901 tcaagtgtac tagaaaatca tttggtagta agaaatcaaa tgctagagaa
      961 ataaatcttc tgactaagaa attcgatacc atagccccag ttatttctct
     1021 ttgtcgaaag ctcaattttg tactgtattg ggtcatccta ttagtaaacc
     1081 gatttctcgg attctgatat tcttgatcga ttttgccgga tatgtagaaa
     1141 tatcacagcg gatcctcaaa aaaacaggtt ttgtatcgta taaaatatat
     1201 tcgtgtgcta gaactttggc acggaaacat aaaagtacag tacgcacttt
     1261 ttaggttcgg gattattaga agaattcttt atggaagaag aa
//

The argument 'rettype="gb"' lets us download this record in the
GenBank format. Note that until Easter 2009, the Entrez EFetch API let
you use "genbank" as the return type, however the NCBI now insist on
using the official return types of "gb" or "gbwithparts" (or "gp" for
proteins) as described online.

Alternatively, you could for example use 'rettype="fasta"' to get the
Fasta-format; see the EFetch Sequences Help page (7) for other options.
The available formats depend on which database you are downloading from
- see the main EFetch Help page (8).
If you fetch the record in one of the formats accepted by 'Bio.SeqIO'
(see Chapter 5), you could directly parse it into a 'SeqRecord':

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
...
Description: Selenipedium aequinoctiale maturase K (matK) gene,
partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA',
IUPACAmbiguousDNA())
Note that a more typical use would be to save the sequence data to a
local file, and then parse it with 'Bio.SeqIO'. This can save you having
to re-download the same file repeatedly while working on your script,
and in particular places less load on the NCBI's servers. For example:

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename):
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()

record = SeqIO.read(open(filename), "genbank")
To get the output in XML format, which you can parse using the
'Bio.Entrez.read()' function, use 'retmode="xml"':

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", retmode="xml")
>>> record = Entrez.read(handle)
>>> record[0]["GBSeq_definition"]
'Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Selenipedium aequinoctiale'

If you want to perform a search with 'Bio.Entrez.esearch()', and then
download the records with 'Bio.Entrez.efetch()', you should use the
WebEnv history feature -- see Section 8.13.
8.7 ELink: Searching for related items in NCBI Entrez
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ELink, available from Biopython as 'Bio.Entrez.elink()', can be used
to find related items in the NCBI Entrez databases. For example, let's
try to find articles related to the Biopython application note published
in Bioinformatics in 2009. The PubMed ID of this article is 19304878.
Now we use 'Bio.Entrez.elink' to find all items related to this article:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "19304878"
>>> handle = Entrez.elink(dbfrom="pubmed", id=pmid)
>>> record = Entrez.read(handle)

The 'record' variable consists of a Python list, one item for each
database in which we searched. Since we specified only one PubMed ID to
search for, 'record' contains only one item. This item is a dictionary
containing information about our search term, as well as all the related
items that were found:

>>> record[0]["DbFrom"]
'pubmed'
>>> record[0]["IdList"]
['19304878']

The '"LinkSetDb"' key contains the search results, stored as a list
consisting of one item for each target database. In our search results,
we only find hits in the PubMed database (although sub-divided into
categories):

>>> len(record[0]["LinkSetDb"])
5
>>> for linksetdb in record[0]["LinkSetDb"]:
...     print linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"])
...
pubmed pubmed_pubmed 110
pubmed pubmed_pubmed_combined 6
pubmed pubmed_pubmed_five 6
pubmed pubmed_pubmed_reviews 5
pubmed pubmed_pubmed_reviews_five 5

The actual search results are stored under the '"Link"' key. In
total, 110 items were found under standard search. Let's now look at the
first search result:

>>> record[0]["LinkSetDb"][0]["Link"][0]

This is the article we searched for, which doesn't help us much, so
let's look at the second search result:

>>> record[0]["LinkSetDb"][0]["Link"][1]

This paper, with PubMed ID 17316423, is about the Biopython PDB
module.

We can use a loop to print out all PubMed IDs:

>>> for link in record[0]["LinkSetDb"][0]["Link"]:
...     print link["Id"]

For help on ELink, see the ELink help page (9).
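Once parsed, the nested LinkSetDb structure is ordinary Python lists and dictionaries, so it can be reshaped with ordinary code -- for example, grouping the related IDs by link name. The data below is a stand-in shaped like the parsed record above (the review ID is made up for illustration):

```python
# Stand-in data shaped like record[0]["LinkSetDb"] above.
linksets = [{"DbTo": "pubmed", "LinkName": "pubmed_pubmed",
             "Link": [{"Id": "19304878"}, {"Id": "17316423"}]},
            {"DbTo": "pubmed", "LinkName": "pubmed_pubmed_reviews",
             "Link": [{"Id": "11111111"}]}]

# Build {link_name: [ids]} so each category can be looked up directly.
related = dict((linkset["LinkName"],
                [link["Id"] for link in linkset["Link"]])
               for linkset in linksets)
assert related["pubmed_pubmed"] == ["19304878", "17316423"]
assert len(related["pubmed_pubmed_reviews"]) == 1
```

The same reshaping works on the real 'record[0]["LinkSetDb"]' returned by 'Entrez.read'.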
8.8 EGQuery: Obtaining counts for search terms
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

EGQuery provides counts for a search term in each of the Entrez
databases. This is particularly useful to find out how many items your
search terms would find in each database without actually performing
lots of separate searches with ESearch (see the example in 8.12.2
below).

In this example, we use 'Bio.Entrez.egquery()' to obtain the counts
for "biopython":

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]: print row["DbName"], row["Count"]

See the EGQuery help page (10) for more information.

8.9 ESpell: Obtaining spelling suggestions
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

ESpell retrieves spelling suggestions. In this example, we use
'Bio.Entrez.espell()' to obtain the correct spelling of Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.espell(term="biopythooon")
>>> record = Entrez.read(handle)
>>> record["CorrectedQuery"]
'biopython'

See the ESpell help page (11) for more information.
5879
8.10 Specialized parsers
5880
*=*=*=*=*=*=*=*=*=*=*=*=*
The 'Bio.Entrez.read()' function can parse most (if not all) of the
XML output returned by Entrez. Entrez typically allows you to retrieve
records in other formats, which may have some advantages compared to the
XML format in terms of readability (or download size).
To request a specific file format from Entrez using
'Bio.Entrez.efetch()', you need to specify the 'rettype' and/or
'retmode' optional arguments. The different combinations are described
for each database type on the NCBI efetch webpage (12).
One obvious case is that you may prefer to download sequences in the
FASTA or GenBank/GenPept plain text formats (which can then be parsed
with 'Bio.SeqIO', see Sections 5.2.1 and 8.6). For the literature
databases, Biopython contains a parser for the 'MEDLINE' format used in
PubMed.
8.10.1 Parsing Medline records
===============================
You can find the Medline parser in 'Bio.Medline'. Suppose we want to
parse the file 'pubmed_result1.txt', containing one Medline record. You
can find this file in Biopython's 'Tests/Medline' directory. The file
looks like this:
<<IS  - 1467-5463 (Print)
TI  - The Bio* toolkits--a brief overview.
AB  - Bioinformatics research is often difficult to do with commercial
      software. The Open Source BioPerl, BioPython and Biojava projects
      provide ...
We first open the file and then parse it:
<<>>> from Bio import Medline
>>> input = open("pubmed_result1.txt")
>>> record = Medline.read(input)
The 'record' now contains the Medline record as a Python dictionary:
<<>>> record["PMID"]
'12230038'
>>> record["AB"]
'Bioinformatics research is often difficult to do with commercial
software. The Open Source BioPerl, BioPython and Biojava projects
provide toolkits with multiple functionality that make it easier to
create customised pipelines or analysis. This review briefly compares
the quirks of the underlying languages and the functionality,
documentation, utility and relative advantages of the Bio counterparts,
particularly from the point of view of the beginning biologist
programmer.'
The key names used in a Medline record can be rather obscure; use
'>>> help(record)' for a brief summary.
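To see why 'Bio.Medline' is convenient, it helps to know what the flat-file layout looks like: each line starts with a short key, a dash, and a value, with continuation lines indented. The following is only an illustrative stdlib sketch of that layout ('Bio.Medline' handles the real format, including repeated keys, far more carefully):

```python
# Illustrative only: a tiny parser for the "KEY - value" layout of a
# Medline flat-file record. Keys occupy the first four columns,
# followed by "- "; continuation lines are indented with spaces.
def parse_medline_fragment(text):
    record = {}
    key = None
    for line in text.splitlines():
        if len(line) > 5 and line[4:6] == "- ":
            key = line[:4].strip()
            record[key] = line[6:].strip()
        elif key and line.startswith("      "):
            # Continuation of the previous field (e.g. a long abstract).
            record[key] += " " + line.strip()
    return record

fragment = """PMID- 12230038
TI  - The Bio* toolkits--a brief overview.
AB  - Bioinformatics research is often difficult to do with
      commercial software."""
rec = parse_medline_fragment(fragment)
print(rec["TI"])
```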
To parse a file containing multiple Medline records, you can use the
'parse' function instead:
<<>>> from Bio import Medline
>>> input = open("pubmed_result2.txt")
>>> records = Medline.parse(input)
>>> for record in records:
...     print record["TI"]
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale
genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
Instead of parsing Medline records stored in files, you can also parse
Medline records downloaded by 'Bio.Entrez.efetch'. For example, let's
look at all Medline records in PubMed related to Biopython:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861',
'14630660', '12230038']
We now use 'Bio.Entrez.efetch' to download these Medline records:
<<>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
Here, we specify 'rettype="medline", retmode="text"' to obtain the
Medline records in plain-text Medline format. Now we use 'Bio.Medline'
to parse these records:
<<>>> from Bio import Medline
>>> records = Medline.parse(handle)
>>> for record in records:
...     print record["AU"]
['Cock PJ', 'Antao T', 'Chang JT', 'Chapman BA', 'Cox CJ', 'Dalke A',
...]
['Munteanu CR', 'Gonzalez-Diaz H', 'Magalhaes AL']
['Casbon JA', 'Crooks GE', 'Saqi MA']
['Pritchard L', 'White JA', 'Birch PR', 'Toth IK']
['de Hoon MJ', 'Imoto S', 'Nolan J', 'Miyano S']
['Hamelryck T', 'Manderick B']
For comparison, here we show an example using the XML format:
<<>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="xml")
>>> records = Entrez.read(handle)
>>> for record in records:
...     print record["MedlineCitation"]["Article"]["ArticleTitle"]
Biopython: freely available Python tools for computational molecular
biology and bioinformatics.
Enzymes/non-enzymes classification model complexity based on
composition, sequence, 3D and topological indices.
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale
genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
The Bio* toolkits--a brief overview.
Note that in both of these examples, for simplicity we have naively
combined ESearch and EFetch. In this situation, the NCBI would expect
you to use their history feature, as illustrated in Section 8.13.
8.10.2 Parsing GEO records
===========================
GEO (Gene Expression Omnibus (13)) is a data repository of
high-throughput gene expression and hybridization array data. The
'Bio.Geo' module can be used to parse GEO-formatted data.
The following code fragment shows how to parse the example GEO file
'GSE16.txt' into a record and print the record:
<<>>> from Bio import Geo
>>> handle = open("GSE16.txt")
>>> records = Geo.parse(handle)
>>> for record in records:
...     print record
You can search the "gds" database (GEO datasets) with ESearch:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="gds", term="GSE16")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['200000016', '100000028']
From the Entrez website, UID "200000016" is GDS16 while the other hit
"100000028" is for the associated platform, GPL28. Unfortunately, at the
time of writing the NCBI don't seem to support downloading GEO files
using Entrez (not as XML, nor in the Simple Omnibus Format in Text
(SOFT) format).
However, it is actually pretty straightforward to download the GEO
files by FTP from ftp://ftp.ncbi.nih.gov/pub/geo/ instead. In this case
you might want
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE16/GSE16_family.soft.gz
(a compressed file, see the Python module gzip).
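Since the GEO "family" files on the FTP site are gzip-compressed, Python's 'gzip' module can read them without decompressing to disk first. Here is a self-contained sketch that round-trips some SOFT-style sample text through an in-memory buffer instead of a real downloaded file:

```python
import gzip
import io

# Sample SOFT-style content, stands in for a downloaded GEO file.
data = b"^SERIES = GSE16\n"

# Compress it into an in-memory buffer, as if it were a .gz file.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)

# Reading it back is exactly how you'd read a real GSE16_family.soft.gz.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    text = f.read().decode()
print(text)
```

With a real file on disk you would simply pass the filename to 'gzip.open()' and hand the resulting handle to 'Geo.parse()'.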
8.10.3 Parsing UniGene records
===============================
UniGene is an NCBI database of the transcriptome, with each UniGene
record showing the set of transcripts that are associated with a
particular gene in a specific organism. A typical UniGene record looks
like this:
<<TITLE N-acetyltransferase 2 (arylamine N-acetyltransferase)
EXPRESS bone| connective tissue| intestine| liver| liver tumor|
normal| soft tissue/muscle tissue tumor| adult
STS ACC=PMC310725P3 UNISTS=272646
STS ACC=WIAF-2120 UNISTS=44576
STS ACC=G59899 UNISTS=137181
STS ACC=GDB:187676 UNISTS=155563
PROTSIM ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55;
PROTSIM ORG=9796; PROTGI=149742490; PROTID=XP_001487907.1;
PROTSIM ORG=9986; PROTGI=126722851; PROTID=NP_001075655.1;
PROTSIM ORG=9598; PROTGI=114619004; PROTID=XP_519631.2; PCT=98.28;
SEQUENCE ACC=BC067218.1; NID=g45501306; PID=g45501307; SEQTYPE=mRNA
SEQUENCE ACC=NM_000015.2; NID=g116295259; PID=g116295260;
SEQUENCE ACC=D90042.1; NID=g219415; PID=g219416; SEQTYPE=mRNA
SEQUENCE ACC=D90040.1; NID=g219411; PID=g219412; SEQTYPE=mRNA
SEQUENCE ACC=BC015878.1; NID=g16198419; PID=g16198420; SEQTYPE=mRNA
SEQUENCE ACC=CR407631.1; NID=g47115198; PID=g47115199; SEQTYPE=mRNA
SEQUENCE ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596;
END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214
SEQUENCE ACC=AU099534.1; NID=g13550663; CLONE=HSI08034; END=5';
LID=8800; SEQTYPE=EST
This particular record shows the set of transcripts (shown in the
'SEQUENCE' lines) that originate from the human gene NAT2, encoding an
N-acetyltransferase. The 'PROTSIM' lines show proteins with significant
similarity to NAT2, whereas the 'STS' lines show the corresponding
sequence-tagged sites in the genome.
To parse UniGene files, use the 'Bio.UniGene' module:
<<>>> from Bio import UniGene
>>> input = open("myunigenefile.data")
>>> record = UniGene.read(input)
The 'record' returned by 'UniGene.read' is a Python object with
attributes corresponding to the fields in the UniGene record. For
example, the attribute corresponding to the TITLE line contains:
<<"N-acetyltransferase 2 (arylamine N-acetyltransferase)"
The 'EXPRESS' and 'RESTR_EXPR' lines are stored as Python lists of
strings, for example:
<<['bone', 'connective tissue', 'intestine', 'liver', 'liver tumor',
'normal', 'soft tissue/muscle tissue tumor', 'adult']
Specialized objects are returned for the 'STS', 'PROTSIM', and
'SEQUENCE' lines, storing the keys shown in each line as attributes:
<<>>> record.sts[0].acc
'PMC310725P3'
>>> record.sts[0].unists
'272646'
and similarly for the 'PROTSIM' and 'SEQUENCE' lines.
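The attribute-per-key idea is easy to see offline: the 'STS', 'PROTSIM' and 'SEQUENCE' lines are all made of "KEY=value" pairs separated by semicolons or spaces. This is only an illustrative stdlib sketch of that line format, not the actual 'Bio.UniGene' implementation:

```python
# Illustrative only: split a UniGene field line into its label and a
# dictionary of lower-cased KEY=value pairs, mirroring how Bio.UniGene
# exposes them as object attributes.
def parse_fields(line):
    label, rest = line.split(None, 1)
    fields = {}
    for part in rest.replace(";", " ").split():
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.lower()] = value
    return label, fields

label, fields = parse_fields("STS ACC=PMC310725P3 UNISTS=272646")
print(label, fields["acc"], fields["unists"])
```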
To parse a file containing more than one UniGene record, use the
'parse' function in 'Bio.UniGene':
<<>>> from Bio import UniGene
>>> input = open("unigenerecords.data")
>>> records = UniGene.parse(input)
>>> for record in records:
...     print record.ID
8.11 Using a proxy
*=*=*=*=*=*=*=*=*=*
Normally you won't have to worry about using a proxy, but if this is
an issue on your network, here is how to deal with it. Internally,
'Bio.Entrez' uses the standard Python library 'urllib' for accessing the
NCBI servers. This will check an environment variable called
'http_proxy' to configure any simple proxy automatically. Unfortunately
this module does not support the use of proxies which require
authentication.
You may choose to set the 'http_proxy' environment variable once (how
you do this will depend on your operating system). Alternatively you can
set this within Python at the start of your script, for example:
<<import os
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"
See the urllib documentation (14) for more details.
8.12 Examples
*=*=*=*=*=*=*=
8.12.1 PubMed and Medline
==========================
If you are in the medical field or interested in human issues (and
many times even if you are not!), PubMed
(http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all
kinds of goodies. So, like other things, we'd like to be able to grab
information from it and use it in Python scripts.
In this example, we will query PubMed for all articles having to do
with orchids (see Section 2.3 for our motivation). We first check how
many such articles there are:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="orchid")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="pubmed":
...         print row["Count"]
463
Now we use the 'Bio.Entrez.esearch' function to download the PubMed
IDs of these 463 articles:
<<>>> handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
>>> record = Entrez.read(handle)
>>> idlist = record["IdList"]
This returns a Python list containing all of the PubMed IDs of
articles related to orchids:
<<['18680603', '18665331', '18661158', '18627489', '18627452',
'18594007', '18591784', '18589523', '18579475', '18575811',
...]
Now that we've got them, we obviously want to get the corresponding
Medline records and extract the information from them. Here, we'll
download the Medline records in the Medline flat-file format, and use
the 'Bio.Medline' module to parse them:
<<>>> from Bio import Medline
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",
retmode="text")
>>> records = Medline.parse(handle)
NOTE - We've just done a separate search and fetch here; the NCBI much
prefer you to take advantage of their history support in this situation.
Keep in mind that 'records' is an iterator, so you can iterate through
the records only once. If you want to save the records, you can convert
them to a list:
<<>>> records = list(records)
Let's now iterate over the records to print out some information about
each of them:
<<>>> for record in records:
...     print "title:", record["TI"]
...     if "AU" in record:
...         print "authors:", record["AU"]
...     print "source:", record["SO"]
The output for this looks like:
<<title: Sex pheromone mimicry in the early spider orchid (ophrys
sphegodes): patterns of hydrocarbons as the key mechanism for
pollination by deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C',
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is
returned as a standard Python list. This makes it easy to manipulate and
search using standard Python tools. For instance, we could loop through
a whole bunch of entries searching for a particular author with code
like:
<<>>> search_author = "Waits T"
>>> for record in records:
...     if not "AU" in record:
...         continue
...     if search_author in record["AU"]:
...         print "Author %s found: %s" % (search_author, record["SO"])
Hopefully this section gave you an idea of the power and flexibility
of the Entrez and Medline interfaces and how they can be used together.
8.12.2 Searching, downloading, and parsing Entrez Nucleotide records
=====================================================================
Here we'll show a simple example of performing a remote Entrez query.
In Section 2.3 of the parsing examples, we talked about using NCBI's
Entrez website to search the NCBI nucleotide databases for info on
Cypripedioideae, our friends the lady slipper orchids. Now, we'll look
at how to automate that process using a Python script. In this example,
we'll just show how to connect, get the results, and parse them, with
the Entrez module doing all of the work.
First, we use EGQuery to find out the number of results we will get
before actually downloading them. EGQuery will tell us how many search
results were found in each of the databases, but for this example we are
only interested in nucleotides:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="nuccore":
...         print row["Count"]
814
So, we expect to find 814 Entrez Nucleotide records (this is the
number I obtained in 2008; it is likely to increase in the future). If
you find some ridiculously high number of hits, you may want to
reconsider if you really want to download all of them, which is our next
step:
<<>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae",
retmax=814)
>>> record = Entrez.read(handle)
Here, 'record' is a Python dictionary containing the search results
and some auxiliary information. Just for information, let's look at what
is stored in this dictionary:
<<>>> print record.keys()
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart',
u'QueryTranslation']
First, let's check how many results were found:
<<>>> print record["Count"]
814
which is the number we expected. The 814 results are stored in
'record["IdList"]':
<<>>> print len(record["IdList"])
814
Let's look at the first five results:
<<>>> print record["IdList"][:5]
['187237168', '187372713', '187372690', '187372688', '187372686']
We can download these records using 'efetch'. While you could
download these records one by one, to reduce the load on NCBI's servers,
it is better to fetch a bunch of records at the same time, as shown
below. However, in this situation you should ideally be using the
history feature described later in Section 8.13.
<<>>> idlist = ",".join(record["IdList"][:5])
>>> print idlist
187237168,187372713,187372690,187372688,187372686
>>> handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
>>> records = Entrez.read(handle)
>>> print len(records)
5
Each of these records corresponds to one GenBank record.
<<>>> print records[0].keys()
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
u'GBSeq_primary-accession', u'GBSeq_definition',
u'GBSeq_accession-version', u'GBSeq_topology', u'GBSeq_length',
u'GBSeq_feature-table', u'GBSeq_create-date', u'GBSeq_other-seqids',
u'GBSeq_division', u'GBSeq_taxonomy', u'GBSeq_references',
u'GBSeq_update-date', u'GBSeq_organism', u'GBSeq_locus',
u'GBSeq_strandedness']
>>> print records[0]["GBSeq_primary-accession"]
DQ110336
>>> print records[0]["GBSeq_other-seqids"]
['gb|DQ110336.1|', 'gi|187237168']
>>> print records[0]["GBSeq_definition"]
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, ...
>>> print records[0]["GBSeq_organism"]
Cypripedium calceolus
You could use this to quickly set up searches -- but for heavy usage,
see Section 8.13.
8.12.3 Searching, downloading, and parsing GenBank records
===========================================================
The GenBank record format is a very popular method of holding
information about sequences, sequence features, and other associated
sequence information. The format is a good way to get information from
the NCBI databases at http://www.ncbi.nlm.nih.gov/.
In this example we'll show how to query the NCBI databases, retrieve
the records from the query, and then parse them using 'Bio.SeqIO' --
something touched on in Section 5.2.1. For simplicity, this example does
not take advantage of the WebEnv history feature -- see Section 8.13 for
this.
First, we want to make a query and find out the ids of the records to
retrieve. Here we'll do a quick search for one of our favorite
organisms, Opuntia (prickly-pear cacti). We can do a quick search and
get back the GIs (GenBank identifiers) for all of the corresponding
records. First we check how many records there are:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"]=="nuccore":
...         print row["Count"]
9
Now we download the list of GenBank identifiers:
<<>>> handle = Entrez.esearch(db="nuccore", term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> gi_list = record["IdList"]
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289',
'6273285', '6273284']
Now we use these GIs to download the GenBank records -- note that you
have to supply a comma-separated list of GI numbers to Entrez:
<<>>> gi_str = ",".join(gi_list)
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")
If you want to look at the raw GenBank files, you can read from this
handle and print out the result:
<<>>> text = handle.read()
>>> print text
LOCUS       AY851612       892 bp    DNA    linear    PLN
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
VERSION     AY851612.1  GI:57240072
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core
            Caryophyllales; Cactaceae; Opuntioideae;
            Austrocylindropuntia.
REFERENCE   1  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
...
In this case, we are just getting the raw records. To get the records
in a more Python-friendly form, we can use 'Bio.SeqIO' to parse the
GenBank data into 'SeqRecord' objects, including 'SeqFeature' objects:
<<>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb")
>>> records = SeqIO.parse(handle, "gb")
We can now step through the records and look at the information we are
interested in:
<<>>> for record in records:
...     print "%s, length %i, with %i features" \
...           % (record.name, len(record), len(record.features))
AY851612, length 892, with 3 features
AY851611, length 881, with 3 features
AF191661, length 895, with 3 features
AF191665, length 902, with 3 features
AF191664, length 899, with 3 features
AF191663, length 899, with 3 features
AF191660, length 893, with 3 features
AF191659, length 894, with 3 features
AF191658, length 896, with 3 features
Using this automated query retrieval functionality is a big plus over
doing things by hand. Although the module should obey the NCBI's max
three queries per second rule, the NCBI have other recommendations like
avoiding peak hours. See Section 8.1. In particular, please note that
for simplicity, this example does not use the WebEnv history feature;
you should use this for any non-trivial search and download work (see
Section 8.13).
Finally, if you plan to repeat your analysis, rather than downloading
the files from the NCBI and parsing them immediately (as shown in this
example), you should just download the records once and save them to
your hard disk, and then parse the local file.
8.12.4 Finding the lineage of an organism
==========================================
Staying with a plant example, let's now find the lineage of the
Cypripedioideae orchid family. First, we search the Taxonomy database
for Cypripedioideae, which yields exactly one NCBI taxonomy identifier:
<<>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'
Now, we use 'efetch' to download this entry in the Taxonomy database,
and then parse it:
<<>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml")
>>> records = Entrez.read(handle)
Again, this record stores lots of information:
<<>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
u'MitoGeneticCode', u'UpdateDate']
We can get the lineage directly from this record:
<<>>> records[0]["Lineage"]
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta;
Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta;
Liliopsida; Asparagales; Orchidaceae'
The record data contains much more than just the information shown
here -- for example look under "LineageEx" instead of "Lineage" and
you'll get the NCBI taxon identifiers of the lineage entries too.
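If all you need are the taxon names, note that the "Lineage" value is a single semicolon-separated string, so you can split it into a list yourself (the string below is a shortened stand-in for the real lineage above):

```python
# The Lineage field is one long "; "-separated string; split it to
# get a clean list of taxon names.
lineage = ("cellular organisms; Eukaryota; Viridiplantae; "
           "Streptophyta; Embryophyta")
taxa = [name.strip() for name in lineage.split(";")]
print(taxa)
```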
8.13 Using the history and WebEnv
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Often you will want to make a series of linked queries. Most
typically, this means running a search, perhaps refining the search, and
then retrieving detailed search results. You can do this by making a
series of separate calls to Entrez. However, the NCBI prefer you to take
advantage of their history support -- for example combining ESearch and
EFetch.
Another typical use of the history support would be to combine EPost
and EFetch. You use EPost to upload a list of identifiers, which starts
a new history session. You then download the records with EFetch by
referring to the session (instead of the identifiers).
8.13.1 Searching for and downloading sequences using the history
=================================================================
Suppose we want to search and download all the Opuntia rpl16
nucleotide sequences, and store them in a FASTA file. As shown in
Section 8.12.3, we can naively combine 'Bio.Entrez.esearch()' to get a
list of GI numbers, and then call 'Bio.Entrez.efetch()' to download them
all.
However, the approved approach is to run the search with the history
feature. Then, we can fetch the results by reference to the search
results -- which the NCBI can anticipate and cache.
To do this, call 'Bio.Entrez.esearch()' as normal, but with the
additional argument of 'usehistory="y"':
<<>>> from Bio import Entrez
>>> Entrez.email = "history.user@example.com"
>>> search_handle = Entrez.esearch(db="nucleotide",
term="Opuntia[orgn] and rpl16", usehistory="y")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()
When you get the XML output back, it will still include the usual
search results:
<<>>> gi_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(gi_list)
However, you also get given two additional pieces of information, the
'WebEnv' session cookie, and the 'QueryKey':
<<>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]
Having stored these values in the variables 'webenv' and 'query_key',
we can use them as parameters to 'Bio.Entrez.efetch()' instead of giving
the GI numbers as identifiers.
While for small searches you might be OK downloading everything at
once, it's better to download in batches. You use the 'retstart' and
'retmax' parameters to specify which range of search results you want
returned (starting entry using zero-based counting, and maximum number
of results to return). For example:
<<batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                 retstart=start, retmax=batch_size,
                                 webenv=webenv, query_key=query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
For illustrative purposes, this example downloaded the FASTA records
in batches of three. Unless you are downloading genomes or chromosomes,
you would normally pick a larger batch size.
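The batching logic above boils down to stepping through the result count in fixed-size windows; each window maps directly onto one 'retstart'/'retmax' pair. Here is that arithmetic on its own, with a made-up count of 14 for illustration:

```python
# Compute the (retstart, end) windows that the download loop visits.
count = 14        # e.g. int(search_results["Count"])
batch_size = 3

batches = []
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)  # last window may be short
    batches.append((start, end))
print(batches)
```

Note that the last window is clipped to the total count, so no request ever asks for records beyond the end of the result set.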
8.13.2 Searching for and downloading abstracts using the history
=================================================================
Here is another history example, searching for papers published in
the last year about the Opuntia, and then downloading them into a file
in Medline format:
<<from Bio import Entrez
Entrez.email = "history.user@example.com"
search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Opuntia[ORGN]",
                                            usehistory="y"))
count = int(search_results["Count"])
print "Found %i results" % count
batch_size = 10
out_handle = open("recent_orchid_papers.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="pubmed", rettype="medline",
                                 retstart=start, retmax=batch_size,
                                 webenv=search_results["WebEnv"],
                                 query_key=search_results["QueryKey"])
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
At the time of writing, this gave 28 matches -- but because this is a
date dependent search, this will of course vary. As described in
Section 8.10.1 above, you can then use 'Bio.Medline' to parse the saved
records.
And finally, don't forget to include your own email address in the
Entrez email parameter.
6639
-----------------------------------
6642
(1) http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#Us
6643
erSystemRequirements
6645
(2) http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
6647
(3) http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html
6649
(4) http://www.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
6651
(5) http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6653
(6) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
6655
(7) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
6657
(8) http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6659
(9) http://www.ncbi.nlm.nih.gov/entrez/query/static/elink_help.html
6661
(10) http://www.ncbi.nlm.nih.gov/entrez/query/static/egquery_help.html
6663
(11) http://www.ncbi.nlm.nih.gov/entrez/query/static/espell_help.html
6665
(12) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html
6667
(13) http://www.ncbi.nlm.nih.gov/geo/
6669
(14) http://www.python.org/doc/lib/module-urllib.html
Chapter 9 Swiss-Prot and ExPASy
**********************************
9.1 Parsing Swiss-Prot files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Swiss-Prot (http://www.expasy.org/sprot) is a hand-curated database of
protein sequences. Biopython can parse the "plain text" Swiss-Prot file
format, which is still used for the UniProt Knowledgebase, which
combined Swiss-Prot, TrEMBL and PIR-PSD. We do not (yet) support the
UniProtKB XML file format.
9.1.1 Parsing Swiss-Prot records
=================================
In Section 5.2.2, we described how to extract the sequence of a
Swiss-Prot record as a 'SeqRecord' object. Alternatively, you can store
the Swiss-Prot record in a 'Bio.SwissProt.Record' object, which in fact
stores the complete information contained in the Swiss-Prot record. In
this section, we describe how to extract 'Bio.SwissProt.Record' objects
from a Swiss-Prot file.
To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot
record. There are several ways to do so, depending on where and how the
Swiss-Prot record is stored:
- Open a Swiss-Prot file locally:
'>>> handle = open("myswissprotfile.dat")'
- Open a gzipped Swiss-Prot file:
<<>>> import gzip
>>> handle = gzip.open("myswissprotfile.dat.gz")
- Open a Swiss-Prot file over the internet:
<<>>> import urllib
>>> handle = urllib.urlopen("http://www.somelocation.org/data/someswissprotfile.dat")
- Open a Swiss-Prot file over the internet from the ExPASy database
(see section 9.5.1):
<<>>> from Bio import ExPASy
>>> handle = ExPASy.get_sprot_raw(myaccessionnumber)
The key point is that for the parser, it doesn't matter how the
handle was created, as long as it points to data in the Swiss-Prot
format.
We can use 'Bio.SeqIO' as described in Section 5.2.2 to get file
format agnostic 'SeqRecord' objects. Alternatively, we can use
'Bio.SwissProt' to get 'Bio.SwissProt.Record' objects, which are a much
closer match to the underlying file format.
To read one Swiss-Prot record from the handle, we use the function
'read()':
<<>>> from Bio import SwissProt
>>> record = SwissProt.read(handle)
This function should be used if the handle points to exactly one
Swiss-Prot record. It raises a 'ValueError' if no Swiss-Prot record was
found, and also if more than one record was found.
We can now print out some information about this record:
<<>>> print record.description
'RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName:
Full=Naringenin-chalcone synthase 3;'
>>> for ref in record.references:
...     print "authors:", ref.authors
...     print "title:", ref.title
authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.;
title: "Molecular cloning and sequence analysis of chalcone synthase
cDNAs of Bromheadia finlaysoniana.";
>>> print record.organism_classification
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ...,
'Bromheadia']
6752
To parse a file that contains more than one Swiss-Prot record, we use
the 'parse' function instead. This function allows us to iterate over
the records in the file.

For example, let's parse the full Swiss-Prot database and collect all
the descriptions. You can download this from the ExPASy FTP site (1) as
a single gzipped file, 'uniprot_sprot.dat.gz' (about 300MB). This is a
compressed file containing a single file, 'uniprot_sprot.dat' (over
1.5GB).

As described at the start of this section, you can use the Python
library 'gzip' to open and uncompress a .gz file, like this:

>>> import gzip
>>> handle = gzip.open("uniprot_sprot.dat.gz")

However, uncompressing a large file takes time, and each time you open
the file for reading in this way, it has to be decompressed on the fly.
So, if you can spare the disk space you'll save time in the long run if
you first decompress the file to disk, to get the 'uniprot_sprot.dat'
file inside. Then you can open the file for reading as usual:

>>> handle = open("uniprot_sprot.dat")

As of June 2009, the full Swiss-Prot database downloaded from ExPASy
contained 468851 Swiss-Prot records. One concise way to build up a list
of the record descriptions is with a list comprehension:

>>> from Bio import SwissProt
>>> handle = open("uniprot_sprot.dat")
>>> descriptions = [record.description for record in SwissProt.parse(handle)]
>>> len(descriptions)
468851
>>> descriptions[:5]
['RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-1R;',
 'RecName: Full=Protein MGF 100-2L;']

Or, using a for loop over the record iterator:

>>> from Bio import SwissProt
>>> descriptions = []
>>> handle = open("uniprot_sprot.dat")
>>> for record in SwissProt.parse(handle):
...     descriptions.append(record.description)
...
>>> len(descriptions)
468851

Because this is such a large input file, either way takes about eleven
minutes on my new desktop computer (using the uncompressed
'uniprot_sprot.dat' file as input).

It is equally easy to extract any kind of information you'd like from
Swiss-Prot records. To see the members of a Swiss-Prot record, use

>>> dir(record)
['__doc__', '__init__', '__module__', 'accessions', 'annotation_update',
 'comments', 'created', 'cross_references', 'data_class', 'description',
 'entry_name', 'features', 'gene_name', 'host_organism', 'keywords',
 'molecule_type', 'organelle', 'organism', 'organism_classification',
 'references', 'seqinfo', 'sequence', 'sequence_length',
 'sequence_update', 'taxonomy_id']
9.1.2 Parsing the Swiss-Prot keyword and category list
=======================================================

Swiss-Prot also distributes a file 'keywlist.txt', which lists the
keywords and categories used in Swiss-Prot. The file contains entries in
the following form:

ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
SY   Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster;
SY   Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding.
GO   GO:0051537; 2 iron, 2 sulfur cluster binding
HI   Ligand: Iron; Iron-sulfur; 2Fe-2S.
HI   Ligand: Metal-binding; 2Fe-2S.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
DE   Protein, or part of a protein, whose three-dimensional structure has
DE   been resolved experimentally (for example by X-ray crystallography or
DE   NMR spectroscopy) and whose coordinates are available in the PDB
DE   database. Can also be used for theoretical models.
HI   Technical term: 3D-structure.
CA   Technical term.
//

The entries in this file can be parsed by the 'parse' function in the
'Bio.SwissProt.KeyWList' module. Each entry is then stored as a
'Bio.SwissProt.KeyWList.Record', which is a Python dictionary.

>>> from Bio.SwissProt import KeyWList
>>> handle = open("keywlist.txt")
>>> records = KeyWList.parse(handle)
>>> for record in records:
...     print record['ID']
...     print record['DE']
...
2Fe-2S.
Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms
complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from
the protein.
9.2 Parsing Prosite records
*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Prosite is a database containing protein domains, protein families,
functional sites, as well as the patterns and profiles to recognize
them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a
Prosite record is represented by the 'Bio.ExPASy.Prosite.Record' class,
whose members correspond to the different fields in a Prosite record.

In general, a Prosite file can contain more than one Prosite record.
For example, the full set of Prosite records, which can be downloaded as
a single file ('prosite.dat') from the ExPASy FTP site (2), contains
2073 records (version 20.24 released on 4 December 2007). To parse such
a file, we again make use of an iterator:

>>> from Bio.ExPASy import Prosite
>>> handle = open("myprositefile.dat")
>>> records = Prosite.parse(handle)

We can now take the records one at a time and print out some
information. For example, using the file containing the complete Prosite
database, we'd find:

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> record = records.next()
>>> record.accession
'PS00001'
>>> record = records.next()
>>> record.accession
'PS00004'
>>> record = records.next()
>>> record.accession
'PS00005'

and so on. If you're interested in how many Prosite records there are,
you could use

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> n = 0
>>> for record in records: n+=1
...
>>> n
2073

To read exactly one Prosite record from the handle, you can use the
'read' function:

>>> from Bio.ExPASy import Prosite
>>> handle = open("mysingleprositerecord.dat")
>>> record = Prosite.read(handle)

This function raises a ValueError if no Prosite record is found, and
also if more than one Prosite record is found.
9.3 Parsing Prosite documentation records
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

In the Prosite example above, the 'record.pdoc' accession numbers
'PDOC00001', 'PDOC00004', 'PDOC00005' and so on refer to Prosite
documentation. The Prosite documentation records are available from
ExPASy as individual files, and as one file ('prosite.doc') containing
all Prosite documentation records.

We use the parser in 'Bio.ExPASy.Prodoc' to parse Prosite
documentation records. For example, to create a list of all accession
numbers of Prosite documentation records, you can use

>>> from Bio.ExPASy import Prodoc
>>> handle = open("prosite.doc")
>>> records = Prodoc.parse(handle)
>>> accessions = [record.accession for record in records]

Again a 'read()' function is provided to read exactly one Prosite
documentation record from the handle.
9.4 Parsing Enzyme records
*=*=*=*=*=*=*=*=*=*=*=*=*=*

ExPASy's Enzyme database is a repository of information on enzyme
nomenclature. A typical Enzyme record looks as follows:

ID   3.1.1.34
DE   Lipoprotein lipase.
AN   Clearing factor lipase.
AN   Diacylglycerol lipase.
AN   Diglyceride lipase.
CA   Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.
CC   -!- Hydrolyzes triacylglycerols in chylomicrons and very low-density
CC       lipoproteins (VLDL).
CC   -!- Also hydrolyzes diacylglycerol.
PR   PROSITE; PDOC00110;
DR   P11151, LIPL_BOVIN ;  P11153, LIPL_CAVPO ;  P11602, LIPL_CHICK ;
DR   P55031, LIPL_FELCA ;  P06858, LIPL_HUMAN ;  P11152, LIPL_MOUSE ;
DR   O46647, LIPL_MUSVI ;  P49060, LIPL_PAPAN ;  P49923, LIPL_PIG   ;
DR   Q06000, LIPL_RAT   ;  Q29524, LIPL_SHEEP ;
//

In this example, the first line shows the EC (Enzyme Commission)
number of lipoprotein lipase (second line). Alternative names of
lipoprotein lipase are "clearing factor lipase", "diacylglycerol
lipase", and "diglyceride lipase" (lines 3 through 5). The line starting
with "CA" shows the catalytic activity of this enzyme. Comment lines
start with "CC". The "PR" line shows references to the Prosite
Documentation records, and the "DR" lines show references to Swiss-Prot
records. Not all of these entries are necessarily present in an Enzyme
record.
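The dictionary-like layout of an Enzyme record follows directly from these two-letter line codes. As an illustration only (plain Python, not Biopython's actual parser, which also post-processes values, e.g. joining continuation lines and splitting DR entries), grouping a record's lines by their code might look like:

```python
from collections import OrderedDict

def group_by_code(lines):
    """Group Enzyme-style record lines by their two-letter line code.

    A simplified sketch of the idea behind a dictionary-based Record
    class: each code maps to the list of values seen on lines with
    that code, and "//" marks the end of the record.
    """
    fields = OrderedDict()
    for line in lines:
        if line.startswith("//"):      # end-of-record marker
            break
        code, _, value = line.partition("   ")
        fields.setdefault(code, []).append(value.strip())
    return fields

record = group_by_code([
    "ID   3.1.1.34",
    "DE   Lipoprotein lipase.",
    "AN   Clearing factor lipase.",
    "AN   Diacylglycerol lipase.",
    "//",
])
```

Here 'record["AN"]' would collect both alternative-name lines, much as the real parser gathers repeated lines under one key.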
In Biopython, an Enzyme record is represented by the
'Bio.ExPASy.Enzyme.Record' class. This record derives from a Python
dictionary and has keys corresponding to the two-letter codes used in
Enzyme files. To read an Enzyme file containing one Enzyme record, use
the 'read' function in 'Bio.ExPASy.Enzyme':

>>> from Bio.ExPASy import Enzyme
>>> handle = open("lipoprotein.txt")
>>> record = Enzyme.read(handle)
>>> record["ID"]
'3.1.1.34'
>>> record["DE"]
'Lipoprotein lipase.'
>>> record["AN"]
['Clearing factor lipase.', 'Diacylglycerol lipase.', 'Diglyceride lipase.']
>>> record["CA"]
'Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.'
>>> record["CC"]
['Hydrolyzes triacylglycerols in chylomicrons and very low-density lipoproteins
(VLDL).', 'Also hydrolyzes diacylglycerol.']
>>> record["PR"]
['PDOC00110']
>>> record["DR"]
[['P11151', 'LIPL_BOVIN'], ['P11153', 'LIPL_CAVPO'], ['P11602', 'LIPL_CHICK'],
['P55031', 'LIPL_FELCA'], ['P06858', 'LIPL_HUMAN'], ['P11152', 'LIPL_MOUSE'],
['O46647', 'LIPL_MUSVI'], ['P49060', 'LIPL_PAPAN'], ['P49923', 'LIPL_PIG'],
['Q06000', 'LIPL_RAT'], ['Q29524', 'LIPL_SHEEP']]

The 'read' function raises a ValueError if no Enzyme record is found,
and also if more than one Enzyme record is found.

The full set of Enzyme records can be downloaded as a single file
('enzyme.dat') from the ExPASy FTP site (3), containing 4877 records
(release of 3 March 2009). To parse such a file containing multiple
Enzyme records, use the 'parse' function in 'Bio.ExPASy.Enzyme' to
obtain an iterator:

>>> from Bio.ExPASy import Enzyme
>>> handle = open("enzyme.dat")
>>> records = Enzyme.parse(handle)

We can now iterate over the records one at a time. For example, we can
make a list of all EC numbers for which an Enzyme record is available:

>>> ecnumbers = [record["ID"] for record in records]
9.5 Accessing the ExPASy server
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Swiss-Prot, Prosite, and Prosite documentation records can be
downloaded from the ExPASy web server at http://www.expasy.org. Six
kinds of queries are available from ExPASy:

get_prodoc_entry   To download a Prosite documentation record in HTML format
get_prosite_entry  To download a Prosite record in HTML format
get_prosite_raw    To download a Prosite or Prosite documentation record in raw format
get_sprot_raw      To download a Swiss-Prot record in raw format
sprot_search_ful   To search for a Swiss-Prot record
sprot_search_de    To search for a Swiss-Prot record

To access this web server from a Python script, we use the
'Bio.ExPASy' module.
9.5.1 Retrieving a Swiss-Prot record
=====================================

Let's say we are looking at chalcone synthases for Orchids (see
section 2.3 for some justification for looking for interesting things
about orchids). Chalcone synthase is involved in flavanoid biosynthesis
in plants, and flavanoids make lots of cool things like pigment colors
and UV protectants.

If you do a search on Swiss-Prot, you can find three orchid proteins
for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let's
write a script which grabs these, and parses out some interesting
information.

First, we grab the records, using the 'get_sprot_raw()' function of
'Bio.ExPASy'. This function is very nice since you can feed it an id and
get back a handle to a raw text record (no html to mess with!). We can
then use 'Bio.SwissProt.read' to pull out the Swiss-Prot record, or
'Bio.SeqIO.read' to get a SeqRecord. The following code accomplishes
this:

>>> from Bio import ExPASy
>>> from Bio import SwissProt
>>> accessions = ["O23729", "O23730", "O23731"]
>>> records = []
>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     record = SwissProt.read(handle)
...     records.append(record)

If the accession number you provided to 'ExPASy.get_sprot_raw' does
not exist, then 'SwissProt.read(handle)' will raise a 'ValueError'. You
can catch 'ValueError' exceptions to detect invalid accession numbers:

>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     try:
...         record = SwissProt.read(handle)
...     except ValueError:
...         print "WARNING: Accession %s not found" % accession
...     else:
...         records.append(record)
9.5.2 Searching Swiss-Prot
===========================

Now, you may remark that I knew the records' accession numbers
beforehand. Indeed, 'get_sprot_raw()' needs either the entry name or an
accession number. When you don't have them handy, you can use one of the
'sprot_search_de()' or 'sprot_search_ful()' functions.

'sprot_search_de()' searches in the ID, DE, GN, OS and OG lines;
'sprot_search_ful()' searches in (nearly) all the fields. They are
detailed on http://www.expasy.org/cgi-bin/sprot-search-de and
http://www.expasy.org/cgi-bin/sprot-search-ful respectively. Note that
they don't search in TrEMBL by default (argument 'trembl'). Note also
that they return html pages; however, accession numbers are quite easily
extractable:

>>> from Bio import ExPASy
>>> import re
>>> handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase")
>>> # or:
>>> # handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}")
>>> html_results = handle.read()
>>> if "Number of sequences found" in html_results:
...     ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
... else:
...     ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)
9.5.3 Retrieving Prosite and Prosite documentation records
===========================================================

Prosite and Prosite documentation records can be retrieved either in
HTML format, or in raw format. To parse Prosite and Prosite
documentation records with Biopython, you should retrieve the records in
raw format. For other purposes, however, you may be interested in these
records in HTML format.

To retrieve a Prosite or Prosite documentation record in raw format,
use 'get_prosite_raw()'. For example, to download a Prosite record and
print it out in raw text format, use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> text = handle.read()
>>> print text

To retrieve a Prosite record and parse it into a
'Bio.ExPASy.Prosite.Record' object, use

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prosite
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> record = Prosite.read(handle)

The same function can be used to retrieve a Prosite documentation
record and parse it into a 'Bio.ExPASy.Prodoc.Record' object:

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prodoc
>>> handle = ExPASy.get_prosite_raw('PDOC00001')
>>> record = Prodoc.read(handle)

For non-existing accession numbers, 'ExPASy.get_prosite_raw' returns a
handle to an empty string. When faced with an empty string,
'Prosite.read' and 'Prodoc.read' will raise a ValueError. You can catch
these exceptions to detect invalid accession numbers.

The functions 'get_prosite_entry()' and 'get_prodoc_entry()' are used
to download Prosite and Prosite documentation records in HTML format. To
create a web page showing one Prosite record, you can use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_entry('PS00001')
>>> html = handle.read()
>>> output = open("myprositerecord.html", "w")
>>> output.write(html)
>>> output.close()

and similarly for a Prosite documentation record:

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prodoc_entry('PDOC00001')
>>> html = handle.read()
>>> output = open("myprodocrecord.html", "w")
>>> output.write(html)
>>> output.close()

For these functions, an invalid accession number returns an error
message in HTML format.
9.6 Scanning the Prosite database
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

ScanProsite (4) allows you to scan protein sequences online against
the Prosite database by providing a UniProt or PDB sequence identifier
or the sequence itself. For more information about ScanProsite, please
see the ScanProsite documentation (5) as well as the documentation for
programmatic access of ScanProsite (6).

You can use Biopython's 'Bio.ExPASy.ScanProsite' module to scan the
Prosite database from Python. This module both helps you to access
ScanProsite programmatically, and to parse the results returned by
ScanProsite. To scan for Prosite patterns in the following protein
sequence:

MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT
CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN

you can use the following code:

>>> sequence = ("MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT"
...             "CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN")
>>> from Bio.ExPASy import ScanProsite
>>> handle = ScanProsite.scan(seq=sequence)

By executing 'handle.read()', you can obtain the search results in raw
XML format. Instead, let's use 'Bio.ExPASy.ScanProsite.read' to parse
the raw XML into a Python object:

>>> result = ScanProsite.read(handle)
>>> type(result)
<class 'Bio.ExPASy.ScanProsite.Record'>

A 'Bio.ExPASy.ScanProsite.Record' object is derived from a list, with
each element in the list storing one ScanProsite hit. This object also
stores the number of hits, as well as the number of search sequences, as
returned by ScanProsite. This ScanProsite search resulted in six hits:

>>> result.n_match
6
>>> result[0]
{'signature_ac': u'PS50948', 'level': u'0', 'stop': 98, 'sequence_ac':
u'USERSEQ1', 'start': 16, 'score': u'8.873'}
>>> result[1]
{'start': 37, 'stop': 39, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[2]
{'start': 45, 'stop': 48, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00006'}
>>> result[3]
{'start': 60, 'stop': 62, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[4]
{'start': 80, 'stop': 83, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00006'}
>>> result[5]
{'start': 106, 'stop': 111, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00008'}

Other ScanProsite parameters can be passed as keyword arguments; see
the documentation for programmatic access of ScanProsite (7) for more
information. As an example, passing 'lowscore=1' to include matches with
low level scores lets us find one additional hit:

>>> handle = ScanProsite.scan(seq=sequence, lowscore=1)
>>> result = ScanProsite.read(handle)
>>> result.n_match
7
-----------------------------------

(1) ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

(2) ftp://ftp.expasy.org/databases/prosite/prosite.dat

(3) ftp://ftp.expasy.org/databases/enzyme/enzyme.dat

(4) http://www.expasy.org/tools/scanprosite/

(5) http://www.expasy.org/tools/scanprosite/scanprosite-doc.html

(6) http://www.expasy.org/tools/scanprosite/ScanPrositeREST.html

(7) http://www.expasy.org/tools/scanprosite/ScanPrositeREST.html
Chapter 10 Going 3D: The PDB module
**************************************

Biopython also allows you to explore the extensive realm of
macromolecular structure. Biopython comes with a PDBParser class that
produces a Structure object. The Structure object can be used to access
the atomic data in the file in a convenient manner.

10.1 Structure representation
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

A macromolecular structure is represented using a structure, model,
chain, residue, atom (or SMCRA) hierarchy. The figure below shows a
UML class diagram of the SMCRA data structure. Such a data structure is
not necessarily best suited for the representation of the macromolecular
content of a structure, but it is absolutely necessary for a good
interpretation of the data present in a file that describes the
structure (typically a PDB or MMCIF file). If this hierarchy cannot
represent the contents of a structure file, it is fairly certain that
the file contains an error or at least does not describe the structure
unambiguously. If a SMCRA data structure cannot be generated, there is
reason to suspect a problem. Parsing a PDB file can thus be used to
detect likely problems. We will give several examples of this in section
10.5.1.
Structure, Model, Chain and Residue are all subclasses of the Entity
base class. The Atom class only (partly) implements the Entity interface
(because an Atom does not have children).

For each Entity subclass, you can extract a child by using a unique id
for that child as a key (e.g. you can extract an Atom object from a
Residue object by using an atom name string as a key, and you can extract
a Chain object from a Model object by using its chain identifier as a
key).

Disordered atoms and residues are represented by DisorderedAtom and
DisorderedResidue classes, which are both subclasses of the
DisorderedEntityWrapper base class. They hide the complexity associated
with disorder and behave exactly as Atom and Residue objects.

In general, a child Entity object (i.e. Atom, Residue, Chain, Model)
can be extracted from its parent (i.e. Residue, Chain, Model, Structure,
respectively) by using an id as a key.

child_entity=parent_entity[child_id]

You can also get a list of all child Entities of a parent Entity
object. Note that this list is sorted in a specific way (e.g. according
to chain identifier for Chain objects in a Model object).

child_list=parent_entity.get_list()

You can also get the parent from a child.

parent_entity=child_entity.get_parent()

At all levels of the SMCRA hierarchy, you can also extract a full id.
The full id is a tuple containing all id's starting from the top object
(Structure) down to the current object. A full id for a Residue object
e.g. is something like:

full_id=residue.get_full_id()
print full_id
("1abc", 0, "A", (" ", 10, "A"))

This corresponds to:

- The Structure with id "1abc"
- The Model with id 0
- The Chain with id "A"
- The Residue with id (" ", 10, "A").

The Residue id indicates that the residue is not a hetero-residue
(nor a water) because it has a blank hetero field, that its sequence
identifier is 10 and that its insertion code is "A".

Some other useful methods:

# get the entity's id
entity_id=entity.get_id()

# check if there is a child with a given id
entity.has_id(entity_id)

# get number of children
nr_children=len(entity)
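The child-by-id, get_list and get_parent behaviour described above can be sketched with a minimal stand-in class (plain Python, not Biopython's actual Entity implementation):

```python
class MiniEntity:
    """A minimal stand-in for the idea behind Bio.PDB's Entity base
    class: children are kept in a dict keyed by id, and each child
    knows its parent."""

    def __init__(self, entity_id):
        self.id = entity_id
        self.parent = None
        self.child_dict = {}

    def add(self, child):
        # no sanity checks here, mirroring the raw interface
        child.parent = self
        self.child_dict[child.id] = child

    def __getitem__(self, child_id):  # child_entity = parent_entity[child_id]
        return self.child_dict[child_id]

    def get_list(self):               # children, sorted in a fixed order
        return sorted(self.child_dict.values(), key=lambda c: str(c.id))

    def get_parent(self):
        return self.parent

    def has_id(self, child_id):
        return child_id in self.child_dict

    def __len__(self):                # number of children
        return len(self.child_dict)

# build a tiny Structure -> Model -> Chain fragment
structure = MiniEntity("1abc")
model = MiniEntity(0)
chain = MiniEntity("A")
structure.add(model)
model.add(chain)
```

With this sketch, 'structure[0]' returns the model and 'chain.get_parent()' returns it again, just as in the real SMCRA hierarchy.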
It is possible to delete, rename, add, etc. child entities from a
parent entity, but this does not include any sanity checks (e.g. it is
possible to add two residues with the same id to one chain). This really
should be done via a nice Decorator class that includes integrity
checking, but you can take a look at the code (Entity.py) if you want to
use the raw interface.

10.1.1 Structure
=================
The Structure object is at the top of the hierarchy. Its id is a user
given string. The Structure contains a number of Model children. Most
crystal structures (but not all) contain a single model, while NMR
structures typically consist of several models. Disorder in crystal
structures of large parts of molecules can also result in several
models.

10.1.1.1 Constructing a Structure object
-----------------------------------------

A Structure object is produced by a PDBParser object:

from Bio.PDB.PDBParser import PDBParser

p=PDBParser(PERMISSIVE=1)

structure_id="1fat"
filename="pdb1fat.ent"
s=p.get_structure(structure_id, filename)

The PERMISSIVE flag indicates that a number of common problems (see
10.5.1) associated with PDB files will be ignored (but note that some
atoms and/or residues will be missing). If the flag is not present a
PDBConstructionException will be generated during the parse operation.
10.1.1.2 Header and trailer
----------------------------

You can extract the header and trailer (simple lists of strings) of
the PDB file from the PDBParser object with the get_header and
get_trailer methods.

10.1.2 Model
=============

The id of the Model object is an integer, which is derived from the
position of the model in the parsed file (they are automatically
numbered starting from 0). The Model object stores a list of Chain
children.

Get the first model from a Structure object:

first_model=structure[0]

10.1.3 Chain
=============

The id of a Chain object is derived from the chain identifier in the
structure file, and can be any string. Each Chain in a Model object has
a unique id. The Chain object stores a list of Residue children.

Get the Chain object with identifier "A" from a Model object:

chain_A=model["A"]
10.1.4 Residue
===============

Unsurprisingly, a Residue object stores a set of Atom children. In
addition, it also contains a string that specifies the residue name
(e.g. "ASN") and the segment identifier of the residue (well known to
X-PLOR users, but not used in the construction of the SMCRA data
structure).

The id of a Residue object is composed of three parts: the hetero
field (hetfield), the sequence identifier (resseq) and the insertion
code (icode).

The hetero field is a string: it is "W" for waters, "H_" followed by
the residue name (e.g. "H_FUC") for other hetero residues, and blank for
standard amino and nucleic acids. This scheme is adopted for reasons
described in section 10.3.1.

The second field in the Residue id is the sequence identifier, an
integer describing the position of the residue in the chain.

The third field is a string, consisting of the insertion code. The
insertion code is sometimes used to preserve a certain desirable residue
numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr
80 and an Asn 81 residue) could e.g. have sequence identifiers and
insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the
residue numbering scheme stays in tune with that of the wild type
structure.

Let's give some examples. Asn 10 with a blank insertion code would
have residue id (" ", 10, " "). Water 10 would have residue id ("W", 10,
" "). A glucose molecule (a hetero residue with residue name GLC) with
sequence identifier 10 would have residue id ("H_GLC", 10, " "). In this
way, the three residues (with the same insertion code and sequence
identifier) can be part of the same chain because their residue id's are
different.

In most cases, the hetflag and insertion code fields will be blank,
e.g. (" ", 10, " "). In these cases, the sequence identifier can be used
as a shortcut for the full id:

# use full id
res10=chain[(" ", 10, " ")]
# use shortcut
res10=chain[10]

Each Residue object in a Chain object should have a unique id.
However, disordered residues are dealt with in a special way, as
described in section 10.2.3.2.

A Residue object has a number of additional methods:

r.get_resname() # return residue name, e.g. "ASN"
r.get_segid() # return the SEGID, e.g. "CHN1"
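The residue id scheme described above is easy to compute. The helper below is hypothetical (it is not a Bio.PDB function), but it reproduces the hetfield convention exactly as stated:

```python
def residue_id(resname, resseq, icode=" ", is_water=False, is_hetero=False):
    """Build a Bio.PDB-style residue id tuple (hetfield, resseq, icode).

    hetfield is "W" for waters, "H_" plus the residue name for other
    hetero residues, and a blank string for standard amino and
    nucleic acids.
    """
    if is_water:
        hetfield = "W"
    elif is_hetero:
        hetfield = "H_" + resname
    else:
        hetfield = " "
    return (hetfield, resseq, icode)
```

Applied to the examples above: Asn 10 gives (" ", 10, " "), water 10 gives ("W", 10, " "), and glucose (GLC) 10 gives ("H_GLC", 10, " ").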
10.1.5 Atom
============

The Atom object stores the data associated with an atom, and has no
children. The id of an atom is its atom name (e.g. "OG" for the side
chain oxygen of a Ser residue). An Atom id needs to be unique in a
Residue. Again, an exception is made for disordered atoms, as described
in section 10.2.2.

In a PDB file, an atom name consists of 4 chars, typically with
leading and trailing spaces. Often these spaces can be removed for ease
of use (e.g. an amino acid C alpha atom is labeled ".CA." in a PDB
file, where the dots represent spaces). To generate an atom name (and
thus an atom id) the spaces are removed, unless this would result in a
name collision in a Residue (i.e. two Atom objects with the same atom
name and id). In the latter case, the atom name including spaces is
tried. This situation can e.g. happen when one residue contains atoms
with names ".CA." and "CA..", although this is not very likely.

The atomic data stored includes the atom name, the atomic coordinates
(including standard deviation if present), the B factor (including
anisotropic B factors and standard deviation if present), the altloc
specifier and the full atom name including spaces. Less used items like
the atom element number or the atomic charge sometimes specified in a
PDB file are not stored.

An Atom object has the following additional methods:

a.get_name() # atom name (spaces stripped, e.g. "CA")
a.get_id() # id (equals atom name)
a.get_coord() # atomic coordinates
a.get_bfactor() # B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifier
a.get_sigatm() # std. dev. of atomic parameters
a.get_siguij() # std. dev. of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")

To represent the atom coordinates, siguij, anisotropic B factor and
sigatm, Numpy arrays are used.

10.2 Disorder
*=*=*=*=*=*=*=
10.2.1 General approach
========================

Disorder should be dealt with from two points of view: the atom and
the residue points of view. In general, we have tried to encapsulate all
the complexity that arises from disorder. If you just want to loop over
all C alpha atoms, you do not care that some residues have a disordered
side chain. On the other hand it should also be possible to represent
disorder completely in the data structure. Therefore, disordered atoms
or residues are stored in special objects that behave as if there is no
disorder. This is done by only representing a subset of the disordered
atoms or residues. Which subset is picked (e.g. which of the two
disordered OG side chain atom positions of a Ser residue is used) can be
specified by the user.
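The "behave as if there is no disorder" trick relies on forwarding: the wrapper passes any attribute access it does not handle itself on to a selected child. A minimal illustration of this delegation pattern in plain Python (not the actual DisorderedEntityWrapper code; note that the real DisorderedAtom selects the highest-occupancy child by default, while this sketch simply keeps the first one added):

```python
class DisorderedWrapper:
    """Forward uncaught attribute access to a selected child object,
    in the spirit of Bio.PDB's DisorderedEntityWrapper."""

    def __init__(self):
        self.children = {}    # children keyed by altloc
        self.selected = None

    def add(self, key, child):
        self.children[key] = child
        if self.selected is None:
            self.selected = child

    def select(self, key):
        self.selected = self.children[key]

    def __getattr__(self, name):
        # only called when normal lookup fails: delegate to the child
        return getattr(self.selected, name)

class FakeAtom:
    """A stand-in atom carrying just an altloc and an occupancy."""
    def __init__(self, altloc, occupancy):
        self.altloc = altloc
        self.occupancy = occupancy
    def get_occupancy(self):
        return self.occupancy

wrapper = DisorderedWrapper()
wrapper.add("A", FakeAtom("A", 0.7))
wrapper.add("B", FakeAtom("B", 0.3))
```

Calling 'wrapper.get_occupancy()' is transparently answered by the currently selected child, and 'wrapper.select("B")' switches which child answers.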
10.2.2 Disordered atoms
========================

Disordered atoms are represented by ordinary Atom objects, but all
Atom objects that represent the same physical atom are stored in a
DisorderedAtom object. Each Atom object in a DisorderedAtom object can
be uniquely indexed using its altloc specifier. The DisorderedAtom
object forwards all uncaught method calls to the selected Atom object,
by default the one that represents the atom with the highest
occupancy. The user can of course change the selected Atom object,
making use of its altloc specifier. In this way atom disorder is
represented correctly without much additional complexity. In other
words, if you are not interested in atom disorder, you will not be
bothered by it.

Each disordered atom has a characteristic altloc identifier. You can
specify that a DisorderedAtom object should behave like the Atom object
associated with a specific altloc identifier:

atom.disordered_select("A") # select altloc A atom
print atom.get_altloc()
"A"

atom.disordered_select("B") # select altloc B atom
print atom.get_altloc()
"B"
10.2.3 Disordered residues
7623
===========================
7627
10.2.3.1 Common case
7628
---------------------
7630
The most common case is a residue that contains one or more disordered
7631
atoms. This is evidently solved by using DisorderedAtom objects to
7632
represent the disordered atoms, and storing the DisorderedAtom object in
7633
a Residue object just like ordinary Atom objects. The DisorderedAtom
7634
will behave exactly like an ordinary atom (in fact the atom with the
7635
highest occupancy) by forwarding all uncaught method calls to one of the
7636
Atom objects (the selected Atom object) it contains.
10.2.3.2 Point mutations
-------------------------

A special case arises when disorder is due to a point mutation, i.e.
when two or more point mutants of a polypeptide are present in the
crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to different residue types (e.g. let's
say Ser 60 and Cys 60) they should not be stored in a single Residue
object as in the common case. In this case, each residue is represented
by one Residue object, and both Residue objects are stored in a
DisorderedResidue object.
The DisorderedResidue object forwards all uncaught methods to the
selected Residue object (by default the last Residue object added), and
thus behaves like an ordinary residue. Each Residue object in a
DisorderedResidue object can be uniquely identified by its residue name.
In the above example, residue Ser 60 would have id "SER" in the
DisorderedResidue object, while residue Cys 60 would have id "CYS". The
user can select the active Residue object in a DisorderedResidue object
using this id.
10.3 Hetero residues
*=*=*=*=*=*=*=*=*=*=*

10.3.1 Associated problems
===========================

A common problem with hetero residues is that several hetero and
non-hetero residues present in the same chain share the same sequence
identifier (and insertion code). Therefore, to generate a unique id for
each hetero residue, waters and other hetero residues are treated in a
special way.
Remember that Residue objects have the tuple (hetfield, resseq, icode)
as id. The hetfield is blank (" ") for amino and nucleic acids, and a
string for waters and other hetero residues. The content of the hetfield
is explained below.
10.3.2 Water residues
======================

The hetfield string of a water residue consists of the letter "W". So
a typical residue id for a water is ("W", 1, " ").
10.3.3 Other hetero residues
=============================

The hetfield string for other hetero residues starts with "H_"
followed by the residue name. A glucose molecule e.g. with residue name
"GLC" would have hetfield "H_GLC". Its residue id could e.g. be
("H_GLC", 1, " ").
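The id convention above is easy to reproduce in plain Python. The
helper below is purely illustrative (it is not part of Bio.PDB; the
function name and keyword arguments are invented for this sketch):

```python
def make_residue_id(resname, resseq, icode=" ", is_water=False, is_hetero=False):
    """Build a residue id tuple (hetfield, resseq, icode) following
    the convention described above: blank hetfield for amino and
    nucleic acids, "W" for waters, "H_" + residue name otherwise.
    Hypothetical helper, for illustration only."""
    if is_water:
        hetfield = "W"
    elif is_hetero:
        hetfield = "H_" + resname
    else:
        hetfield = " "
    return (hetfield, resseq, icode)

print(make_residue_id("HOH", 1, is_water=True))    # ('W', 1, ' ')
print(make_residue_id("GLC", 10, is_hetero=True))  # ('H_GLC', 10, ' ')
print(make_residue_id("SER", 60))                  # (' ', 60, ' ')
```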
10.4 Some random usage examples
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

Parse a PDB file, and extract some Model, Chain, Residue and Atom
objects.
<<from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with
resseq 10).
<<residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]
Print all hetero residues in chain.
<<for residue in chain.get_list():
    residue_id=residue.get_id()
    hetfield=residue_id[0]
    if hetfield[0]=="H":
        print residue_id
Print out the coordinates of all CA atoms in a structure with B factor
greater than 50.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.has_id("CA"):
                ca=residue["CA"]
                if ca.get_bfactor()>50.0:
                    print ca.get_coord()
Print out all the residues that contain disordered atoms.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                resseq=residue.get_id()[1]
                resname=residue.get_resname()
                model_id=model.get_id()
                chain_id=chain.get_id()
                print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if
present). This will make sure that the SMCRA data structure will behave
as if only the atoms with altloc A are present.
<<for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.is_disordered():
                for atom in residue.get_list():
                    if atom.is_disordered():
                        if atom.disordered_has_id("A"):
                            atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting
of a Ser and a Cys residue. Make sure that residue 10 of this chain
behaves as the Cys residue.
<<residue=chain[10]
residue.disordered_select("CYS")
10.5 Common problems in PDB files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

10.5.1 Examples
================

The PDBParser/Structure class was tested on about 800 structures (each
belonging to a unique SCOP superfamily). This takes about 20 minutes, or
on average 1.5 seconds per structure. Parsing the structure of the large
ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10
seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data
structure could not be built. In all three cases, the likely cause is an
error in the PDB file that should be corrected. Generating an exception
in these cases is much better than running the chance of incorrectly
describing the structure in a data structure.
10.5.1.1 Duplicate residues
----------------------------

One structure contains two amino acid residues in one chain with the
same sequence identifier (resseq 3) and icode. Upon inspection it was
found that this chain contains the residues Thr A3, ..., Gly A202, Leu
A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar
situations exist for structure 1FFK (which e.g. contains Gly B64, Met
B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).
10.5.1.2 Duplicate atoms
-------------------------

Structure 1EJG contains a Ser/Pro point mutation in chain A at
position 22. In turn, Ser 22 contains some disordered atoms. As
expected, all atoms belonging to Ser 22 have a non-blank altloc
specifier (B or C). All atoms of Pro 22 have altloc A, except the N
atom, which has a blank altloc. This generates an exception, because all
atoms belonging to two residues at a point mutation should have a
non-blank altloc. It turns out that this atom is probably shared by Ser
and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem
in the file: the N atom should be present in both the Ser and the Pro
residue, in both cases associated with a suitable altloc identifier.
10.5.2 Automatic correction
============================

Some errors are quite common and can be easily corrected without much
risk of making a wrong interpretation. These cases are listed below.
10.5.2.1 A blank altloc for a disordered atom
----------------------------------------------

Normally each disordered atom should have a non-blank altloc
identifier. However, there are many structures that do not follow this
convention, and have a blank and a non-blank identifier for two
disordered positions of the same atom. This is automatically interpreted
in the right way.
10.5.2.2 Broken chains
-----------------------

Sometimes a structure contains a list of residues belonging to chain
A, followed by residues belonging to chain B, and again followed by
residues belonging to chain A, i.e. the chains are "broken". This is
correctly interpreted.
10.5.3 Fatal errors
====================

Sometimes a PDB file cannot be unambiguously interpreted. Rather than
guessing and risking a mistake, an exception is generated, and the user
is expected to correct the PDB file. These cases are listed below.
10.5.3.1 Duplicate residues
----------------------------

All residues in a chain should have a unique id. This id is generated
based on:

- The sequence identifier (resseq).
- The insertion code (icode).
- The hetfield string ("W" for waters and "H_" followed by the
  residue name for other hetero residues).
- The residue names of the residues in the case of point mutations
  (to store the Residue objects in a DisorderedResidue object).

If this does not lead to a unique id, something is quite likely wrong,
and an exception is generated.
10.5.3.2 Duplicate atoms
-------------------------

All atoms in a residue should have a unique id. This id is generated
based on:

- The atom name (without spaces, or with spaces if a problem arises).
- The altloc specifier.

If this does not lead to a unique id, something is quite likely wrong,
and an exception is generated.
10.6 Other features
*=*=*=*=*=*=*=*=*=*=

There are also some tools to analyze a crystal structure. Tools exist
to superimpose two coordinate sets (SVDSuperimposer), to extract
polypeptides from a structure (Polypeptide), to perform neighbor lookup
(NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup
is done using a KD tree module written in C++. It is very fast and also
includes a fast method to find all point pairs within a certain distance
of each other.
A Polypeptide object is simply a UserList of Residue objects. You can
construct a list of Polypeptide objects from a Structure object as
follows:
<<model_nr=1
polypeptide_list=build_peptides(structure, model_nr)
for polypeptide in polypeptide_list:
    print polypeptide

The Polypeptide objects are always created from a single Model (in
this case model 1).
Chapter 11 Bio.PopGen: Population genetics
*********************************************

Bio.PopGen is a new Biopython module supporting population genetics,
available in Biopython 1.44 onwards.
The medium term objective for the module is to support widely used
data formats, applications and databases. This module is currently under
intense development, and support for new features should appear at a
rather fast pace. Unfortunately this might also entail some instability
in the API, especially if you are using a CVS version. APIs that are
made available in public versions should be much more stable.
11.1 GenePop
*=*=*=*=*=*=*

GenePop (http://genepop.curtin.edu.au/) is a popular population
genetics software package supporting Hardy-Weinberg tests, linkage
disequilibrium, population differentiation, basic statistics, F_st and
migration estimates, among others. GenePop does not supply sequence
based statistics as it doesn't handle sequence data. The GenePop file
format is supported by a wide range of other population genetics
software applications, thus making it a relevant format in the
population genetics field.
Bio.PopGen provides a parser and generator for the GenePop file format.
Utilities to manipulate the content of a record are also provided. Here
is an example on how to read a GenePop file (you can find example
GenePop data files in the Test/PopGen directory of Biopython):
<<from Bio.PopGen import GenePop

handle = open("example.gen")
rec = GenePop.parse(handle)
handle.close()
This will read a file called example.gen and parse it. If you do print
rec, the record will be output again, in GenePop format.
The most important information in rec will be the loci names and
population information (but there is more -- use help(GenePop.Record) to
check the API documentation). Loci names can be found in rec.loci_list.
Population information can be found in rec.populations. Populations is a
list with one element per population. Each element is itself a list of
individuals, and each individual is a pair composed of the individual
name and a list of alleles (2 per marker). Here is an example for
rec.populations:
<<[
    [
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ]
]

So we have two populations, the first with two individuals, the second
with only one. The first individual of the first population is called
Ind1, and allelic information for each of the 3 loci follows. Please
note that for any locus, information might be missing (see as an
example, Ind2 above).
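Since rec.populations is just nested Python lists and tuples, standard
Python is enough to walk it. The sketch below uses the example data
above as a plain data structure (no Biopython required) to count the
missing allele calls per individual:

```python
# The example rec.populations structure from above, as plain Python data.
populations = [
    [   # population 1
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [   # population 2
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ],
]

# Count, per individual, how many allele calls are missing (None).
for pop_index, population in enumerate(populations):
    for name, loci in population:
        missing = sum(1 for pair in loci for allele in pair if allele is None)
        print(pop_index, name, missing)
```

As expected from the discussion above, Ind2 has three missing allele
calls and the other two individuals have none.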
A few utility functions to manipulate GenePop records are made
available; here is an example:
<<from Bio.PopGen import GenePop

#Imagine that you have loaded rec, as per the code snippet above...

rec.remove_population(pos)
#Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.

rec.remove_locus_by_position(pos)
#Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.

rec.remove_locus_by_name(name)
#Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.

rec_loci = rec.split_in_loci()
#Splits a record in loci, that is, for each locus it creates a new
# record, with a single locus and all populations.
# The result is returned in a dictionary, each key being the locus name.
# The value is the GenePop record.
# rec is not altered.

rec_pops = rec.split_in_pops(pop_names)
#Splits a record in populations, that is, for each population it creates
# a new record, with a single population and all loci.
# The result is returned in a dictionary, each key being
# the population name. As population names are not available in GenePop,
# they are passed in an array (pop_names).
# The value of each dictionary entry is the GenePop record.
# rec is not altered.
GenePop does not support population names, a limitation which can be
cumbersome at times. Functionality to enable population names is
currently being planned for Biopython. These extensions won't break
compatibility in any way with the standard format. In the medium term,
we would also like to support the GenePop web service.
11.2 Coalescent simulation
*=*=*=*=*=*=*=*=*=*=*=*=*=*

A coalescent simulation is a backward-in-time model of population
genetics. A simulation of ancestry is done until the Most
Recent Common Ancestor (MRCA) is found. This ancestry relationship,
starting on the MRCA and ending on the current generation sample, is
sometimes called a genealogy. Simple cases assume a population of
constant size in time, haploidy, no population structure, and simulate
the alleles of a single locus under no selection pressure.
Coalescent theory is used in many fields like selection detection,
estimation of demographic parameters of real populations or disease gene
mapping.
The strategy followed in the Biopython implementation of the
coalescent was not to create a new, built-in simulator from scratch but
to use an existing one, SIMCOAL2
(http://cmpg.unibe.ch/software/simcoal2/). SIMCOAL2 allows for, among
others, population structure, multiple demographic events, simulation of
multiple types of loci (SNPs, sequences, STRs/microsatellites and RFLPs)
with recombination, diploidy, multiple chromosomes or ascertainment
bias. Notably, SIMCOAL2 doesn't support any selection model. We
recommend reading SIMCOAL2's documentation, available at the link above.
The input for SIMCOAL2 is a file specifying the desired demography and
genome; the output is a set of files (typically around 1000) with the
simulated genomes of a sample of individuals per subpopulation. This set
of files can be used in many ways, for example to compute confidence
intervals inside which certain statistics (e.g., F_st or Tajima's D) are
expected to lie. Real population genetics dataset statistics can then be
compared to those confidence intervals.
Biopython's coalescent code allows you to create demographic scenarios
and genomes and to run SIMCOAL2.
11.2.1 Creating scenarios
==========================

Creating a scenario involves both creating a demography and a
chromosome structure. In many cases (e.g. when doing Approximate
Bayesian Computation -- ABC) it is important to test many parameter
variations (e.g. vary the effective population size, N_e, between 10,
50, 500 and 1000 individuals). The code provided allows for the
simulation of scenarios with different demographic parameters very
easily.
Below we see how we can create scenarios and then how to simulate them.
11.2.1.1 Demography
--------------------

A few predefined demographies are built-in; all have two shared
parameters: sample size (called sample_size on the template, see below
for its use) per deme and deme size, i.e. subpopulation size (pop_size).
All demographies are available as templates where all parameters can be
varied, and each template has a system name. The predefined
demographies/templates are:
Single population, constant size The standard parameters are enough to
  specify it. Template name: simple.
Single population, bottleneck As seen on figure 11.2.1.1. The
  parameters are current population size (pop_size on template, ne3 on
  figure), time of expansion, given as the generation in the past when
  it occurred (expand_gen), effective population size during the
  bottleneck (ne2), time of contraction (contract_gen) and original
  size in the remote past (ne1). Template name: bottle.
Island model The typical island model. The total number of demes is
  specified by total_demes and the migration rate by mig. Template name
  is island.
Stepping stone model - 1 dimension The stepping stone model in 1
  dimension, extremes disconnected. The total number of demes is
  total_demes, migration rate is mig. Template name is ssm_1d.
Stepping stone model - 2 dimensions The stepping stone model in 2
  dimensions, extremes disconnected. The parameters are x for the
  horizontal dimension and y for the vertical (the total number of
  demes being x times y), migration rate is mig. Template name is
  ssm_2d.
In our first example, we will generate a template for a single
population, constant size model with a sample size of 30 and a deme size
of 100. The code for this is:
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template

generate_simcoal_from_template('simple',
    [(1, [('SNP', [24, 0.0005, 0.0])])],
    [('sample_size', [30]),
     ('pop_size', [100])])
Executing this code snippet will generate a file in the current
directory called simple_100_30.par; this file can be given as input to
SIMCOAL2 to simulate the demography (below we will see how Biopython can
take care of calling SIMCOAL2).
This code consists of a single function call; let's discuss it
parameter by parameter.
The first parameter is the template id (from the list above). We are
using the id 'simple', which is the template for a single population of
constant size over time.
The second parameter is the chromosome structure. Please ignore it for
now, it will be explained in the next section.
The third parameter is a list of all required parameters (recall that
the simple model only needs sample_size and pop_size) and their possible
values (in this case each parameter has only one possible value).
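When a parameter is given more than one possible value, one parameter
file is produced per combination of values. The expansion can be
pictured with itertools.product (this is an illustration of the
behaviour, not Bio.PopGen code):

```python
import itertools

# Each parameter has a list of possible values, exactly as in the third
# argument of generate_simcoal_from_template.
params = [('sample_size', [30]),
          ('pop_size', [100]),
          ('total_demes', [10, 50, 100])]

names = [name for name, values in params]
combos = list(itertools.product(*[values for name, values in params]))
for combo in combos:
    # one parameter file would be generated per combination
    print(dict(zip(names, combo)))
```

With three values for total_demes and one value for each of the other
parameters, three combinations (and thus three files) result, matching
the island-model example below.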
Now, let's consider an example where we want to generate several
island models, and we are interested in varying the number of demes: 10,
50 and 100, with a migration rate of 1%. Sample size and deme size will
be the same as before. Here is the code:
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template

generate_simcoal_from_template('island',
    [(1, [('SNP', [24, 0.0005, 0.0])])],
    [('sample_size', [30]),
     ('pop_size', [100]),
     ('mig', [0.01]),
     ('total_demes', [10, 50, 100])])
In this case, 3 files will be generated: island_100_0.01_100_30.par,
island_10_0.01_100_30.par and island_50_0.01_100_30.par. Notice the rule
to make file names: template name, followed by parameter values in
reverse order.
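The naming rule (template name, then one value per parameter in reverse
order) can be expressed compactly. The helper below is purely
illustrative and not part of Bio.PopGen; the real naming happens inside
generate_simcoal_from_template:

```python
def simcoal_par_name(template, params):
    """Rebuild a SIMCOAL2 .par file name as described above:
    template name, then one value per parameter in reverse order.
    `params` is a list of (name, value) pairs in template order.
    Hypothetical helper, for illustration only."""
    values = [str(value) for name, value in reversed(params)]
    return template + "_" + "_".join(values) + ".par"

print(simcoal_par_name("island",
                       [("sample_size", 30), ("pop_size", 100),
                        ("mig", 0.01), ("total_demes", 100)]))
# island_100_0.01_100_30.par
```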
A few, arguably more esoteric, template demographies exist (please
check the Bio/PopGen/SimCoal/data directory in the Biopython source
tree). Furthermore it is possible for the user to create new templates.
That functionality will be discussed in a future version of this
document.
11.2.1.2 Chromosome structure
------------------------------

We strongly recommend reading the SIMCOAL2 documentation to understand
the full potential available in modeling chromosome structures. In this
subsection we only discuss how to implement chromosome structures using
the Biopython interface, not the underlying SIMCOAL2 capabilities.
We will start by implementing a single chromosome, with 24 SNPs, with a
recombination rate immediately on the right of each locus of 0.0005 and
a minimum frequency of the minor allele of 0. This will be specified by
the following list (to be passed as the second parameter to the function
generate_simcoal_from_template):
<<[(1, [('SNP', [24, 0.0005, 0.0])])]
This is actually the chromosome structure used in the above examples.
The chromosome structure is represented by a list of chromosomes; each
chromosome (i.e., each element in the list) is a tuple (a pair): the
first element is the number of times the chromosome is to be repeated
(as there might be interest in repeating the same chromosome many
times). The second element is a list of the actual components of the
chromosome. Each such element is again a pair: the first member is the
locus type and the second member the parameters for that locus type.
Confused? Before showing more examples let's review the example above:
we have a list with one element (thus one chromosome), the chromosome is
a single instance (therefore not to be repeated), it is composed of 24
SNPs, with a recombination rate of 0.0005 between each consecutive SNP,
and the minimum frequency of the minor allele is 0.0 (i.e., it can be
absent from a certain population).
Let's see a more complicated example:
<<[
  (5, [
       ('SNP', [24, 0.0005, 0.0])
      ]
  ),
  (2, [
       ('DNA', [10, 0.0, 0.00005, 0.33]),
       ('RFLP', [1, 0.0, 0.0001]),
       ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
      ]
  )
]
We start by having 5 chromosomes with the same structure as above
(i.e., 24 SNPs). We then have 2 chromosomes which have a DNA sequence
with 10 nucleotides, 0.0 recombination rate, 0.00005 mutation rate and a
transition rate of 0.33. Then we have an RFLP with 0.0 recombination
rate to the next locus and a 0.0001 mutation rate. Finally we have a
microsatellite (or STR), with 0.0 recombination rate to the next locus
(note that, as this is a single microsatellite with no loci following,
this recombination rate is irrelevant), a mutation rate of 0.001, a
geometric parameter of 0.0 and a range constraint of 0.0 (for
information about these parameters please consult the SIMCOAL2
documentation; you can use them to simulate various mutation models,
including the typical -- for microsatellites -- stepwise mutation
model).
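A quick way to sanity-check such a structure is to walk it with plain
Python. The two helpers below are hypothetical (not part of
Bio.PopGen); they only exploit the (repeats, locus-list) pairing
described above:

```python
# The "more complicated example" structure from above.
chrom_structure = [
    (5, [('SNP', [24, 0.0005, 0.0])]),
    (2, [('DNA', [10, 0.0, 0.00005, 0.33]),
         ('RFLP', [1, 0.0, 0.0001]),
         ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])]),
]

def count_chromosomes(structure):
    # first element of each pair: how many times that chromosome repeats
    return sum(repeats for repeats, loci in structure)

def count_locus_blocks(structure):
    # number of locus blocks, counting repeated chromosomes
    return sum(repeats * len(loci) for repeats, loci in structure)

print(count_chromosomes(chrom_structure))   # 5 + 2 = 7
print(count_locus_blocks(chrom_structure))  # 5*1 + 2*3 = 11
```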
11.2.2 Running SIMCOAL2
========================

We now discuss how to run SIMCOAL2 from inside Biopython. It is
required that the binary for SIMCOAL2 is called simcoal2 (or
simcoal2.exe on Windows based platforms); please note that the typical
name when downloading the program is in the format simcoal2_x_y. As
such, when installing SIMCOAL2 you will need to rename the downloaded
executable so that Biopython can find it.
It is possible to run SIMCOAL2 on files that were not generated using
the method above (e.g., writing a parameter file by hand), but we will
show an example by creating a model using the framework presented above.
<<from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template
from Bio.PopGen.SimCoal.Controller import SimCoalController

generate_simcoal_from_template('simple',
    [(5, [
          ('SNP', [24, 0.0005, 0.0])
         ]
     ),
     (2, [
          ('DNA', [10, 0.0, 0.00005, 0.33]),
          ('RFLP', [1, 0.0, 0.0001]),
          ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
         ]
     )],
    [('sample_size', [30]),
     ('pop_size', [100])])

ctrl = SimCoalController('.')
ctrl.run_simcoal('simple_100_30.par', 50)
The lines of interest are the last two (plus the new import). Firstly
a controller for the application is created. The directory where the
binary is located has to be specified.
The simulator is then run on the last line: we know, from the rules
explained above, that the input file name is simple_100_30.par for the
simulation parameter file created. We then specify that we want to run
50 independent simulations. By default Biopython requests a simulation
of diploid data, but a third parameter can be added to simulate haploid
data (adding as a parameter the string '0'). SIMCOAL2 will now run
(please note that this can take quite a lot of time) and will create a
directory with the simulation results. The results can now be analysed
(typically by studying the data with Arlequin3). In the future Biopython
might support reading the Arlequin3 format and thus allow for the
analysis of SIMCOAL2 data inside Biopython.
11.3 Other applications
*=*=*=*=*=*=*=*=*=*=*=*=

Here we discuss interfaces and utilities to deal with population
genetics applications which arguably have a smaller user base.
11.3.1 FDist: Detecting selection and molecular adaptation
===========================================================

FDist is a selection detection application suite based on computing
(i.e. simulating) a "neutral" confidence interval based on F_st and
heterozygosity. Markers (which can be SNPs, microsatellites, AFLPs,
among others) which lie outside the "neutral" interval are to be
considered as possible candidates for being under selection.
FDist is mainly used when the number of markers is considered enough
to estimate an average F_st, but not enough to either have outliers
calculated from the dataset directly or, with even more markers for
which the relative positions in the genome are known, to use approaches
based on, e.g., Extended Haplotype Heterozygosity (EHH).
The typical usage pattern for FDist is as follows:

1. Import a dataset from an external format into FDist format.
2. Compute the average F_st. This is done by datacal inside FDist.
3. Simulate "neutral" markers based on the average F_st and the
   expected number of total populations. This is the core operation,
   done by fdist inside FDist.
4. Calculate the confidence interval, based on the desired confidence
   boundaries (typically 95% or 99%). This is done by cplot and is
   mainly used to plot the interval.
5. Assess each marker status against the simulation "neutral"
   confidence interval. Done by pv. This is used to detect the outlier
   status of each marker against the simulation.
We will now discuss each step with illustrating example code (for this
example to work, the FDist binaries have to be on the executable PATH).
The FDist data format is application specific and is not used at all
by other applications; as such you will probably have to convert your
data for use with FDist. Biopython can help you do this. Here is an
example converting from GenePop format to FDist format (along with
imports that will be needed in examples further below):
<<from Bio.PopGen import GenePop
from Bio.PopGen import FDist
from Bio.PopGen.FDist import Controller
from Bio.PopGen.FDist.Utils import convert_genepop_to_fdist

gp_rec = GenePop.parse(open("example.gen"))
fd_rec = convert_genepop_to_fdist(gp_rec)
in_file = open("infile", "w")
in_file.write(str(fd_rec))
in_file.close()
In this code we simply parse a GenePop file and convert it to an FDist
record.
Printing an FDist record will generate a string that can be directly
saved to a file and supplied to FDist. FDist requires the input file to
be called infile, therefore we save the record on a file with that name.
The most important fields on an FDist record are: num_pops, the number
of populations; num_loci, the number of loci; and loci_data, with the
marker data itself. Most probably the details of the record are of no
interest to the user, as the record's only purpose is to be passed to
FDist.
The next step is to calculate the average F_st of the dataset (along
with the sample size):
<<ctrl = Controller.FDistController()
fst, samp_size = ctrl.run_datacal()

On the first line we create an object to control the call of the FDist
suite; this object will be used further on in order to call other suite
applications.
On the second line we call the datacal application, which computes the
average F_st and the sample size. It is worth noting that the F_st
computed by datacal is a variation of Weir and Cockerham's theta.
We can now call the main fdist application in order to simulate neutral
markers:
<<sim_fst = ctrl.run_fdist(npops = 15, nsamples = fd_rec.num_pops, fst = fst,
                         sample_size = samp_size, mut = 0, num_sims = 40000)

npops Number of populations existing in nature. This is really a
  "guesstimate". Has to be lower than 100.
nsamples Number of populations sampled, has to be lower than npops.
fst Average F_st.
sample_size Average number of individuals sampled on each population.
mut Mutation model: 0 - infinite alleles; 1 - stepwise mutations.
num_sims Number of simulations to perform. Typically a number around
  40000 will be OK, but if you get a confidence interval that looks
  sharp (this can be detected when plotting the confidence interval
  computed below) the value can be increased (a suggestion would be
  steps of 10000 simulations).
The confusion in wording between number of samples and sample size
stems from the original application.
A file named out.dat will be created with the simulated
heterozygosities and F_sts; it will have as many lines as the number of
simulations requested.
Note that fdist returns the average F_st that it was capable of
simulating; for more details about this issue please read below the
paragraph on approximating the desired average F_st.
The next (optional) step is to calculate the confidence interval:
<<cpl_interval = ctrl.run_cplot(ci=0.99)
You can only call cplot after having run fdist.
This will calculate the confidence intervals (99% in this case) for a
previous fdist run. A list of quadruples is returned. The first element
represents the heterozygosity, the second the lower bound of the F_st
confidence interval for that heterozygosity, the third the average and
the fourth the upper bound. This can be used to trace the confidence
interval contour. This list is also written to a file, out.cpl.
The main purpose of this step is to return a set of points which can be
easily used to plot a confidence interval. It can be skipped if the
objective is only to assess the status of each marker against the
simulation, which is the next step...
<<pv_data = ctrl.run_pv()
You can only call pv after having run datacal and fdist.
This will use the simulated markers to assess the status of each
individual real marker. A list is returned, in the same order as the
loci_list that is on the FDist record (which is in the same order as
the GenePop record). Each element in the list is a quadruple; the
fundamental member of each quadruple is the last element (regarding the
other elements, please refer to the pv documentation -- for the sake of
simplicity we will not discuss them here), which returns the probability
of the simulated F_st being lower than the marker F_st. Higher values
would indicate a stronger candidate for positive selection, lower values
a candidate for balancing selection, and intermediate values a possible
neutral marker. What is "higher", "lower" or "intermediate" is really a
subjective issue, but taking a "confidence interval" approach and
considering a 95% confidence interval, "higher" would be between 0.95
and 1.0, "lower" between 0.0 and 0.05 and "intermediate" between 0.05
and 0.95.
8400
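The thresholding just described can be written down directly. The
following sketch classifies a marker from its probability value; the
function name and the alpha parameter are our own illustration, not
part of Bio.PopGen:

```python
def marker_status(p, alpha=0.05):
    """Interpret P(simulated F_st < marker F_st) at significance alpha."""
    if p >= 1.0 - alpha:
        # simulated F_st almost always below the marker's F_st
        return "candidate for positive selection"
    if p <= alpha:
        # simulated F_st almost always above the marker's F_st
        return "candidate for balancing selection"
    return "possibly neutral"
```

Applied to the last element of each quadruple in pv_data, this yields
one status string per locus.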
11.3.1.1 Approximating the desired average F_st
------------------------------------------------
Fdist tries to approximate the desired average F_st by doing a
coalescent simulation using migration rates based on the formula

    F_st = 1 / (4Nm + 1)

This formula assumes a few premises, like an infinite number of
populations.

In practice, when the number of populations is low, the mutation model
is stepwise and the sample size increases, fdist will not be able to
simulate an acceptable approximate average F_st.
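The island-model relation above can be inverted directly to see which
migration parameter Nm a target F_st corresponds to; a small sketch
(the helper name is our own, not a Bio.PopGen function):

```python
def island_model_nm(fst):
    # Invert F_st = 1 / (4*N*m + 1) for the migration parameter N*m.
    return (1.0 / fst - 1.0) / 4.0
```

For example, a target average F_st of 0.2 corresponds to Nm = 1, and an
F_st of 0.5 to Nm = 0.25.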
To address that, a function is provided to iteratively approach the
desired value by running several fdist runs in sequence. This approach
is computationally more intensive than a single fdist run, but yields
good results. The following code runs fdist approximating the desired
F_st:
<<sim_fst = ctrl.run_fdist_force_fst(npops = 15, nsamples = fd_rec.num_pops,
    fst = fst, sample_size = samp_size, mut = 0, num_sims = 40000,
    limit = 0.05)
The only new optional parameter compared with run_fdist is limit, which
is the desired maximum error. run_fdist can (and probably should) be
safely replaced with run_fdist_force_fst.
11.3.1.2 Final notes
---------------------
The process to determine the average F_st can be more sophisticated
than the one presented here. For more information we refer you to the
FDist README file. Biopython's code can be used to implement more
sophisticated approaches.
11.4 Future Developments
*=*=*=*=*=*=*=*=*=*=*=*=*
The most desired future developments would be the ones you add
yourself.

That being said, already existing fully functional code is currently
being incorporated into Bio.PopGen; this code covers the applications
FDist and SimCoal2, the HapMap and UCSC Table Browser databases, and
some simple statistics like F_st or allele counts.
Chapter 12 Supervised learning methods
*****************************************
Note that the supervised learning methods described in this chapter all
require Numerical Python (numpy) to be installed.
12.1 The Logistic Regression Model
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
12.1.1 Background and Purpose
==============================
Logistic regression is a supervised learning approach that attempts to
distinguish K classes from each other using a weighted sum of some
predictor variables x_i. The logistic regression model is used to
calculate the weights beta_i of the predictor variables. In Biopython,
the logistic regression model is currently implemented for two classes
only (K = 2); the number of predictor variables has no predefined
limit.

As an example, let's try to predict the operon structure in bacteria.
An operon is a set of adjacent genes on the same strand of DNA that are
transcribed into a single mRNA molecule. Translation of the single mRNA
molecule then yields the individual proteins. For Bacillus subtilis,
whose data we will be using, the average number of genes in an operon
is about 2.4.

As a first step in understanding gene regulation in bacteria, we need
to know the operon structure. For about 10% of the genes in Bacillus
subtilis, the operon structure is known from experiments. A supervised
learning method can be used to predict the operon structure for the
remaining 90% of the genes.
For such a supervised learning approach, we need to choose some
predictor variables x_i that can be measured easily and are somehow
related to the operon structure. One predictor variable might be the
distance in base pairs between genes. Adjacent genes belonging to the
same operon tend to be separated by a relatively short distance,
whereas adjacent genes in different operons tend to have a larger space
between them to allow for promoter and terminator sequences. Another
predictor variable is based on gene expression measurements. By
definition, genes belonging to the same operon have equal gene
expression profiles, while genes in different operons are expected to
have different expression profiles. In practice, the measured
expression profiles of genes in the same operon are not quite identical
due to the presence of measurement errors. To assess the similarity in
the gene expression profiles, we assume that the measurement errors
follow a normal distribution and calculate the corresponding
log-likelihood score.

We now have two predictor variables that we can use to predict if two
adjacent genes on the same strand of DNA belong to the same operon:
- x_1: the number of base pairs between them;
- x_2: their similarity in expression profile.
In a logistic regression model, we use a weighted sum of these two
predictors to calculate a joint score S:

    S = beta_0 + beta_1 x_1 + beta_2 x_2.          (12.1)
The logistic regression model gives us appropriate values for the
parameters beta_0, beta_1, beta_2 using two sets of example genes:

- OP: Adjacent genes, on the same strand of DNA, known to belong to
  the same operon;
- NOP: Adjacent genes, on the same strand of DNA, known to belong to
  different operons.
In the logistic regression model, the probability of belonging to a
class depends on the score via the logistic function. For the two
classes OP and NOP, we can write this as

                          exp(beta_0 + beta_1 x_1 + beta_2 x_2)
    Pr(OP | x_1, x_2)  = -------------------------------------------      (12.2)
                         1 + exp(beta_0 + beta_1 x_1 + beta_2 x_2)

                                          1
    Pr(NOP | x_1, x_2) = -------------------------------------------      (12.3)
                         1 + exp(beta_0 + beta_1 x_1 + beta_2 x_2)
Using a set of gene pairs for which it is known whether they belong
to the same operon (class OP) or to different operons (class NOP), we
can calculate the weights beta_0, beta_1, beta_2 by maximizing the
log-likelihood corresponding to the probability functions (12.2) and
(12.3).
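The quantity being maximized can be sketched in a few lines of plain
Python. This mirrors equations (12.2) and (12.3); it is an illustration
under our own function names, not the Bio.LogisticRegression
implementation:

```python
from math import exp, log

def pr_op(beta, x):
    # Equation (12.2): Pr(OP | x_1, x_2) via the logistic function.
    s = beta[0] + beta[1] * x[0] + beta[2] * x[1]
    return exp(s) / (1.0 + exp(s))

def log_likelihood(beta, xs, ys):
    # Sum of log-probabilities of the observed classes (1 = OP, 0 = NOP);
    # training maximizes this quantity over beta.
    total = 0.0
    for x, y in zip(xs, ys):
        p = pr_op(beta, x)
        total += log(p) if y == 1 else log(1.0 - p)
    return total
```

With all weights zero, every pair gets probability 0.5 for each class,
and the log-likelihood is just len(ys) * log(0.5); training moves beta
away from zero to raise this value.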
12.1.2 Training the logistic regression model
==============================================
--------------------------------------------------------
Table 12.1: Adjacent gene pairs known to belong to the same operon
(class OP) or to different operons (class NOP). Intergene distances are
negative if the two genes overlap.

---------------------------------------------------------------------------
|   Gene pair   |Intergene distance (x_1)|Gene expression score (x_2)|Class|
---------------------------------------------------------------------------
|cotJA --- cotJB|          -53           |          -200.78          | OP  |
| yesK --- yesL |          117           |          -267.14          | OP  |
| lplA --- lplB |           57           |          -163.47          | OP  |
| lplB --- lplC |           16           |          -190.30          | OP  |
| lplC --- lplD |           11           |          -220.94          | OP  |
| lplD --- yetF |           85           |          -193.94          | OP  |
| yfmT --- yfmS |           16           |          -182.71          | OP  |
| yfmF --- yfmE |           15           |          -180.41          | OP  |
| citS --- citT |          -26           |          -181.73          | OP  |
| citM --- yflN |           58           |          -259.87          | OP  |
| yfiI --- yfiJ |          126           |          -414.53          | NOP |
| lipB --- yfiQ |          191           |          -249.57          | NOP |
| yfiU --- yfiV |          113           |          -265.28          | NOP |
| yfhH --- yfhI |          145           |          -312.99          | NOP |
| cotY --- cotX |          154           |          -213.83          | NOP |
| yjoB --- rapA |          147           |          -380.85          | NOP |
| ptsI --- splA |           93           |          -291.13          | NOP |
---------------------------------------------------------------------------
--------------------------------------------------------

Table 12.1 lists some of the Bacillus subtilis gene pairs for which
the operon structure is known. Let's calculate the logistic regression
model from these data:
<<>>> from Bio import LogisticRegression
>>> xs = [[-53, -200.78],
          [117, -267.14],
          [57, -163.47],
          [16, -190.30],
          [11, -220.94],
          [85, -193.94],
          [16, -182.71],
          [15, -180.41],
          [-26, -181.73],
          [58, -259.87],
          [126, -414.53],
          [191, -249.57],
          [113, -265.28],
          [145, -312.99],
          [154, -213.83],
          [147, -380.85],
          [93, -291.13]]
>>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>>> model = LogisticRegression.train(xs, ys)
Here, 'xs' and 'ys' are the training data: 'xs' contains the predictor
variables for each gene pair, and 'ys' specifies if the gene pair
belongs to the same operon ('1', class OP) or different operons ('0',
class NOP). The resulting logistic regression model is stored in
'model', which contains the weights beta_0, beta_1, and beta_2:

<<>>> model.beta
[8.9830290157144681, -0.035968960444850887, 0.02181395662983519]
Note that beta_1 is negative, as gene pairs with a shorter intergene
distance have a higher probability of belonging to the same operon
(class OP). On the other hand, beta_2 is positive, as gene pairs
belonging to the same operon typically have a higher similarity score
of their gene expression profiles. The parameter beta_0 is positive due
to the higher prevalence of operon gene pairs than non-operon gene
pairs in the training data.
The function 'train' has two optional arguments: 'update_fn' and
'typecode'. The 'update_fn' can be used to specify a callback function,
taking as arguments the iteration number and the log-likelihood. With
the callback function, we can for example track the progress of the
model calculation (which uses a Newton-Raphson iteration to maximize
the log-likelihood function of the logistic regression model):

<<>>> def show_progress(iteration, loglikelihood):
        print "Iteration:", iteration, "Log-likelihood function:", loglikelihood

>>> model = LogisticRegression.train(xs, ys, update_fn=show_progress)
Iteration: 0 Log-likelihood function: -11.7835020695
Iteration: 1 Log-likelihood function: -7.15886767672
Iteration: 2 Log-likelihood function: -5.76877209868
Iteration: 3 Log-likelihood function: -5.11362294338
Iteration: 4 Log-likelihood function: -4.74870642433
Iteration: 5 Log-likelihood function: -4.50026077146
Iteration: 6 Log-likelihood function: -4.31127773737
Iteration: 7 Log-likelihood function: -4.16015043396
Iteration: 8 Log-likelihood function: -4.03561719785
Iteration: 9 Log-likelihood function: -3.93073282192
Iteration: 10 Log-likelihood function: -3.84087660929
Iteration: 11 Log-likelihood function: -3.76282560605
Iteration: 12 Log-likelihood function: -3.69425027154
Iteration: 13 Log-likelihood function: -3.6334178602
Iteration: 14 Log-likelihood function: -3.57900855837
Iteration: 15 Log-likelihood function: -3.52999671386
Iteration: 16 Log-likelihood function: -3.48557145163
Iteration: 17 Log-likelihood function: -3.44508206139
Iteration: 18 Log-likelihood function: -3.40799948447
Iteration: 19 Log-likelihood function: -3.3738885624
Iteration: 20 Log-likelihood function: -3.3423876581
Iteration: 21 Log-likelihood function: -3.31319343769
Iteration: 22 Log-likelihood function: -3.2860493346
Iteration: 23 Log-likelihood function: -3.2607366863
Iteration: 24 Log-likelihood function: -3.23706784091
Iteration: 25 Log-likelihood function: -3.21488073614
Iteration: 26 Log-likelihood function: -3.19403459259
Iteration: 27 Log-likelihood function: -3.17440646052
Iteration: 28 Log-likelihood function: -3.15588842703
Iteration: 29 Log-likelihood function: -3.13838533947
Iteration: 30 Log-likelihood function: -3.12181293595
Iteration: 31 Log-likelihood function: -3.10609629966
Iteration: 32 Log-likelihood function: -3.09116857282
Iteration: 33 Log-likelihood function: -3.07696988017
Iteration: 34 Log-likelihood function: -3.06344642288
Iteration: 35 Log-likelihood function: -3.05054971191
Iteration: 36 Log-likelihood function: -3.03823591619
Iteration: 37 Log-likelihood function: -3.02646530573
Iteration: 38 Log-likelihood function: -3.01520177394
Iteration: 39 Log-likelihood function: -3.00441242601
Iteration: 40 Log-likelihood function: -2.99406722296
Iteration: 41 Log-likelihood function: -2.98413867259
The iteration stops once the increase in the log-likelihood function
is less than 0.01. If no convergence is reached after 500 iterations,
the 'train' function returns with an 'AssertionError'.

The optional keyword 'typecode' can almost always be ignored. This
keyword allows the user to choose the type of Numeric matrix to use. In
particular, to avoid memory problems for very large problems, it may be
necessary to use single-precision floats (Float8, Float16, etc.) rather
than double, which is used by default.
12.1.3 Using the logistic regression model for classification
==============================================================
Classification is performed by calling the 'classify' function. Given
a logistic regression model and the values for x_1 and x_2 (e.g. for a
gene pair of unknown operon structure), the 'classify' function returns
'1' or '0', corresponding to class OP and class NOP, respectively. For
example, let's consider the gene pairs yxcE, yxcD and yxiB, yxiA:
--------------------------------------------------------
Table 12.2: Adjacent gene pairs of unknown operon status.

----------------------------------------------------------------
|  Gene pair  |Intergene distance x_1|Gene expression score x_2|
----------------------------------------------------------------
|yxcE --- yxcD|          6           |     -173.143442352      |
|yxiB --- yxiA|         309          |     -271.005880394      |
----------------------------------------------------------------

--------------------------------------------------------
The logistic regression model classifies yxcE, yxcD as belonging to
the same operon (class OP), while yxiB, yxiA are predicted to belong to
different operons (class NOP):

<<>>> print "yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352])
yxcE, yxcD: 1
>>> print "yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394])
yxiB, yxiA: 0

(which, by the way, agrees with the biological literature).
To find out how confident we can be in these predictions, we can call
the 'calculate' function to obtain the probabilities (equations (12.2)
and (12.3)) for class OP and NOP. For yxcE, yxcD we find

<<>>> q, p = LogisticRegression.calculate(model, [6, -173.143442352])
>>> print "class OP: probability =", p, "class NOP: probability =", q
class OP: probability = 0.993242163503 class NOP: probability = 0.00675783649744

and for yxiB, yxiA

<<>>> q, p = LogisticRegression.calculate(model, [309, -271.005880394])
>>> print "class OP: probability =", p, "class NOP: probability =", q
class OP: probability = 0.000321211251817 class NOP: probability = 0.999678788748
To get some idea of the prediction accuracy of the logistic regression
model, we can apply it to the training data:
<<>>> for i in range(len(ys)):
        print "True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i])

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but one of the gene
pairs. A more reliable estimate of the prediction accuracy can be found
from a leave-one-out analysis, in which the model is recalculated from
the training data after removing the gene pair to be predicted:
<<>>> for i in range(len(ys)):
        model = LogisticRegression.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:])
        print "True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i])
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
The leave-one-out analysis shows that the prediction of the logistic
regression model is incorrect for only two of the gene pairs, which
corresponds to a prediction accuracy of 88%.
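The leave-one-out loop above generalizes to any train/classify pair of
functions; here is a small wrapper sketch (our own helper, not part of
Biopython):

```python
def leave_one_out_accuracy(xs, ys, train_fn, classify_fn):
    # Hold out each example in turn, retrain on the rest,
    # and score the prediction for the held-out example.
    correct = 0
    for i in range(len(ys)):
        model = train_fn(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if classify_fn(model, xs[i]) == ys[i]:
            correct += 1
    return correct / float(len(ys))
```

For the example above this would be called as
leave_one_out_accuracy(xs, ys, LogisticRegression.train,
LogisticRegression.classify), returning roughly 0.88.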
12.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
======================================================================================
The logistic regression model is similar to linear discriminant
analysis. In linear discriminant analysis, the class probabilities also
follow equations (12.2) and (12.3). However, instead of estimating the
coefficients beta directly, we first fit a normal distribution to the
predictor variables x. The coefficients beta are then calculated from
the means and covariances of the normal distribution. If the
distribution of x is indeed normal, then we expect linear discriminant
analysis to perform better than the logistic regression model. The
logistic regression model, on the other hand, is more robust to
deviations from normality.

Another similar approach is a support vector machine with a linear
kernel. Such an SVM also uses a linear combination of the predictors,
but estimates the coefficients beta from the predictor variables x near
the boundary region between the classes. If the logistic regression
model (equations (12.2) and (12.3)) is a good description for x away
from the boundary region, we expect the logistic regression model to
perform better than an SVM with a linear kernel, as it relies on more
data. If not, an SVM with a linear kernel may perform better.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman: The Elements of
Statistical Learning. Data Mining, Inference, and Prediction. Springer
Series in Statistics, 2001. Chapter 4.4.
12.2 k-Nearest Neighbors
*=*=*=*=*=*=*=*=*=*=*=*=*
12.2.1 Background and purpose
==============================
The k-nearest neighbors method is a supervised learning approach that
does not need to fit a model to the data. Instead, data points are
classified based on the categories of the k nearest neighbors in the
training data set.

In Biopython, the k-nearest neighbors method is available in
'Bio.kNN'. To illustrate the use of the k-nearest neighbor method in
Biopython, we will use the same operon data set as in Section 12.1.
12.2.2 Initializing a k-nearest neighbors model
================================================
Using the data in Table 12.1, we create and initialize a k-nearest
neighbors model as follows:

<<>>> from Bio import kNN
>>> k = 3
>>> model = kNN.train(xs, ys, k)
where 'xs' and 'ys' are the same as in Section 12.1.2. Here, 'k' is
the number of neighbors k that will be considered for the
classification. For classification into two classes, choosing an odd
number for k lets you avoid tied votes. The function name 'train' is a
bit of a misnomer, since no model training is done: this function
simply stores 'xs', 'ys', and 'k' in 'model'.
12.2.3 Using a k-nearest neighbors model for classification
============================================================
To classify new data using the k-nearest neighbors model, we use the
'classify' function. This function takes a data point (x_1, x_2) and
finds the k nearest neighbors in the training data set 'xs'. The data
point (x_1, x_2) is then classified based on which category ('ys')
occurs most among the k neighbors.
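The voting procedure just described can be sketched in plain Python.
This is an illustrative re-implementation with the default Euclidean
distance and unweighted votes, under our own function name, not the
actual Bio.kNN code:

```python
from collections import Counter

def knn_vote(xs, ys, query, k):
    # Sort training indices by squared Euclidean distance to the query,
    # then take the majority category among the k closest points.
    order = sorted(range(len(xs)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(xs[i], query)))
    votes = Counter(ys[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Squared distances are enough here because sorting by the square gives
the same order as sorting by the distance itself.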
For the example of the gene pairs yxcE, yxcD and yxiB, yxiA, we find:

<<>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x)
yxcE, yxcD: 1
>>> x = [309, -271.005880394]
>>> print "yxiB, yxiA:", kNN.classify(model, x)
yxiB, yxiA: 0
In agreement with the logistic regression model, yxcE, yxcD are
classified as belonging to the same operon (class OP), while yxiB,
yxiA are predicted to belong to different operons.
The 'classify' function lets us specify both a distance function and a
weight function as optional arguments. The distance function affects
which k neighbors are chosen as the nearest neighbors, as these are
defined as the neighbors with the smallest distance to the query point
(x_1, x_2). By default, the Euclidean distance is used. Instead, we
could for example use the city-block (Manhattan) distance:
<<>>> def cityblock(x1, x2):
...    assert len(x1)==2
...    assert len(x2)==2
...    distance = abs(x1[0]-x2[0]) + abs(x1[1]-x2[1])
...    return distance
...
>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x, distance_fn = cityblock)
yxcE, yxcD: 1
The weight function can be used for weighted voting. For example, we
may want to give closer neighbors a higher weight than neighbors that
are further away:
<<>>> from math import exp
>>> def weight(x1, x2):
...    assert len(x1)==2
...    assert len(x2)==2
...    return exp(-abs(x1[0]-x2[0]) - abs(x1[1]-x2[1]))
...
>>> x = [6, -173.143442352]
>>> print "yxcE, yxcD:", kNN.classify(model, x, weight_fn = weight)
yxcE, yxcD: 1
By default, all neighbors are given an equal weight.

To find out how confident we can be in these predictions, we can call
the 'calculate' function, which will calculate the total weight
assigned to the classes OP and NOP. For the default weighting scheme,
this reduces to the number of neighbors in each category. For yxcE,
yxcD, we find:
<<>>> x = [6, -173.143442352]
>>> weight = kNN.calculate(model, x)
>>> print "class OP: weight =", weight[0], "class NOP: weight =", weight[1]
class OP: weight = 0.0 class NOP: weight = 3.0
which means that all three neighbors of the data point (x_1, x_2) are
in the NOP class. As another example, for yesK, yesL we find
<<>>> x = [117, -267.14]
>>> weight = kNN.calculate(model, x)
>>> print "class OP: weight =", weight[0], "class NOP: weight =", weight[1]
class OP: weight = 2.0 class NOP: weight = 1.0
which means that two neighbors are operon pairs and one neighbor is a
non-operon pair.

To get some idea of the prediction accuracy of the k-nearest neighbors
approach, we can apply it to the training data:
<<>>> for i in range(len(ys)):
        print "True:", ys[i], "Predicted:", kNN.classify(model, xs[i])

True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
showing that the prediction is correct for all but two of the gene
pairs. A more reliable estimate of the prediction accuracy can be found
from a leave-one-out analysis, in which the model is recalculated from
the training data after removing the gene pair to be predicted:
<<>>> for i in range(len(ys)):
        model = kNN.train(xs[:i]+xs[i+1:], ys[:i]+ys[i+1:], k)
        print "True:", ys[i], "Predicted:", kNN.classify(model, xs[i])
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
The leave-one-out analysis shows that the k-nearest neighbors model is
correct for 13 out of 17 gene pairs, which corresponds to a prediction
accuracy of 76%.
12.3 Naive Bayes
*=*=*=*=*=*=*=*=*

This section will describe the 'Bio.NaiveBayes' module.
12.4 Maximum Entropy
*=*=*=*=*=*=*=*=*=*=*
This section will describe the 'Bio.MaximumEntropy' module.

12.5 Markov Models
*=*=*=*=*=*=*=*=*=*

This section will describe the 'Bio.MarkovModel' and/or
'Bio.HMM.MarkovModel' modules.
Chapter 13 Graphics including GenomeDiagram
**********************************************
The 'Bio.Graphics' module depends on the third party Python library
ReportLab (1). Although focused on producing PDF files, ReportLab can
also create encapsulated postscript (EPS) and scalable vector graphics
(SVG) files. In addition to these vector based images, provided certain
further dependencies such as the Python Imaging Library (PIL) (2) are
installed, ReportLab can also output bitmap images (including JPEG,
PNG, GIF, BMP and PICT formats).
13.1 GenomeDiagram
*=*=*=*=*=*=*=*=*=*

13.1.1 Introduction
====================
The 'Bio.Graphics.GenomeDiagram' module is a new addition to Biopython
1.50, having previously been available as a separate Python module
dependent on Biopython. GenomeDiagram is described in the
Bioinformatics journal publication Pritchard et al. (2006),
doi:10.1093/bioinformatics/btk021 (3), and there are several example
images and documentation for the old separate version available at
http://bioinf.scri.ac.uk/lp/programs.php#genomediagram.
As the name might suggest, GenomeDiagram was designed for drawing
whole genomes, in particular prokaryotic genomes, either as linear
diagrams (optionally broken up into fragments to fit better) or as
circular wheel diagrams. It also proves well suited to drawing quite
detailed figures for smaller genomes such as phage, plasmids or
mitochondria.

This module is easiest to use if you have your genome loaded as a
'SeqRecord' object containing lots of 'SeqFeature' objects - for
example as loaded from a GenBank file (see Chapters 4 and 5).
13.1.2 Diagrams, tracks, feature-sets and features
===================================================
GenomeDiagram uses a nested set of objects. At the top level, you have
a diagram object representing a sequence (or sequence region) along the
horizontal axis (or circle). A diagram can contain one or more tracks,
shown stacked vertically (or radially on circular diagrams). These will
all have the same length and represent the same sequence region. You
might use one track to show the gene locations, another to show
regulatory regions, and a third track to show the GC percentage.

The most commonly used type of track will contain features, bundled
together in feature-sets. You might choose to use one feature-set for
all your CDS features, and another for tRNA features. This isn't
required - they can all go in the same feature-set, but it makes it
easier to update the properties of just selected features (e.g. make
all the tRNA features red).
There are two main ways to build up a complete diagram. Firstly, the
top down approach, where you create a diagram object, then use its
methods to add track(s), use the track methods to add feature-set(s),
and use their methods to add the features. Secondly, you can create the
individual objects separately (in whatever order suits your code), and
then combine them.
13.1.3 A top down example
==========================
We're going to draw a whole genome from a 'SeqRecord' object read in
from a GenBank file (see Chapter 5). This example uses the pPCP1
plasmid from Yersinia pestis biovar Microtus; the file is included with
the Biopython unit tests under the GenBank folder, or online as
NC_005816.gb (4) from our website.
<<from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"), "genbank")
We're using a top down approach, so after loading in our sequence we
next create an empty diagram, then add an (empty) track, and to that
add an (empty) feature set:

<<gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
Now the fun part - we take each gene 'SeqFeature' object in our
'SeqRecord', and use it to generate a feature on the diagram. We're
going to color them blue, alternating between a dark blue and a light
blue.

<<for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
Now we come to actually making the output file. This happens in two
steps: first we call the 'draw' method, which creates all the shapes
using ReportLab objects; then we call the 'write' method, which renders
these to the requested file format. Note you can output in multiple
file formats:
<<gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4',
                fragments=4, start=0, end=len(record))
gd_diagram.write("plasmid_linear.pdf", "PDF")
gd_diagram.write("plasmid_linear.eps", "EPS")
gd_diagram.write("plasmid_linear.svg", "SVG")
Also, provided you have the dependencies installed, you can output
bitmaps, for example:

<<gd_diagram.write("plasmid_linear.png", "PNG")
*images/plasmid_linear.png*

Notice that the 'fragments' argument, which we set to four, controls
how many pieces the genome gets broken up into.
If you want to do a circular figure, then try this:

<<gd_diagram.move_track(1,3) # move track to make an empty space in the middle
gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm,20*cm),
                start=0, end=len(record))
gd_diagram.write("plasmid_circular.pdf", "PDF")
*images/plasmid_circular.png*

These figures are not very exciting, but we've only just got started.
13.1.4 A bottom up example
===========================
Now let's produce exactly the same figures, but using the bottom up
approach. This means we create the different objects directly (and this
can be done in almost any order) and then combine them.
<<from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"), "genbank")

#Create the feature set and its feature objects,
gd_feature_set = GenomeDiagram.FeatureSet()
for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
#(this for loop is the same as in the previous example)

#Create a track, and a diagram
gd_track_for_features = GenomeDiagram.Track(name="Annotated Features")
gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")

#Now have to glue the bits together...
gd_track_for_features.add_set(gd_feature_set)
gd_diagram.add_track(gd_track_for_features, 1)
You can now call the 'draw' and 'write' methods as before to produce a
linear or circular diagram, using the code at the end of the top-down
example above. The figures should be identical.
13.1.5 Features without a SeqFeature
=====================================
In the above example we used a 'SeqRecord''s 'SeqFeature' objects to
build our diagram (see also Section 4.3). Sometimes you won't have
'SeqFeature' objects, but just the coordinates for a feature you want
to draw. You have to create a minimal 'SeqFeature' object, but this is
easy:

<<from Bio.SeqFeature import SeqFeature, FeatureLocation
my_seq_feature = SeqFeature(FeatureLocation(50,100),strand=+1)
For strand, use +1 for the forward strand, -1 for the reverse strand,
and None for both. Here is a short self-contained example:
<<from Bio.SeqFeature import SeqFeature, FeatureLocation
9237
from Bio.Graphics import GenomeDiagram
9238
from reportlab.lib.units import cm
9240
gdd = GenomeDiagram.Diagram('Test Diagram')
9241
gdt_features = gdd.new_track(1, greytrack=False)
9242
gds_features = gdt_features.new_set()
9244
#Add three features to show the strand options,
9245
feature = SeqFeature(FeatureLocation(25, 125), strand=+1)
9246
gds_features.add_feature(feature, name="Forward", label=True)
9247
feature = SeqFeature(FeatureLocation(150, 250), strand=None)
9248
gds_features.add_feature(feature, name="Standless", label=True)
9249
feature = SeqFeature(FeatureLocation(275, 375), strand=-1)
9250
gds_features.add_feature(feature, name="Reverse", label=True)
9252
gdd.draw(format='linear', pagesize=(15*cm,4*cm), fragments=1,
9254
gdd.write("GD_labels_default.pdf", "pdf")
9257
The top part of the image in the next subsection shows the output
9258
(in the default feature color, pale green).
9259
Notice that we have used the name argument here to specify the caption
9260
text for these features. This is discussed in more detail next.
9263
13.1.6 Feature captions
========================

Recall we used the following (where feature was a 'SeqFeature' object)
to add a feature to the diagram:

<<gd_feature_set.add_feature(feature, color=color, label=True)

In the example above the 'SeqFeature' annotation was used to pick a
sensible caption for the features. By default the following possible
entries under the 'SeqFeature' object's qualifiers dictionary are used:
gene, label, name, locus_tag, and product. More simply, you can specify
a name directly:

<<gd_feature_set.add_feature(feature, color=color, label=True, name="My Gene")

In addition to the caption text for each feature's label, you can also
choose the font, position (this defaults to the start of the sigil, you
can also choose the middle or at the end) and orientation (for linear
diagrams only, where this defaults to rotated by 45 degrees):

<<#Large font, parallel with the track
gd_feature_set.add_feature(feature, label=True, color="green",
                           label_size=25, label_angle=0)

#Very small font, perpendicular to the track (towards it)
gd_feature_set.add_feature(feature, label=True, color="purple",
                           label_position="end",
                           label_size=4, label_angle=90)

#Small font, perpendicular to the track (away from it)
gd_feature_set.add_feature(feature, label=True, color="blue",
                           label_position="middle",
                           label_size=6, label_angle=-90)

Combining each of these three fragments with the complete example in
the previous section should give something like this:

*images/GD_sigil_labels.png*

We've not shown it here, but you can also set label_color to control
the label's color (used in Section 13.1.8).
You'll notice the default font is quite small - this makes sense
because you will usually be drawing many (small) features on a page, not
just a few large ones as shown here.
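The default caption lookup described above (gene, then label, name, locus_tag, and product) is easy to mimic in plain Python. This is an illustrative sketch of that selection logic only, not GenomeDiagram's actual implementation, and the example qualifier values are made up:

```python
def pick_caption(qualifiers):
    """Pick a feature caption: try the qualifier keys in priority
    order and use the first one present.
    Illustrative sketch only - not GenomeDiagram's actual code."""
    for key in ("gene", "label", "name", "locus_tag", "product"):
        if key in qualifiers:
            value = qualifiers[key]
            # GenBank qualifier values are usually lists of strings
            return value[0] if isinstance(value, list) else value
    return ""  # no suitable caption available

# Example with a GenBank-style qualifiers dictionary (made-up values):
caption = pick_caption({"locus_tag": ["YP_pPCP05"], "product": ["pesticin"]})
```

Because locus_tag outranks product in the priority list, the caption here would be the locus tag.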
13.1.7 Feature sigils
======================

The examples above have all just used the default sigil for the
feature, a plain box, but you can also use arrows instead. Note this
wasn't available in the last publicly released standalone version of
GenomeDiagram.

<<#Default uses a BOX sigil
gd_feature_set.add_feature(feature)

#You can make this explicit:
gd_feature_set.add_feature(feature, sigil="BOX")

#Or opt for an arrow:
gd_feature_set.add_feature(feature, sigil="ARROW")

The default arrows are shown at the top of the next two images. The
arrows fit into a bounding box (as given by the default BOX sigil).
There are two additional options to adjust the shapes of the arrows,
firstly the thickness of the arrow shaft, given as a proportion of the
height of the bounding box:

<<#Full height shafts, giving pointed boxes:
gd_feature_set.add_feature(feature, sigil="ARROW", color="brown",
                           arrowshaft_height=1.0)
#Or, thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="teal",
                           arrowshaft_height=0.2)
#Or, very thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="darkgreen",
                           arrowshaft_height=0.1)

The results are shown below:

*images/GD_sigil_arrow_shafts.png*

Secondly, the length of the arrow head - given as a proportion of the
height of the bounding box (defaulting to 0.5, or 50%):

<<#Short arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="blue",
                           arrowhead_length=0.25)
#Or, longer arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="orange",
                           arrowhead_length=1)
#Or, very very long arrow heads (i.e. all head, no shaft):
gd_feature_set.add_feature(feature, sigil="ARROW", color="red",
                           arrowhead_length=10000)

The results are shown below:

*images/GD_sigil_arrow_heads.png*
13.1.8 A nice example
======================

Now let's return to the pPCP1 plasmid from Yersinia pestis biovar
Microtus, and the top down approach used in Section 13.1.3, but take
advantage of the sigil options we've now discussed. This time we'll use
arrows for the genes, and overlay them with strandless features (as
plain boxes) showing the position of some restriction digest sites.

<<from reportlab.lib import colors
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

record = SeqIO.read(open("NC_005816.gb"), "genbank")

gd_diagram = GenomeDiagram.Diagram(record.id)
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()

for feature in record.features:
    if feature.type != "gene" :
        #Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0 :
        color = colors.blue
    else :
        color = colors.lightblue
    gd_feature_set.add_feature(feature, sigil="ARROW",
                               color=color, label=True,
                               label_size = 14, label_angle=0)

#I want to include some strandless features, so for an example
#will use EcoRI recognition sites etc.
for site, name, color in [("GAATTC","EcoRI",colors.green),
                          ("CCCGGG","SmaI",colors.orange),
                          ("AAGCTT","HindIII",colors.red),
                          ("GGATCC","BamHI",colors.purple)] :
    index = 0
    while True :
        index = record.seq.find(site, start=index)
        if index == -1 : break
        feature = SeqFeature(FeatureLocation(index, index+len(site)))
        gd_feature_set.add_feature(feature, color=color, name=name,
                                   label=True, label_size = 10,
                                   label_color=color)
        index += len(site)

gd_diagram.draw(format="linear", pagesize='A4', fragments=4,
                start=0, end=len(record))
gd_diagram.write("plasmid_linear_nice.pdf", "PDF")
gd_diagram.write("plasmid_linear_nice.eps", "EPS")
gd_diagram.write("plasmid_linear_nice.svg", "SVG")

*images/plasmid_linear_nice.png*
13.1.9 Further options
=======================

All the examples so far have used a single track, but you can have
more than one track -- for example show the genes on one, and repeat
regions on another. You can also enable tick marks to show the scale --
after all every graph should show its units.
Also, we have only used the 'FeatureSet' so far. GenomeDiagram also
has a 'GraphSet' which can be used to show line graphs, bar charts and
heat plots (e.g. to show plots of GC% on a track parallel to the
features).
These options are not covered here yet, so for now we refer you to the
User Guide (PDF) (5) included with the standalone version of
GenomeDiagram (but please read the next section first), and the
docstrings.
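As a taster for the kind of data a GC% graph needs, the values are typically supplied as a list of (position, value) points computed over a sliding window. The windowing itself is plain Python; the function name and window size below are our own illustrative choices, not part of the GenomeDiagram API:

```python
def gc_window_data(sequence, window=100):
    """Return (window midpoint, GC percentage) points for consecutive
    non-overlapping windows - the sort of data you could plot on a
    graph track. Illustrative sketch; the window size is arbitrary."""
    data = []
    for start in range(0, len(sequence) - window + 1, window):
        fragment = sequence[start:start + window].upper()
        gc = fragment.count("G") + fragment.count("C")
        data.append((start + window // 2, 100.0 * gc / window))
    return data

# A tiny example with a 4-base window:
points = gc_window_data("GGGGAAAAGGAA", window=4)
# points is [(2, 100.0), (6, 0.0), (10, 50.0)]
```

For a real genome you would call this on str(record.seq) with a much larger window, and hand the resulting list to a graph set.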
13.1.10 Converting old code
============================

If you have old code written using the standalone version of
GenomeDiagram, and you want to switch it over to using the new version
included with Biopython then you will have to make a few changes - most
importantly to your import statements.
Also, the older version of GenomeDiagram used only the UK spellings of
color and center (colour and centre). As part of the integration into
Biopython, both forms can now be used for argument names. However, at
some point in the future the UK spellings may be deprecated.
For example, if you used to have:

<<from GenomeDiagram import GDFeatureSet, GDDiagram
gdd = GDDiagram("An example")

you could just switch the import statements like this:

<<from Bio.Graphics.GenomeDiagram import FeatureSet as GDFeatureSet, \
                                         Diagram as GDDiagram
gdd = GDDiagram("An example")

and hopefully that should be enough. In the long term you might want
to switch to the new names, but you would have to change more of your
code:

<<from Bio.Graphics.GenomeDiagram import FeatureSet, Diagram
gdd = Diagram("An example")

or:

<<from Bio.Graphics import GenomeDiagram
gdd = GenomeDiagram.Diagram("An example")

If you run into difficulties, please ask on the Biopython mailing list
for advice. One catch is that as of Biopython 1.50, we have not yet
included the old module 'GenomeDiagram.GDUtilities'. This included a
number of GC% related functions, which will probably be merged under
'Bio.SeqUtils' later on.
13.2 Chromosomes
*=*=*=*=*=*=*=*=*

The 'Bio.Graphics.BasicChromosome' module allows drawing of simple
chromosomes. Here is a very simple example - for which we'll use
Arabidopsis thaliana.
I first downloaded the five sequenced chromosomes from the NCBI's FTP
site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana and then
parsed them with 'Bio.SeqIO' to find out their lengths. You could use
the GenBank files for this, but it is faster to use the FASTA files for
the whole chromosomes:
<<from Bio import SeqIO
entries = [("Chr I","CHR_I/NC_003070.fna"),
           ("Chr II","CHR_II/NC_003071.fna"),
           ("Chr III","CHR_III/NC_003074.fna"),
           ("Chr IV","CHR_IV/NC_003075.fna"),
           ("Chr V","CHR_V/NC_003076.fna")]
for (name, filename) in entries :
    record = SeqIO.read(open(filename),"fasta")
    print name, len(record)

This gave the lengths of the five chromosomes, which we'll now use in
the following short demonstration of the 'BasicChromosome' module:
<<from Bio.Graphics import BasicChromosome

entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]

max_length = max([length for name, length in entries])

chr_diagram = BasicChromosome.Organism()
for name, length in entries :
    cur_chromosome = BasicChromosome.Chromosome(name)
    #Set the length, adding an extra 20 percent for the telomeres:
    cur_chromosome.scale_num = max_length * 1.2

    #Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = 0.1 * max_length
    cur_chromosome.add(start)

    #Add a body - using bp as the scale length here.
    body = BasicChromosome.ChromosomeSegment()
    body.scale = length
    cur_chromosome.add(body)

    #Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = 0.1 * max_length
    cur_chromosome.add(end)

    #This chromosome is done
    chr_diagram.add(cur_chromosome)

chr_diagram.draw("simple_chrom.pdf", "Arabidopsis thaliana")

This should create a very simple PDF file, shown here:

*images/simple_chrom.png*
This example is deliberately short and sweet. One thing you might
want to try is showing the location of features of interest - perhaps
SNPs or genes. Currently the 'ChromosomeSegment' object doesn't support
sub-segments, which would be one approach. Instead, you would have to
replace the single large segment with lots of smaller segments, maybe
white ones for the boring regions, and colored ones for the regions of
interest.
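The segment-splitting idea just described amounts to turning a list of interesting regions into an alternating list of plain and colored segments. Here is an illustrative sketch of that bookkeeping on plain coordinates; the function and the "boring"/"interest" labels are our own, not part of 'Bio.Graphics.BasicChromosome':

```python
def split_into_segments(length, regions):
    """Turn sorted, non-overlapping (start, end) regions of interest
    into an alternating list of ("boring", size) and ("interest", size)
    segments covering a chromosome of the given length.
    Illustrative sketch only - not part of Bio.Graphics."""
    segments = []
    position = 0
    for start, end in regions:
        if start > position:
            segments.append(("boring", start - position))
        segments.append(("interest", end - start))
        position = end
    if position < length:
        segments.append(("boring", length - position))
    return segments

# One gene at 100-200 and a SNP region at 450-460 on a 1000 bp chromosome:
segs = split_into_segments(1000, [(100, 200), (450, 460)])
```

Each resulting size could then become the scale of a white or colored ChromosomeSegment, and the sizes always add up to the chromosome length.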
-----------------------------------

(1) http://www.reportlab.org

(2) http://www.pythonware.com/products/pil/

(3) http://dx.doi.org/10.1093/bioinformatics/btk021

(4) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb

(5) http://bioinf.scri.ac.uk/lp/downloads/programs/genomediagram/usergu
Chapter 14 Cookbook -- Cool things to do with it
***************************************************

Biopython now has two collections of "cookbook" examples -- this
chapter (which has been included in this tutorial for many years and has
gradually grown), and http://biopython.org/wiki/Category:Cookbook which
is a user contributed collection on our wiki.
We're trying to encourage Biopython users to contribute their own
examples to the wiki. In addition to helping the community, one direct
benefit of sharing an example like this is that you could also get some
feedback on the code from other Biopython users and developers - which
could help you improve all your Python code.
In the long term, we may end up moving all of the examples in this
chapter to the wiki, or elsewhere within the tutorial.
14.1 Working with sequence files
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

This section shows some more examples of sequence input/output, using
the 'Bio.SeqIO' module described in Chapter 5.
14.1.1 Producing randomised genomes
====================================

Let's suppose you are looking at a genome sequence, hunting for some
sequence feature -- maybe extreme local GC% bias, or possible
restriction digest sites. Once you've got your Python code working on
the real genome it may be sensible to try running the same search on
randomised versions of the same genome for statistical analysis (after
all, any "features" you've found could be there just by chance).
For this discussion, we'll use the GenBank file for the pPCP1 plasmid
from Yersinia pestis biovar Microtus. The file is included with the
Biopython unit tests under the GenBank folder, or you can get it from
our website, NC_005816.gb (1). This file contains one and only one
record, so we can read it in as a 'SeqRecord' using the
'Bio.SeqIO.read()' function:
<<from Bio import SeqIO
original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")

So, how can we generate a shuffled version of the original sequence?
I would use the built in Python 'random' module for this, in particular
the function 'random.shuffle' -- but this works on a Python list. Our
sequence is a 'Seq' object, so in order to shuffle it we need to turn it
into a list first:

<<import random
nuc_list = list(original_rec.seq)
random.shuffle(nuc_list) #acts in situ!
Now, in order to use 'Bio.SeqIO' to output the shuffled sequence, we
need to construct a new 'SeqRecord' with a new 'Seq' object using this
shuffled list. In order to do this, we need to turn the list of
nucleotides (single letter strings) into a long string -- the standard
Python way to do this is with the string object's join method.

<<from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
shuffled_rec = SeqRecord(Seq("".join(nuc_list),
                             original_rec.seq.alphabet), \
                         id="Shuffled", \
                         description="Based on %s" % original_rec.id)
Let's put all these pieces together to make a complete Python script
which generates a single FASTA file containing 30 randomly shuffled
versions of the original sequence.
This first version just uses a big for loop and writes out the records
one by one (using the 'SeqRecord''s format method described in
Section 4.5):

<<import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")

handle = open("shuffled.fasta", "w")
for i in range(30) :
    nuc_list = list(original_rec.seq)
    random.shuffle(nuc_list)
    shuffled_rec = SeqRecord(Seq("".join(nuc_list),
                                 original_rec.seq.alphabet), \
                             id="Shuffled%i" % (i+1), \
                             description="Based on %s" % original_rec.id)
    handle.write(shuffled_rec.format("fasta"))
handle.close()
Personally I prefer the following version using a function to shuffle
the record and a generator expression instead of the for loop:

<<import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def make_shuffle_record(record, new_id) :
    nuc_list = list(record.seq)
    random.shuffle(nuc_list)
    return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet), \
                     id=new_id, description="Based on %s" % record.id)

original_rec = SeqIO.read(open("NC_005816.gb"),"genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i+1))
                 for i in range(30))

handle = open("shuffled.fasta", "w")
SeqIO.write(shuffled_recs, handle, "fasta")
handle.close()
14.1.2 Translating a FASTA file of CDS entries
===============================================

Suppose you've got an input file of CDS entries for some organism,
and you want to generate a new FASTA file containing their protein
sequences. i.e. Take each nucleotide sequence from the original file,
and translate it. Back in Section 3.8 we saw how to use the 'Seq'
object's 'translate' method, and the optional 'cds' argument which
enables correct translation of alternative start codons.
We can combine this with 'Bio.SeqIO' as shown in the reverse
complement example in Section 5.4.2. The key point is that for each
nucleotide 'SeqRecord', we need to create a protein 'SeqRecord' - and
take care of naming it.
You can write your own function to do this, choosing suitable protein
identifiers for your sequences, and the appropriate genetic code. In
this example we just use the default table and add a prefix to the
identifier:

<<from Bio.SeqRecord import SeqRecord
def make_protein_record(nuc_record) :
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq = nuc_record.seq.translate(cds=True), \
                     id = "trans_" + nuc_record.id, \
                     description = "translation of CDS, using default table")
We can then use this function to turn the input nucleotide records
into protein records ready for output. An elegant and memory efficient
way to do this is with a generator expression:

<<from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in \
            SeqIO.parse(open("coding_sequences.fasta"), "fasta"))
out_handle = open("translations.fasta", "w")
SeqIO.write(proteins, out_handle, "fasta")
out_handle.close()

This should work on any FASTA file of complete coding sequences. If
you are working on partial coding sequences, you may prefer to use
'nuc_record.seq.translate(to_stop=True)' in the example above, as this
wouldn't check for a valid start codon etc.
14.1.3 Simple quality filtering for FASTQ files
================================================

The FASTQ file format was introduced at Sanger and is now widely used
for holding nucleotide sequencing reads together with their quality
scores. FASTQ files (and the related QUAL files) are an excellent
example of per-letter-annotation, because for each nucleotide in the
sequence there is an associated quality score. Any per-letter-annotation
is held in a 'SeqRecord' in the 'letter_annotations' dictionary as a
list, tuple or string (with the same number of elements as the sequence
length).
One common task is taking a large set of sequencing reads and
filtering them (or cropping them) based on their quality scores. The
following example is very simplistic, but should illustrate the basics
of working with quality data in a 'SeqRecord' object. All we are going
to do here is read in a file of FASTQ data, and filter it to pick out
only those records whose PHRED quality scores are all above some
threshold (here 20).
For this example we'll use some real data downloaded from the NCBI,
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz
(8MB) which unzips to a 23MB file SRR014849.fastq.
<<from Bio import SeqIO
good_reads = (record for record in \
              SeqIO.parse(open("SRR014849.fastq"), "fastq") \
              if min(record.letter_annotations["phred_quality"]) >= 20)

out_handle = open("good_quality.fastq", "w")
count = SeqIO.write(good_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

This pulled out only 412 reads - maybe this dataset hasn't been
quality trimmed yet?
FASTQ files can contain millions of entries, so it is best to avoid
loading them all into memory at once. This example uses a generator
expression, which means only one 'SeqRecord' is created at a time -
avoiding any memory limitations.
14.1.4 Trimming off primer sequences
=====================================

For this example we're going to pretend that GTTGGAACCG is a 3' primer
sequence we want to look for in some FASTQ formatted read data. As in
the example above, we'll use the SRR014849.fastq file downloaded from
the NCBI
(ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz).
The same approach would work with any other supported file format
(e.g. FASTA files).
This code uses 'Bio.SeqIO' with a generator expression (to avoid
loading all the sequences into memory at once), and the 'Seq' object's
'startswith' method to see if the read starts with the primer sequence:

<<from Bio import SeqIO
primer_reads = (record for record in \
                SeqIO.parse(open("SRR014849.fastq"), "fastq") \
                if record.seq.startswith("GTTGGAACCG"))
out_handle = open("with_primer.fastq", "w")
count = SeqIO.write(primer_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

That should find 500 reads from SRR014849.fastq and save them to a new
FASTQ file, with_primer.fastq.
Now suppose that instead you wanted to make a FASTQ file containing
these 500 reads but with the primer sequence removed? That's just a
small change as we can slice the 'SeqRecord' (see Section 4.6) to remove
the first ten letters (the length of our primer):

<<from Bio import SeqIO
trimmed_primer_reads = (record[10:] for record in \
                        SeqIO.parse(open("SRR014849.fastq"), "fastq") \
                        if record.seq.startswith("GTTGGAACCG"))
out_handle = open("with_primer_trimmed.fastq", "w")
count = SeqIO.write(trimmed_primer_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

Again, that should pull out the 500 reads from SRR014849.fastq, but
this time strip off the first ten characters, and save them to another
new FASTQ file, with_primer_trimmed.fastq.
Finally, suppose you want to create a new FASTQ file where these 500
reads have their primer removed, but all the other reads are kept as
they were?

<<from Bio import SeqIO
def trim_primer(record, primer) :
    if record.seq.startswith(primer) :
        return record[len(primer):]
    else :
        return record

trimmed_reads = (trim_primer(record, "GTTGGAACCG") for record in \
                 SeqIO.parse(open("SRR014849.fastq"), "fastq"))
out_handle = open("trimmed.fastq", "w")
count = SeqIO.write(trimmed_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

This takes longer, as this time the output file contains all 94696
reads. Again, we've used a generator expression to avoid any memory
problems. Although it is slower, you might prefer to use a for loop:

<<from Bio import SeqIO
out_handle = open("trimmed.fastq", "w")
for record in SeqIO.parse(open("SRR014849.fastq"),"fastq") :
    if record.seq.startswith("GTTGGAACCG") :
        out_handle.write(record[10:].format("fastq"))
    else :
        out_handle.write(record.format("fastq"))
out_handle.close()

In this case the for loop looks simpler, but putting the trim logic
into a function is more tidy, and makes it easier to adjust the trimming
later on. For example, you might decide to look for a 5' primer as well.
Or, as in the following example, look for the primer anywhere in the
reads - not just at the beginning.
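Handling a 5' primer as well as a 3' primer can be sketched on plain strings; with Biopython you would slice 'SeqRecord' objects in exactly the same way, so the quality scores follow along. The function name and primer sequences here are illustrative assumptions:

```python
def trim_both_primers(seq, five_prime, three_prime):
    """Strip a 5' primer from the start and a 3' primer from the end,
    when present. Illustrative sketch working on plain strings - with
    Biopython you would slice SeqRecord objects the same way."""
    if seq.startswith(five_prime):
        seq = seq[len(five_prime):]
    if seq.endswith(three_prime):
        seq = seq[:-len(three_prime)]
    return seq

# Both (made-up) primers present, leaving just the insert:
trimmed = trim_both_primers("GTTGGAACCGACGTACGTAAGGTT", "GTTGGAACCG", "AAGGTT")
# trimmed is "ACGTACGT"
```

A read without either primer passes through unchanged, matching the behaviour of trim_primer above.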
14.1.5 Trimming off adaptor sequences
======================================

This is essentially a trivial extension to the previous example. We
are going to pretend GTTGGAACCG is an adaptor sequence in some
FASTQ formatted read data, again the SRR014849.fastq file from the NCBI
(ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz).
This time however, we will look for the sequence anywhere in the
reads, not just at the very beginning:

<<from Bio import SeqIO
def trim_adaptor(record, adaptor) :
    """Removes adaptor sequences, looks for perfect matches only."""
    index = record.seq.find(adaptor)
    if index == -1 :
        return record #not found, no trimming
    else :
        return record[index+len(adaptor):]

trimmed_reads = (trim_adaptor(record, "GTTGGAACCG") for record in \
                 SeqIO.parse(open("SRR014849.fastq"), "fastq"))
out_handle = open("trimmed.fastq", "w")
count = SeqIO.write(trimmed_reads, out_handle, "fastq")
out_handle.close()
print "Saved %i reads" % count

Because we are using a FASTQ input file in this example, the
'SeqRecord' objects have per-letter-annotation for the quality scores.
By slicing the 'SeqRecord' object the appropriate scores are used on the
trimmed records, so we can output them as a FASTQ file too.
By changing the format names, you could apply this to FASTA files
instead. This code could also be extended to do a fuzzy match instead of
an exact match (maybe using a pairwise alignment, or taking into account
the read quality scores), but that will be much slower.
14.1.6 Converting FASTQ files
==============================

Back in Section 5.4.1 we showed how to use 'Bio.SeqIO' to convert
between two file formats. Here we'll go into a little more detail
regarding FASTQ files which are used in second generation DNA
sequencing. FASTQ files store both the DNA sequence (as a string) and
the associated read qualities.
PHRED scores (used in some FASTQ files, and also in QUAL files and ACE
files) have become a de facto standard for representing the probability
of a sequencing error (here denoted by P_e) at a given base using a
simple base ten log transformation:

Q_PHRED = - 10 x log10( P_e )    (14.1)

This means a wrong read (P_e = 1) gets a PHRED quality of 0, while a
very good read like P_e = 0.00001 gets a PHRED quality of 50. While for
raw sequencing data qualities higher than this are rare, with post
processing such as read mapping or assembly, qualities of up to about 90
are possible (indeed, the MAQ tool allows for PHRED scores in the range
0 to 93).
9916
storing the letters and quality scores for a sequencing read in a single
9917
plain text file. The only fly in the ointment is that there are at least
9918
three versions of the FASTQ format which are incompatible and difficult
9922
1. The original Sanger FASTQ format uses PHRED qualities encoded with
9923
an ASCII offset of 33. The NCBI are using this format in their Short
9924
Read Archive. We call this the fastq (or fastq-sanger) format in
9926
2. Solexa (later bought by Illumina) introduced their own version
9927
using Solexa qualities encoded with an ASCII offset of 64. We call
9928
this the fastq-solexa format.
9929
3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED
9930
qualities (which is more consistent), but encoded with an ASCII
9931
offset of 64. We call this the fastq-illumina format.
9933
The Solexa quality scores are defined using a different log
transformation:

Q_Solexa = - 10 x log10( P_e / (1 - P_e) )    (14.2)

Given Solexa/Illumina have now moved to using PHRED scores in version
1.3 of their pipeline, the Solexa quality scores will gradually fall out
of use. If you equate the error estimates (P_e) these two equations
allow conversion between the two scoring systems - and Biopython
includes functions to do this in the 'Bio.SeqIO.QualityIO' module, which
are called if you use 'Bio.SeqIO' to convert an old Solexa/Illumina file
into a standard Sanger FASTQ file:

<<from Bio import SeqIO
records = SeqIO.parse(open("solexa.fastq"), "fastq-solexa")
out_handle = open("standard.fastq", "w")
SeqIO.write(records, out_handle, "fastq")
out_handle.close()
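Equating P_e in equations (14.1) and (14.2) gives the mapping between the two quality scales. The following is a plain Python illustration of that algebra only, not the actual 'Bio.SeqIO.QualityIO' implementation:

```python
import math

def phred_from_solexa(q_solexa):
    """Convert a Solexa quality to a PHRED quality by equating P_e in
    equations (14.1) and (14.2). Illustrative sketch only - use
    Bio.SeqIO.QualityIO for real conversions."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)

def solexa_from_phred(q_phred):
    """The inverse mapping (only defined for Q_PHRED > 0)."""
    return 10 * math.log10(10 ** (q_phred / 10.0) - 1)

# For good quality reads the two scales are approximately equal:
q = phred_from_solexa(40)  # about 40.0004
```

The tiny difference at high qualities illustrates why applications can often treat good-quality Solexa and PHRED scores as interchangeable, as discussed for the fastq-solexa and fastq-illumina formats.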
If you want to convert a new Illumina 1.3+ FASTQ file, all that gets
changed is the ASCII offset because although encoded differently the
scores are all PHRED qualities:

<<from Bio import SeqIO
records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
out_handle = open("standard.fastq", "w")
SeqIO.write(records, out_handle, "fastq")
out_handle.close()

For good quality reads, PHRED and Solexa scores are approximately
equal, which means since both the fastq-solexa and fastq-illumina
formats use an ASCII offset of 64 the files are almost the same. This
was a deliberate design choice by Illumina, meaning applications
expecting the old fastq-solexa style files will probably be OK using the
newer fastq-illumina files (on good data). Of course, both variants are
very different from the original fastq standard as used by Sanger, the
NCBI, and elsewhere.
For more details, see the built in help (also online (2)):

<<>>> from Bio.SeqIO import QualityIO
>>> help(QualityIO)
14.1.7 Identifying open reading frames
=======================================

A very simplistic first step at identifying possible genes is to look
for open reading frames (ORFs). By this we mean look in all six frames
for long regions without stop codons -- an ORF is just a region of
nucleotides with no in frame stop codons.
Of course, to find a gene you would also need to worry about locating
a start codon, possible promoters -- and in Eukaryotes there are introns
to worry about too. However, this approach is still useful in viruses
and Prokaryotes.
To show how you might approach this with Biopython, we'll need a
sequence to search, and as an example we'll again use the bacterial
plasmid -- although this time we'll start with a plain FASTA file with
no pre-marked genes: NC_005816.fna (3). This is a bacterial sequence, so
we'll want to use NCBI codon table 11 (see Section 3.8 about
translation):

<<from Bio import SeqIO
record = SeqIO.read(open("NC_005816.fna"),"fasta")
table = 11
min_pro_len = 100
Here is a neat trick using the 'Seq' object's 'split' method to get a
list of all the possible ORF translations in the six reading frames:

<<for strand, nuc in [(+1, record.seq),
                      (-1, record.seq.reverse_complement())] :
    for frame in range(3) :
        for pro in nuc[frame:].translate(table).split("*") :
            if len(pro) >= min_pro_len :
                print "%s...%s - length %i, strand %i, frame %i" \
                      % (pro[:30], pro[-3:], len(pro), strand, frame)
<<GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2
Note that here we are counting the frames from the 5' end (start) of
each strand. It is sometimes easier to always count from the 5' end
(start) of the forward strand.

You could easily edit the above loop based code to build up a list of
the candidate proteins, or convert this to a list comprehension. Now,
one thing this code doesn't do is keep track of where the proteins are.
You could tackle this in several ways. For example, the following code
tracks the locations in terms of the protein counting, and converts back
to the parent sequence by multiplying by three, then adjusting for the
strand and frame:
<<from Bio import SeqIO
record = SeqIO.read(open("NC_005816.gb"),"genbank")
table = 11
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length) :
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())] :
        for frame in range(3) :
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len :
                aa_end = trans.find("*", aa_start)
                if aa_end == -1 :
                    aa_end = trans_len
                if aa_end-aa_start >= min_protein_length :
                    if strand == 1 :
                        start = frame+aa_start*3
                        end = min(seq_len,frame+aa_end*3+3)
                    else :
                        start = seq_len-frame-aa_end*3-3
                        end = seq_len-frame-aa_start*3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end+1
    answer.sort()
    return answer

orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list :
    print "%s...%s - length %i, strand %i, %i:%i" \
          % (pro[:30], pro[-3:], len(pro), strand, start, end)
<<NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1,
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1,
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1,
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1,
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1,
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1,
If you comment out the sort statement, then the protein sequences will
be shown in the same order as before, so you can check this is doing the
same thing. Here we have sorted them by location to make it easier to
compare to the actual annotation in the GenBank file (as visualised in

If however all you want to find are the locations of the open reading
frames, then it is a waste of time to translate every possible codon,
including doing the reverse complement to search the reverse strand too.
All you need to do is search for the possible stop codons (and their
reverse complements). Using regular expressions is an obvious approach
here -- these are an extremely powerful (but rather complex) way of
describing search strings, which are supported in lots of programming
languages and also in command line tools like grep. You can find
whole books about this topic!
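As a rough sketch of that regular expression idea (not from the original
tutorial; the function name is made up, and it is written in
version-agnostic plain Python using only the standard 're' module), here
is how you might locate candidate stop codons and their reverse
complements in a DNA string:

```python
import re

def stop_codon_positions(seq):
    """Find start positions of possible stop codons (TAA, TAG, TGA) and
    of their reverse complements (TTA, CTA, TCA) in a DNA string.

    A zero-width lookahead group is used so that overlapping hits are
    all reported."""
    forward = [m.start() for m in re.finditer(r"(?=(TAA|TAG|TGA))", seq)]
    reverse = [m.start() for m in re.finditer(r"(?=(TTA|CTA|TCA))", seq)]
    return forward, reverse

# Tiny made-up example: a forward TGA at 1, TAG at 6, and a TTA at 12
fwd, rev = stop_codon_positions("ATGAAATAGCCCTTA")
```

From the stop codon positions alone you can then work out the open
reading frames, without translating anything.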
14.2 Sequence parsing plus simple plots
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

This section shows some more examples of sequence parsing, using the
'Bio.SeqIO' module described in Chapter 5, plus the Python library
matplotlib's 'pylab' plotting interface (see the matplotlib website for
a tutorial (4)). Note that to follow these examples you will need
matplotlib installed - but without it you can still try the data parsing
bits.
14.2.1 Histogram of sequence lengths
=====================================

There are lots of times when you might want to visualise the
distribution of sequence lengths in a dataset -- for example the range
of contig sizes in a genome assembly project. In this example we'll
reuse our orchid FASTA file ls_orchid.fasta (5) which has only 94
sequences.

First of all, we will use 'Bio.SeqIO' to parse the FASTA file and
compile a list of all the sequence lengths. You could do this with a for
loop, but I find a list comprehension more pleasing:
<<>>> from Bio import SeqIO
>>> handle = open("ls_orchid.fasta")
>>> sizes = [len(seq_record) for seq_record in SeqIO.parse(handle, "fasta")]
>>> len(sizes), min(sizes), max(sizes)
>>> sizes
[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592]
Now that we have the lengths of all the genes (as a list of integers),
we can use the matplotlib histogram function to display it.

<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
sizes = [len(seq_record) for seq_record in SeqIO.parse(handle, "fasta")]
handle.close()

import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
            % (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()
That should pop up a new window containing the following graph:

*images/hist_plot.png*

Notice that most of these orchid sequences are about 740 bp long,
and there could be two distinct classes of sequence here with a subset
of shorter sequences.

Tip: Rather than using 'pylab.show()' to show the plot in a window,
you can also use 'pylab.savefig(...)' to save the figure to a file (e.g.
as a PNG or PDF file).
14.2.2 Plot of sequence GC%
============================

Another easily calculated quantity of a nucleotide sequence is the
GC%. You might want to look at the GC% of all the genes in a bacterial
genome for example, and investigate any outliers which could have been
recently acquired by horizontal gene transfer. Again, for this example
we'll reuse our orchid FASTA file ls_orchid.fasta (6).

First of all, we will use 'Bio.SeqIO' to parse the FASTA file and
compile a list of all the GC percentages. Again, you could do this with
a for loop, but I prefer the list comprehension used here:

<<from Bio import SeqIO
from Bio.SeqUtils import GC

handle = open("ls_orchid.fasta")
gc_values = [GC(seq_record.seq) for seq_record in SeqIO.parse(handle, "fasta")]
gc_values.sort()
handle.close()
Having read in each sequence and calculated the GC%, we then sorted
them into ascending order. Now we'll take this list of floating point
values and plot them with matplotlib:

<<import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
            % (len(gc_values),min(gc_values),max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()
As in the previous example, that should pop up a new window
containing a graph:

*images/gc_plot.png*

If you tried this on the full set of genes from one organism, you'd
probably get a much smoother plot than this.
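For readers without Biopython to hand, the GC calculation itself is
simple to sketch in plain Python. This is a hypothetical stand-in for
'Bio.SeqUtils.GC', not the real function -- for one thing, it ignores
ambiguous nucleotide codes such as S (G or C), which the Biopython
version counts:

```python
def gc_percent(seq):
    """Return the GC percentage of a nucleotide string.

    Simplified sketch of Bio.SeqUtils.GC: counts only unambiguous
    G and C letters."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

# Sorted GC% values for a few toy sequences, mirroring the plot above
values = sorted(gc_percent(s) for s in ["ATGC", "GGCC", "ATAT"])
```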
14.2.3 Nucleotide dot plots
============================

A dot plot is a way of visually comparing two nucleotide sequences
for similarity to each other. A sliding window is used to compare short
sub-sequences to each other, often with a mis-match threshold. Here for
simplicity we'll only look for perfect matches (shown in black in the
plot below).

To start off, we'll need two sequences. For the sake of argument,
we'll just take the first two from our orchid FASTA file
ls_orchid.fasta (7):
<<from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
rec_one = record_iterator.next()
rec_two = record_iterator.next()
handle.close()
We're going to show two approaches. Firstly, a simple naive
implementation which compares all the window sized sub-sequences to each
other to compile a similarity matrix. You could construct a matrix or
array object, but here we just use a list of lists of booleans created
with a nested list comprehension:

<<window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
data = [[(seq_one[i:i+window] <> seq_two[j:j+window]) \
        for j in range(len(seq_one)-window)] \
        for i in range(len(seq_two)-window)]
Note that we have not checked for reverse complement matches here. Now
we'll use the matplotlib's 'pylab.imshow()' function to display this
data, first requesting the gray color scheme so this is done in black
and white:

<<pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" \
            % window)
pylab.show()
That should pop up a new window containing a graph like this:

*images/dot_plot.png*

As you might have expected, these two sequences are very similar
with a partial line of window sized matches along the diagonal. There
are no off diagonal matches which would be indicative of inversions or
other interesting events.
The above code works fine on small examples, but there are two
problems applying this to larger sequences, which we will address below.
First of all, this brute force approach to the all against all
comparisons is very slow. Instead, we'll compile dictionaries mapping
the window sized sub-sequences to their locations, and then take the set
intersection to find those sub-sequences found in both sequences. This
uses more memory, but is much faster. Secondly, the 'pylab.imshow()'
function is limited in the size of matrix it can display. As an
alternative, we'll use the 'pylab.scatter()' function.

We start by creating dictionaries mapping the window-sized
sub-sequences to locations:
<<dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
                            (str(rec_two.seq).upper(), dict_two)] :
    for i in range(len(seq)-window) :
        section = seq[i:i+window]
        try :
            section_dict[section].append(i)
        except KeyError :
            section_dict[section] = [i]

#Now find any sub-sequences found in both sequences
#(Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print "%i unique matches" % len(matches)
In order to use the 'pylab.scatter()' we need separate lists for the x
and y co-ordinates:

<<#Create lists of x and y co-ordinates for scatter plot
x = []
y = []
for section in matches :
    for i in dict_one[section] :
        for j in dict_two[section] :
            x.append(i)
            y.append(j)
We are now ready to draw the revised dot plot as a scatter plot:

<<pylab.cla() #clear any prior graph
pylab.gray()
pylab.scatter(x,y)
pylab.xlim(0, len(rec_one)-window)
pylab.ylim(0, len(rec_two)-window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" \
            % window)
pylab.show()
That should pop up a new window containing a graph like this:

*images/dot_plot_scatter.png*

Personally I find this second plot much easier to read! Again note
that we have not checked for reverse complement matches here -- you
could extend this example to do this, and perhaps plot the forward
matches in one color and the reverse matches in another.
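The dictionary-and-set-intersection idea above can be illustrated on
plain strings. This is a toy sketch with a made-up helper name, written
in version-agnostic plain Python rather than the Python 2 style used in
the code above:

```python
def window_matches(seq_one, seq_two, window):
    """Map every window-sized sub-sequence to its start positions in
    each sequence, intersect the two key sets, then expand back into
    (x, y) co-ordinate pairs suitable for a scatter plot."""
    dict_one = {}
    dict_two = {}
    for seq, section_dict in [(seq_one.upper(), dict_one),
                              (seq_two.upper(), dict_two)]:
        for i in range(len(seq) - window + 1):
            section_dict.setdefault(seq[i:i + window], []).append(i)
    matches = set(dict_one).intersection(dict_two)
    return sorted((i, j) for section in matches
                  for i in dict_one[section]
                  for j in dict_two[section])

# "TTAC" and "TACA" occur in both toy sequences
pairs = window_matches("GATTACA", "TTACAGA", 4)
```

Note this sketch keeps the final window with 'len(seq) - window + 1';
only the shared sub-sequences are ever expanded into co-ordinates, which
is why this approach is so much faster than the all-against-all matrix.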
14.2.4 Plotting the quality scores of sequencing read data
===========================================================

If you are working with second generation sequencing data, you may
want to try plotting the quality data. Here is an example using two
FASTQ files containing paired end reads, SRR001666_1.fastq for the
forward reads, and SRR001666_2.fastq for the reverse reads. These were
downloaded from the NCBI short read archive (SRA) FTP site (see the
SRR001666 page (8) and
ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/ for details).

In the following code the 'pylab.subplot(...)' function is used in
order to show the forward and reverse qualities on two subplots, side by
side. There is also a little bit of code to only plot the first fifty
reads.
<<import pylab
from Bio import SeqIO
for subfigure in [1,2] :
    filename = "SRR001666_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i,record in enumerate(SeqIO.parse(open(filename), "fastq")) :
        if i >= 50 : break #trick!
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
You should note that we are using the 'Bio.SeqIO' format name fastq
here because the NCBI has saved these reads using the standard Sanger
FASTQ format with PHRED scores. However, as you might guess from the
read lengths, this data was from an Illumina Genome Analyzer and was
probably originally in one of the two Solexa/Illumina FASTQ variant file
formats.

This example uses the 'pylab.savefig(...)' function instead of
'pylab.show(...)', but as mentioned before both are useful. Here is
the result:

*images/SRR001666.png*
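If you are curious where those PHRED quality values come from, a Sanger
FASTQ quality string encodes each score as a printable ASCII character
with an offset of 33. A minimal decoder (hypothetical helper name, not
part of Biopython) looks like this:

```python
def sanger_phred_qualities(quality_string):
    """Decode a Sanger FASTQ quality string: each character encodes
    PHRED score = ASCII code - 33. These are the same values Bio.SeqIO
    exposes as record.letter_annotations["phred_quality"]."""
    return [ord(letter) - 33 for letter in quality_string]

# "!" is PHRED 0 and "I" is PHRED 40
quals = sanger_phred_qualities("!+5?I")
```

The Solexa/Illumina variants mentioned above use a different offset (and
in one case a different score definition), which is exactly why getting
the 'Bio.SeqIO' format name right matters.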
14.3 Dealing with alignments
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*

This section can be seen as a follow on to Chapter 6.

14.3.1 Calculating summary information
=======================================

Once you have an alignment, you are very likely going to want to find
out information about it. Instead of trying to have all of the functions
that can generate information about an alignment in the alignment object
itself, we've tried to separate out the functionality into separate
classes, which act on the alignment.

Getting ready to calculate summary information about an object is
quick to do. Let's say we've got an alignment object called 'alignment',
for example read in using 'Bio.AlignIO.read(...)' as described in
Chapter 6. All we need to do to get an object that will calculate
summary information is:
<<from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)

The 'summary_align' object is very useful, and will do the following
neat things for you:

1. Calculate a quick consensus sequence -- see section 14.3.2
2. Get a position specific score matrix for the alignment -- see
section 14.3.3
3. Calculate the information content for the alignment -- see
section 14.3.4
4. Generate information on substitutions in the alignment --
section 14.4 details using this to generate a substitution matrix.
14.3.2 Calculating a quick consensus sequence
==============================================

The 'SummaryInfo' object, described in section 14.3.1, provides
functionality to calculate a quick consensus of an alignment. Assuming
we've got a 'SummaryInfo' object called 'summary_align' we can calculate
a consensus by doing:

<<consensus = summary_align.dumb_consensus()

As the name suggests, this is a really simple consensus calculator,
and will just add up all of the residues at each point in the consensus,
and if the most common value is higher than some threshold value will
add the common residue to the consensus. If it doesn't reach the
threshold, it adds an ambiguity character to the consensus. The returned
consensus object is a Seq object whose alphabet is inferred from the
alphabets of the sequences making up the consensus. So doing a 'print
consensus' would give:

Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can adjust how 'dumb_consensus' works by passing optional
parameters:

the threshold
    This is the threshold specifying how common a particular residue
    has to be at a position before it is added. The default is 0.7
    (meaning 70%).

the ambiguous character
    This is the ambiguity character to use. The default is 'N'.

the consensus alphabet
    This is the alphabet to use for the consensus sequence. If an
    alphabet is not specified then we will try to guess the alphabet
    based on the alphabets of the sequences in the alignment.
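The thresholded-majority idea behind 'dumb_consensus' can be sketched in
a few lines of plain Python. This is a simplified illustration, not the
actual Biopython implementation (which also handles alphabets, and
treats ties differently):

```python
def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Column-by-column consensus in the spirit of dumb_consensus:
    take the most common residue in each column if its frequency
    reaches the threshold, otherwise the ambiguity character."""
    consensus = []
    for column in zip(*sequences):
        best = max(set(column), key=column.count)
        if column.count(best) / float(len(column)) >= threshold:
            consensus.append(best)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)

# Third column is G,G,C: 2/3 is below the 0.7 threshold, so we get N
result = simple_consensus(["ATGC", "ATGC", "ATCC"])
```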
14.3.3 Position Specific Score Matrices
========================================

Position specific score matrices (PSSMs) summarize the alignment
information in a different way than a consensus, and may be useful for
different tasks. Basically, a PSSM is a count matrix. For each column in
the alignment, the number of occurrences of each alphabet letter is
counted and totaled. The totals are displayed relative to some
representative sequence along the left axis. This sequence may be the
consensus sequence, but can also be any sequence in the alignment.
Let's assume we've got an alignment object called 'c_align'. To get a
PSSM with the consensus sequence along the side we first get a summary
object and calculate the consensus sequence:

<<summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()

Now, we want to make the PSSM, but ignore any 'N' ambiguity residues
when calculating this:

<<my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                    chars_to_ignore = ['N'])

Two notes should be made about this:
1. To maintain strictness with the alphabets, you can only include
characters along the top of the PSSM that are in the alphabet of the
alignment object. Gaps are not included along the top axis of the
PSSM.

2. The sequence passed to be displayed along the left side of the
axis does not need to be the consensus. For instance, if you wanted
to display the second sequence in the alignment along this axis, you
would need to do:

<<second_seq = alignment.get_seq_by_num(1)
my_pssm = summary_align.pos_specific_score_matrix(second_seq,
                                                  chars_to_ignore = ['N'])

The command above returns a 'PSSM' object. To print out the PSSM as we
showed above, we simply need to do a 'print my_pssm', which gives:
You can access any element of the PSSM by subscripting like
'your_pssm[sequence_number][residue_count_name]'. For instance, to get
the counts for the 'A' residue in the second element of the above PSSM
you would do:

<<>>> print my_pssm[1]["A"]

The structure of the PSSM class hopefully makes it easy both to access
elements and to pretty print the matrix.
14.3.4 Information Content
===========================

A potentially useful measure of evolutionary conservation is the
information content of a sequence.

A useful introduction to information theory targeted towards
molecular biologists can be found at
http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we
will be looking at the information content of a consensus sequence, or a
portion of a consensus sequence. We calculate information content at a
particular column in a multiple sequence alignment using the following
formula:

IC_j = sum_{i=1}^{N_a} P_ij log(P_ij / Q_i)

where:

- IC_j -- The information content for the j-th column in an
alignment.

- N_a -- The number of letters in the alphabet.

- P_ij -- The frequency of a particular letter i in the j-th column
(i. e. if G occurred 3 out of 6 times in an alignment column, this
would be 0.5).

- Q_i -- The expected frequency of a letter i. This is an optional
argument, usage of which is left at the user's discretion. By
default, it is automatically assigned to 0.05 = 1/20 for a protein
alphabet, and 0.25 = 1/4 for a nucleic acid alphabet. This is for
getting the information content without any assumption of prior
distributions. When assuming priors, or when using a non-standard
alphabet, you should supply the values for Q_i.
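Putting the formula together for a single column (a plain Python sketch
assuming a uniform nucleotide Q_i of 0.25, and base 2 logarithms as used
by default below; not Biopython's actual implementation):

```python
from math import log

def column_information_content(column, expected_freq=0.25):
    """Shannon information content of one alignment column:
    IC_j = sum_i P_ij * log2(P_ij / Q_i), with uniform expected
    frequency Q_i for every letter."""
    ic = 0.0
    for letter in set(column):
        p = column.count(letter) / float(len(column))
        ic += p * log(p / expected_freq, 2)
    return ic

# Half G, half A gives 1 bit; a fully conserved column gives 2 bits
ic_mixed = column_information_content("GGGAAA")
ic_conserved = column_information_content("GGGGGG")
```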
Well, now that we have an idea what information content is being
calculated in Biopython, let's look at how to get it for a particular
region of the alignment.

First, we need to use our alignment to get an alignment summary object,
which we'll assume is called 'summary_align' (see section 14.3.1 for
instructions on how to get this). Once we've got this object, calculating
the information content for a region is as easy as:

<<info_content = summary_align.information_content(5, 30,
                                                   chars_to_ignore = ['N'])

Wow, that was much easier than the formula above made it look! The
variable 'info_content' now contains a float value specifying the
information content over the specified region (from 5 to 30 of the
alignment). We specifically ignore the ambiguity residue 'N' when
calculating the information content, since this value is not included in
our alphabet (so we shouldn't be interested in looking at it!).

As mentioned above, we can also calculate relative information content
by supplying the expected frequencies:
The expected frequencies should not be passed as a raw dictionary, but
instead be passed as a 'SubsMat.FreqTable' object (see section 16.2.2
for more information about FreqTables). The FreqTable object provides a
standard for associating the dictionary with an Alphabet, similar to how
the Biopython Seq class works.

To create a FreqTable object from the frequency dictionary you just
need to do:

<<from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable

e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)
Now that we've got that, calculating the relative information content
for our region of the alignment is as simple as:

<<info_content = summary_align.information_content(5, 30,
                                                   e_freq_table = e_freq_table,
                                                   chars_to_ignore = ['N'])

Now, 'info_content' will contain the relative information content over
the region in relation to the expected frequencies.

The value returned is calculated using base 2 as the logarithm base in
the formula above. You can modify this by passing the parameter
'log_base' as the base you want:

<<info_content = summary_align.information_content(5, 30, log_base = 10,
                                                   e_freq_table = e_freq_table,
                                                   chars_to_ignore = ['N'])

Well, now you are ready to calculate information content. If you want
to try applying this to some real life problems, it would probably be
best to dig into the literature on information content to get an idea of
how it is used. Hopefully your digging won't reveal any mistakes made in
coding this function!
14.4 Substitution Matrices
*=*=*=*=*=*=*=*=*=*=*=*=*=*

Substitution matrices are an extremely important part of everyday
bioinformatics work. They provide the scoring terms for classifying how
likely two different residues are to substitute for each other. This is
essential in doing sequence comparisons. The book "Biological Sequence
Analysis" by Durbin et al. provides a really nice introduction to
Substitution Matrices and their uses. Some famous substitution matrices
are the PAM and BLOSUM series of matrices.

Biopython provides a ton of common substitution matrices, and also
provides functionality for creating your own substitution matrices.

14.4.1 Using common substitution matrices
==========================================
14.4.2 Creating your own substitution matrix from an alignment
===============================================================

A very cool thing that you can do easily with the substitution matrix
classes is to create your own substitution matrix from an alignment. In
practice, this is normally done with protein alignments. In this
example, we'll first get a Biopython alignment object and then get a
summary object to calculate info about the alignment. The file
protein.aln (9) (also available online here (10)) contains the Clustalw
alignment output.

<<from Bio import Clustalw
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo

# get an alignment object from a Clustalw alignment output
c_align = Clustalw.parse_file("protein.aln", IUPAC.protein)
summary_align = AlignInfo.SummaryInfo(c_align)
Sections 6.3.1 and 14.3.1 contain more information on doing this.

Now that we've got our 'summary_align' object, we want to use it to
find out the number of times different residues substitute for each
other. To make the example more readable, we'll focus on only amino
acids with polar charged side chains. Luckily, this can be done easily
when generating a replacement dictionary, by passing in all of the
characters that should be ignored. Thus we'll create a dictionary of
replacements for only charged polar amino acids using:

<<replace_info = summary_align.replacement_dictionary(["G", "A", "V",
This information about amino acid replacements is represented as a
python dictionary which will look something like:

<<{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'):
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'):
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how
often we expect different things to substitute for each other. It turns
out, amazingly enough, that this is all of the information we need to go
ahead and create a substitution matrix. First, we use the replacement
dictionary information to create an Accepted Replacement Matrix (ARM):

<<from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)

With this accepted replacement matrix, we can go right ahead and
create our log odds matrix (i. e. a standard type Substitution Matrix):

<<my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following
optional arguments:

- 'exp_freq_table' -- You can pass a table of expected frequencies
for each alphabet. If supplied, this will be used instead of the
passed accepted replacement matrix when calculating expected
replacements.

- 'logbase' -- The base of the logarithm taken to create the log odd
matrix. Defaults to base 10.

- 'factor' -- The factor to multiply each matrix entry by. This
defaults to 10, which normally makes the matrix numbers easy to work
with.

- 'round_digit' -- The digit to round to in the matrix. This defaults
to 0 (i. e. no digits).
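To see how those three options interact, here is the arithmetic for a
single matrix cell in plain Python (an illustrative sketch of the log
odds calculation, not Biopython's actual SubsMat code; the function name
is made up):

```python
from math import log

def log_odds_cell(observed_count, total_count, expected_freq,
                  logbase=10, factor=10, round_digits=0):
    """One cell of a log odds matrix: factor * log_base(observed
    frequency / expected frequency), rounded -- mirroring the
    logbase, factor and round_digit options described above."""
    observed_freq = observed_count / float(total_count)
    return round(factor * log(observed_freq / expected_freq, logbase),
                 round_digits)

# A pair seen 20 times out of 100, against an expected frequency of
# 0.05: 10 * log10(0.2 / 0.05) = 10 * log10(4), rounded to 6.0
cell = log_odds_cell(20, 100, 0.05)
```

Positive cells mark pairs seen more often than chance would predict,
negative cells pairs seen less often.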
Once you've got your log odds matrix, you can display it prettily
using the function 'print_mat'. Doing this on our created matrix gives:

<<>>> my_lom.print_mat()

Very nice. Now we've got our very own substitution matrix to play
with!
14.5 BioSQL -- storing sequences in a relational database
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

BioSQL (11) is a joint effort between the OBF (12) projects
(BioPerl, BioJava etc) to support a shared database schema for storing
sequence data. In theory, you could load a GenBank file into the
database with BioPerl, then using Biopython extract this from the
database as a record object with features - and get more or less the
same thing as if you had loaded the GenBank file directly as a SeqRecord
using 'Bio.SeqIO' (Chapter 5).

Biopython's BioSQL module is currently documented at
http://biopython.org/wiki/BioSQL which is part of our wiki pages.
The 'Bio.InterPro' module works with files from the InterPro database,
which can be obtained from the InterPro project:
http://www.ebi.ac.uk/interpro/.

The 'Bio.InterPro.Record' contains all the information stored in an
InterPro record. Its string representation also is a valid InterPro
record, but it is NOT guaranteed to be equivalent to the record from
which it was produced.

The 'Bio.InterPro.Record' contains:
-----------------------------------

(1) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb

(2) http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html

(3) http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna

(4) http://matplotlib.sourceforge.net/

(5) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(6) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(7) http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta

(8) http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=SRR001666

(9) examples/protein.aln

(10) http://biopython.org/DIST/docs/tutorial/examples/protein.aln

(11) http://www.biosql.org/

(12) http://open-bio.org/
Chapter 15 The Biopython testing framework
*********************************************

Biopython has a regression testing framework (the file 'run_tests.py')
based on unittest (1), the standard unit testing framework for Python.
Providing comprehensive tests for modules is one of the most important
aspects of making sure that the Biopython code is as bug-free as
possible before going out. It also tends to be one of the most
undervalued aspects of contributing. This chapter is designed to make
running the Biopython tests and writing good test code as easy as
possible. Ideally, every module that goes into Biopython should have a
test (and should also have documentation!). All our developers, and
anyone installing Biopython from source, are strongly encouraged to run
the tests.
15.1 Running the tests
*=*=*=*=*=*=*=*=*=*=*=*

When you download the Biopython source code, or check it out from our
source code repository, you should find a subdirectory called 'Tests'.
This contains the key script 'run_tests.py', lots of individual scripts
named 'test_XXX.py', a subdirectory called 'output' and lots of other
subdirectories which contain input files for the test suite.

As part of building and installing Biopython you will typically run
the full test suite at the command line from the Biopython source top
level directory using the following:

<<python setup.py test

This is actually equivalent to going to the 'Tests' subdirectory and
running:

<<python run_tests.py

You'll often want to run just some of the tests, and this is done like
this:

<<python run_tests.py test_SeqIO.py test_AlignIO.py

When giving the list of tests, the '.py' extension is optional, so you
can also just type:

<<python run_tests.py test_SeqIO test_AlignIO

To run the docstring tests (see section 15.3), you can use

<<python run_tests.py doctest

By default, 'run_tests.py' runs all tests, including the docstring
tests.
If an individual test is failing, you can also try running it
directly, which may give you more information.

Importantly, note that the individual unit tests come in two types:

- Simple print-and-compare scripts. These unit tests are essentially
short example Python programs, which print out various output text.
For a test file named 'test_XXX.py' there will be a matching text
file called 'test_XXX' under the 'output' subdirectory which contains
the expected output. All that the test framework does is run the
script, and check the output agrees.

- Standard 'unittest'-based tests. These will 'import unittest' and
then define 'unittest.TestCase' classes, each with one or more
sub-tests as methods starting with 'test_' which check some specific
aspect of the code. These tests should not print any output directly.
10895
Currently, about half of the Biopython tests are 'unittest'-style
10896
tests, and half are print-and-compare tests.
10897
Running a simple print-and-compare test directly will usually give
10898
lots of output on screen, but does not check the output matches the
10899
expected output. If the test is failing with an exception error, it
10900
should be very easy to locate where exactly the script is failing. For
10901
an example of a print-and-compare test, try:
10902
<<python test_SeqIO.py
10905
The 'unittest'-based tests instead show you exactly which
10906
sub-section(s) of the test are failing. For example,
10907
<<python test_Cluster.py
10913
15.2 Writing tests
*=*=*=*=*=*=*=*=*=*
Let's say you want to write some tests for a module called 'Biospam'.
This can be a module you wrote, or an existing module that doesn't have
any tests yet. In the examples below, we assume that 'Biospam' is a
module that does simple math.
10920
Each Biopython test can have three important files and directories
involved with it:
1. 'test_Biospam.py' -- The actual test code for your module.
2. 'Biospam' [optional] -- A directory where any necessary input files
will be located. Any output files that will be generated should also
be written here (and preferably cleaned up after the tests are done)
to prevent clogging up the main Tests directory.
3. 'output/Biospam' -- [for print-and-compare tests only] This file
contains the expected output from running 'test_Biospam.py'. This
file is not needed for 'unittest'-style tests, since there the
validation is done in the test script 'test_Biospam.py' itself.
10934
It's up to you to decide whether you want to write a print-and-compare
test script or a 'unittest'-style test script. The important thing is
that you cannot mix these two styles in a single test script.
Particularly, don't use 'unittest' features in a print-and-compare test.
Any script with a 'test_' prefix in the 'Tests' directory will be
found and run by 'run_tests.py'. Below, we show an example test script
'test_Biospam.py' both for a print-and-compare test and for a
'unittest'-based test. If you put this script in the Biopython 'Tests'
directory, then 'run_tests.py' will find it and execute the tests
within it:
<<$ python run_tests.py
test_AlignIO ... ok
test_BioSQL_SeqIO ... ok
test_Biospam ... ok
test_Clustalw ... ok
----------------------------------------------------------------------
Ran 107 tests in 86.127 seconds
10960
15.2.1 Writing a print-and-compare test
========================================
A print-and-compare style test should be much simpler for beginners or
novices to write - essentially it is just an example script using your
module.
Here is what you should do to make a print-and-compare test for the
'Biospam' module:
1. Write a script called 'test_Biospam.py'
- This script should live in the Tests directory
- The script should test all of the important functionality of the
module (the more you test the better your test is, of course!).
- Try to avoid anything which might be platform specific, such as
printing floating point numbers without using an explicit
formatting string to avoid having too many decimal places
(different platforms can give very slightly different values).
10985
2. If the script requires files to do the testing, these should go in
the directory Tests/Biospam (if you just need something generic, like
a FASTA sequence file, or a GenBank record, try and use an existing
sample input file instead).
3. Write out the test output and verify the output to be correct.
There are two ways to do this:
- Run the script and write its output to a file. On UNIX
(including Linux and Mac OS X) machines, you would do something
like: 'python test_Biospam.py > test_Biospam' which would write
the output to the file 'test_Biospam'.
- Manually look at the file 'test_Biospam' to make sure the
output is correct. When you are sure it is all right and there
are no bugs, you need to quickly edit the 'test_Biospam' file
so that the first line is: 'test_Biospam' (no quotes).
- Copy the 'test_Biospam' file to the directory Tests/output
- Run 'python run_tests.py -g test_Biospam.py'. The regression
testing framework is nifty enough that it'll put the output in
the right place in just the way it likes it.
- Go to the output (which should be in
'Tests/output/test_Biospam') and double check the output to
make sure it is all correct.
4. Now change to the Tests directory and run the regression tests
with 'python run_tests.py'. This will run all of the tests, and you
should see your test run (and pass!).
5. That's it! Now you've got a nice test for your module ready to
check in, or submit to Biopython. Congratulations!
11031
As an example, the 'test_Biospam.py' test script to test the
'addition' and 'multiplication' functions in the 'Biospam' module could
look like this:
<<from Bio import Biospam
print "2 + 3 =", Biospam.addition(2, 3)
print "9 - 1 =", Biospam.addition(9, -1)
print "2 * 3 =", Biospam.multiplication(2, 3)
print "9 * (- 1) =", Biospam.multiplication(9, -1)
We generate the corresponding output with 'python run_tests.py -g
test_Biospam.py', and check the output file 'output/test_Biospam':
11051
Often, the difficulty with larger print-and-compare tests is to keep
track of which line in the output corresponds to which command in the
test script. For this purpose, it is important to print out some markers
to help you match lines in the input script with the generated output.
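For example, a print-and-compare script might interleave marker lines with its real output (a sketch only; 'addition' below is a local stand-in for the hypothetical 'Biospam.addition' function):

```python
def addition(a, b):
    # Stand-in for the hypothetical Biospam.addition function.
    return a + b

# Marker lines make it easy to match a chunk of output back to the
# command in the script that produced it:
print("check1: simple addition")
print("2 + 3 = %i" % addition(2, 3))
print("check2: addition with a negative number")
print("9 - 1 = %i" % addition(9, -1))
```

If a comparison fails, the nearest "checkN" marker in the expected output file tells you which part of the script to look at.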
11057
15.2.2 Writing a unittest-based test
=====================================
We want all the modules in Biopython to have unit tests, and a simple
print-and-compare test is better than no test at all. However, although
there is a steeper learning curve, using the 'unittest' framework gives
a more structured result, and if there is a test failure this can
clearly pinpoint which part of the test is going wrong. The sub-tests
can also be run individually, which is helpful for testing or debugging.
The 'unittest' framework has been included with Python since version
2.1, and is documented in the Python Library Reference (which I know you
are keeping under your pillow, as recommended). There is also online
documentation for unittest (2). If you are familiar with the 'unittest'
system (or something similar like the nose test framework), you
shouldn't have any trouble. You may find looking at the existing
examples within Biopython helpful too.
11073
Here's a minimal 'unittest'-style test script for 'Biospam', which you
can copy and paste to get started:
<<import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
11104
In the division tests, we use 'assertAlmostEqual' instead of
'assertEqual' to avoid tests failing due to roundoff errors; see the
'unittest' chapter in the Python documentation for details and for other
functionality available in 'unittest' (online reference (3)).
11108
These are the key points of 'unittest'-based tests:
- Test cases are stored in classes that derive from
'unittest.TestCase' and cover one basic aspect of your code
- You can use methods 'setUp' and 'tearDown' for any repeated code
which should be run before and after each test method. For example,
the 'setUp' method might be used to create an instance of the object
you are testing, or open a file handle. The 'tearDown' should do any
"tidying up", for example closing the file handle.
- The tests are prefixed with 'test_' and each test should cover one
specific part of what you are trying to test. You can have as many
tests as you want in a class.
- At the end of the test script, you can use
<<if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
to execute the tests when the script is run by itself (rather than
imported from 'run_tests.py'). If you run this script, then you'll
see something like the following:
<<$ python test_BiospamMyModule.py
test_addition1 (__main__.BiospamTestAddition) ... ok
test_addition2 (__main__.BiospamTestAddition) ... ok
test_division1 (__main__.BiospamTestDivision) ... ok
test_division2 (__main__.BiospamTestDivision) ... ok
-------------------------------------------------------------------
Ran 4 tests in 0.059s

OK
- To indicate more clearly what each test is doing, you can add
docstrings to each test. These are shown when running the tests,
which can be useful information if a test is failing.
11150
<<import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        """An addition test"""
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        """A second addition test"""
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        """Now let's check division"""
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        """A second division test"""
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity = 2)
    unittest.main(testRunner=runner)
Running the script will now show you:
<<$ python test_BiospamMyModule.py
An addition test ... ok
A second addition test ... ok
Now let's check division ... ok
A second division test ... ok
-------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
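The 'setUp' and 'tearDown' methods mentioned earlier can be sketched as follows (a generic illustration, not a real Biopython test; here each test gets a fresh scratch file which is tidied away afterwards):

```python
import os
import tempfile
import unittest

class FileHandleTest(unittest.TestCase):
    # setUp runs before *each* test method, tearDown after each one.

    def setUp(self):
        # Create a scratch file and open a handle to it for every test:
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        self.handle = open(self.path, "w")

    def tearDown(self):
        # Tidy up: close the handle and remove the scratch file.
        self.handle.close()
        os.remove(self.path)

    def test_write(self):
        self.handle.write("spam")
        self.handle.flush()
        self.assertTrue(os.path.getsize(self.path) > 0)

if __name__ == "__main__":
    unittest.main()
```

Because the fixture is rebuilt for every test method, the tests stay independent of each other and can safely be run individually.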
11196
If your module contains docstring tests (see section 15.3), you may
want to include those in the tests to be run. You can do so as follows
by modifying the code under 'if __name__ == "__main__":' to look like
this:
<<if __name__ == "__main__":
    unittest_suite = unittest.TestLoader().loadTestsFromName("test_Biospam")
    doctest_suite = doctest.DocTestSuite(Biospam)
    suite = unittest.TestSuite((unittest_suite, doctest_suite))
    runner = unittest.TextTestRunner(sys.stdout, verbosity = 2)
    runner.run(suite)
This is only relevant if you want to run the docstring tests when you
execute 'python test_Biospam.py'; with 'python run_tests.py', the
docstring tests are run automatically (assuming they are included in the
list of docstring tests in 'run_tests.py', see the section below).
11215
15.3 Writing doctests
*=*=*=*=*=*=*=*=*=*=*=
Python modules, classes and functions support built-in documentation
using docstrings. The doctest framework (4) (included with Python)
allows the developer to embed working examples in the docstrings, and
have these examples automatically tested.
Currently only a small part of Biopython includes doctests. The
'run_tests.py' script takes care of running the doctests. For this
purpose, at the top of the 'run_tests.py' script is a manually compiled
list of modules to test, which allows us to skip modules with optional
external dependencies which may not be installed (e.g. the ReportLab and
NumPy libraries). So, if you've added some doctests to the docstrings in
a Biopython module, in order to have them included in the Biopython test
suite, you must update 'run_tests.py' to include your module. Currently,
the relevant part of 'run_tests.py' looks as follows:
11232
<<# This is the list of modules containing docstring tests.
# If you develop docstring tests for other modules, please add
# those modules here.
DOCTEST_MODULES = ["Bio.Seq",
                   "Bio.Align.Generic",
                   "Bio.KEGG.Compound",
                  ]
#Silently ignore any doctests for modules requiring numpy!
try:
    import numpy
    DOCTEST_MODULES.extend(["Bio.Statistics.lowess"])
except ImportError:
    pass
11253
Note that we regard doctests primarily as documentation, so you should
stick to typical usage. Generally complicated examples dealing with
error conditions and the like would be best left to a dedicated unit
test.
Note that if you want to write doctests involving file parsing,
defining the file location complicates matters. Ideally use relative
paths assuming the code will be run from the 'Tests' directory, see the
'Bio.SeqIO' doctests for an example of this.
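As a sketch, a doctest is simply an interpreter session pasted into a docstring; 'gc_fraction' below is a hypothetical helper, not an actual Biopython function:

```python
def gc_fraction(seq):
    """Return the fraction of G and C letters in a nucleotide string.

    A typical usage example, which doctest can check automatically:

    >>> gc_fraction("ATGC")
    0.5
    >>> gc_fraction("AAAA")
    0.0
    """
    return float(seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    # Verify every docstring example in this module:
    import doctest
    doctest.testmod()
```

The doctest framework runs each '>>>' line and compares the actual output against the line that follows it, so the documentation can never silently drift out of date.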
11261
To run the docstring tests only, use
<<$ python run_tests.py doctest
-----------------------------------
(1) http://docs.python.org/library/unittest.html
(2) http://docs.python.org/library/unittest.html
(3) http://docs.python.org/library/unittest.html
(4) http://docs.python.org/library/doctest.html
11277
Chapter 16 Advanced
**********************
16.1 Parser Design
*=*=*=*=*=*=*=*=*=*
Many of the older Biopython parsers were built around an
event-oriented design that includes Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line,
sending off an event whenever they recognize some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an 'organism_name' event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
'organism_name' event, and then processes it in whatever manner is
necessary in the current application.
This is a very flexible framework, which is advantageous if you want
to be able to parse a file format into more than one representation.
For example, the 'Bio.GenBank' module uses this to construct either
'SeqRecord' objects or file-format-specific record objects.
More recently, many of the parsers added for 'Bio.SeqIO' and
'Bio.AlignIO' take a much simpler approach, but only generate a single
object representation ('SeqRecord' and 'Alignment' objects
respectively). In some cases the 'Bio.SeqIO' parsers actually wrap
another Biopython parser - for example, the 'Bio.SwissProt' parser
produces SwissProt format specific record objects, which get converted
into 'SeqRecord' objects.
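A minimal sketch of the Scanner/Consumer pattern described above (the class names, line prefixes and events here are made up for illustration; they are not the actual Bio.GenBank machinery):

```python
class Scanner(object):
    """Reads lines from a data source and fires events at a consumer."""

    def feed(self, handle, consumer):
        for line in handle:
            line = line.strip()
            if line.startswith("ORGANISM "):
                # Event: we recognized an organism name on this line.
                consumer.organism_name(line[len("ORGANISM "):])
            elif line.startswith("SEQUENCE "):
                consumer.sequence(line[len("SEQUENCE "):])

class DictConsumer(object):
    """One of many possible consumers: collect events into a dict."""

    def __init__(self):
        self.record = {}

    def organism_name(self, name):
        self.record["organism"] = name

    def sequence(self, seq):
        self.record["sequence"] = seq

# Any iterable of lines will do as the data source:
lines = ["ORGANISM Escherichia coli", "SEQUENCE ATGCATGC"]
consumer = DictConsumer()
Scanner().feed(lines, consumer)
# consumer.record now maps "organism" and "sequence" to the parsed values
```

The flexibility comes from the fact that the same Scanner can be paired with a different consumer class to build a completely different representation of the same file.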
11310
16.2 Substitution Matrices
*=*=*=*=*=*=*=*=*=*=*=*=*=*
16.2.1 SubsMat
===============
This module provides a class and a few routines for generating
substitution matrices, similar to BLOSUM or PAM matrices, but based on
user-provided data.
Additionally, you may select a matrix from MatrixInfo.py, a collection
of established substitution matrices.
11324
<<class SeqMat(UserDict.UserDict)
An instance of this class has the following attributes:
1. 'self.data': a dictionary in the form of '{(i1,j1):n1,
(i1,j2):n2,...,(ik,jk):nk}' where i, j are alphabet letters, and n
is a value.
2. 'self.alphabet': a class as defined in Bio.Alphabet
3. 'self.ab_list': a list of the alphabet's letters, sorted.
Needed mainly for internal purposes.
4. 'self.sum_letters': a dictionary, '{i1: s1, i2: s2,...,in:sn}',
where:
1. i: an alphabet letter;
2. s: sum of all values in a half-matrix for that letter;
3. n: number of letters in alphabet.
11355
<<__init__(self,data=None,alphabet=None,
mat_type=NOTYPE,mat_name='',build_later=0):
1. 'data': can be either a dictionary, or another SeqMat instance.
2. 'alphabet': a Bio.Alphabet instance. If not provided,
construct an alphabet from data.
3. 'mat_type': type of matrix generated. One of the following:
NOTYPE No type defined
ACCREP Accepted Replacements Matrix
OBSFREQ Observed Frequency Matrix
EXPFREQ Expected Frequency Matrix
SUBS Substitution Matrix
LO Log Odds Matrix
'mat_type' is provided automatically by some of SubsMat's
functions.
4. 'mat_name': matrix name, such as "BLOSUM62" or "PAM250"
5. 'build_later': default false. If true, the user may supply only an
alphabet and an empty dictionary, if intending to build the matrix
later. This skips the sanity check of alphabet size vs. matrix size.
11389
<<entropy(self,obs_freq_mat)
1. 'obs_freq_mat': an observed frequency matrix. Returns the
matrix's entropy, based on the frequency in 'obs_freq_mat'. The
matrix instance should be LO or SUBS.
<<letter_sum(self,letter)
Returns the sum of all values in the matrix for the provided letter.
<<all_letters_sum(self)
Fills the dictionary attribute 'self.sum_letters' with the sum of
values for each letter in the matrix's alphabet.
<<print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
Prints the matrix to file handle f. 'format' is the format field for
the matrix values; 'bottomformat' is the format field for the
bottom row, containing matrix letters. Example output for a
3-letter alphabet matrix:
The 'alphabet' optional argument is a string of all characters in
the alphabet. If supplied, the order of letters along the axes is
taken from the string, rather than by alphabetical order.
11433
The following section is laid out in the order by which most people
wish to generate a log-odds matrix. Of course, interim matrices can
be generated and investigated. Most people just want a log-odds
matrix, that's all.
11440
1. Generating an Accepted Replacement Matrix
Initially, you should generate an accepted replacement matrix (ARM)
from your data. The values in the ARM are the counted number of
replacements according to your data. The data could be a set of
pairs or multiple alignments. So for instance if Alanine was
replaced by Cysteine 10 times, and Cysteine by Alanine 12 times,
the corresponding ARM entries would be:
<<('A','C'): 10, ('C','A'): 12
As order doesn't matter, the user can provide only one entry:
<<('A','C'): 22
11454
A SeqMat instance may be initialized with either a full matrix (first
method of counting: 10, 12) or a half matrix (the latter method, 22).
A full protein alphabet matrix would be of the size
20x20 = 400. A half matrix of that alphabet would be 20x20/2 +
20/2 = 210. That is because same-letter entries don't change (the
matrix diagonal). Given an alphabet size of N:
1. Full matrix size: N*N
2. Half matrix size: N(N+1)/2
The SeqMat constructor automatically generates a half-matrix, if a
full matrix is passed. If a half matrix is passed, letters in the
key should be provided in alphabetical order: ('A','C') and not
('C','A').
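The size arithmetic above, and the folding of a full matrix into a half matrix, can be checked in a couple of lines of plain Python (independent of SubsMat):

```python
# For an alphabet of N letters:
N = 20
full_size = N * N                # every ordered pair: 400 for proteins
half_size = N * (N + 1) // 2     # unordered pairs incl. diagonal: 210

# Folding a full matrix into a half matrix: keep each key in
# alphabetical order and add the symmetric counts together.
full = {('A', 'C'): 10, ('C', 'A'): 12, ('A', 'A'): 7}
half = {}
for (i, j), n in full.items():
    key = (min(i, j), max(i, j))
    half[key] = half.get(key, 0) + n
# half now holds ('A','C'): 22 and the untouched diagonal ('A','A'): 7
```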
11470
At this point, if all you wish to do is generate a log-odds matrix,
please go to the section titled Example of Use. The following text
describes the nitty-gritty of internal functions, to be used by
people who wish to investigate their nucleotide/amino-acid
frequency data more thoroughly.
11476
2. Generating the observed frequency matrix (OFM)
<<OFM = SubsMat._build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement
counts, it contains replacement frequencies.
3. Generating an expected frequency matrix (EFM)
<<EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
1. 'exp_freq_table': should be a FreqTable instance. See
section 16.2.2 for detailed information on FreqTable. Briefly,
the expected frequency table has the frequencies of appearance
for each member of the alphabet. It is implemented as a
dictionary with the alphabet letters as keys, and each letter's
frequency as a value. Values sum to 1.
The expected frequency table can (and generally should) be generated
from the observed frequency matrix. So in most cases you will
generate 'exp_freq_table' using:
<<>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
>>> EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
But you can supply your own 'exp_freq_table', if you wish.
11507
4. Generating a substitution frequency matrix (SFM)
<<SFM = SubsMat._build_subs_mat(OFM,EFM)
Accepts an OFM and an EFM. Provides the quotient of the
corresponding values.
5. Generating a log-odds matrix (LOM)
<<LOM = SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=0])
Accepts an SFM, and the following optional arguments:
1. 'logbase': base of the logarithm used to generate the
log-odds values.
2. 'factor': factor used to multiply the log-odds values. Each
entry is generated by log(SFM[key])*factor, and rounded to the
'round_digit' place after the decimal point, if required.
11535
As most people would want to generate a log-odds matrix, with minimum
hassle, SubsMat provides one function which does it all:
<<make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10,
factor=10.0,round_digit=0):
1. 'acc_rep_mat': user provided accepted replacements matrix
2. 'exp_freq_table': expected frequencies table. Used if provided,
if not, generated from the 'acc_rep_mat'.
3. 'logbase': base of logarithm for the log-odds matrix. Default
base 10.
4. 'round_digit': number after decimal digit to which result
should be rounded. Default zero.
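The arithmetic behind the OFM/EFM/SFM/LOM pipeline above can be sketched in plain Python with a toy two-letter alphabet (the expected-frequency convention shown, doubling off-diagonal pairs, is one common choice and may differ in detail from Bio.SubsMat's internals; the counts are made up):

```python
import math

# Toy accepted replacement counts (half matrix, keys in alphabetical order):
arm = {('A', 'A'): 10, ('A', 'C'): 4, ('C', 'C'): 6}
total = sum(arm.values())

# Step 2: observed frequency matrix -- counts turned into frequencies.
ofm = dict((k, float(n) / total) for k, n in arm.items())

# Expected single-letter frequencies from the observed matrix: a letter
# gets its diagonal entry plus half of each off-diagonal entry.
freq = {}
for (i, j), f in ofm.items():
    if i == j:
        freq[i] = freq.get(i, 0.0) + f
    else:
        freq[i] = freq.get(i, 0.0) + f / 2
        freq[j] = freq.get(j, 0.0) + f / 2

# Step 3: expected frequency matrix. Off-diagonal entries are doubled
# because (i,j) and (j,i) are folded into one half-matrix entry.
efm = {}
for (i, j) in ofm:
    if i == j:
        efm[(i, j)] = freq[i] * freq[j]
    else:
        efm[(i, j)] = 2 * freq[i] * freq[j]

# Step 4: substitution frequency matrix -- observed over expected.
sfm = dict((k, ofm[k] / efm[k]) for k in ofm)

# Step 5: log-odds matrix, log base 10, scaled by a factor of 10 and
# rounded to whole numbers.
lom = dict((k, int(round(10 * math.log10(v)))) for k, v in sfm.items())
```

With these counts, pairs seen more often than chance (the diagonal) get positive scores and the rarer A/C replacement gets a negative one.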
11557
16.2.2 FreqTable
=================
<<FreqTable.FreqTable(UserDict.UserDict)
Attributes:
1. 'alphabet': A Bio.Alphabet instance.
2. 'data': frequency dictionary.
3. 'count': count dictionary (in case counts are provided).
Functions:
1. 'read_count(f)': read a count file from stream f, then convert
the counts to frequencies.
2. 'read_freq(f)': read a frequency data file from stream f. Of
course, we then don't have the counts, but it is usually the
letter frequencies which are interesting.
11580
3. Example of use: The expected count of the residues in the database
is sitting in a file, whitespace delimited, in the following format
(example given for a 3-letter alphabet):
<<A   35
B   65
C   100
And will be read using the 'FreqTable.read_count(file_handle)'
function.
An equivalent frequency file:
<<A   0.175
B   0.325
C   0.5
11596
Conversely, the residue frequencies or counts can be passed as a
dictionary. Example of a count dictionary (3-letter alphabet):
<<{'A': 35, 'B': 65, 'C': 100}
Which means that the expected counts would give a 0.5 frequency for
'C', a 0.325 probability of 'B' and a 0.175 probability of 'A' (out of
a total of 200, the sum of A, B and C).
A frequency dictionary for the same data would be:
<<{'A': 0.175, 'B': 0.325, 'C': 0.5}
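The counts-to-frequencies arithmetic above can be verified directly (plain Python, no FreqTable needed):

```python
counts = {'A': 35, 'B': 65, 'C': 100}
total = sum(counts.values())  # 200
# Divide each count by the total to get the frequency dictionary:
freqs = dict((letter, float(n) / total) for letter, n in counts.items())
# freqs is now {'A': 0.175, 'B': 0.325, 'C': 0.5}
```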
11609
When passing a dictionary as an argument, you should indicate whether
it is a count or a frequency dictionary. Therefore the FreqTable
class constructor requires two arguments: the dictionary itself, and
FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies,
respectively.
Read expected counts with 'read_count', which will also generate the
frequencies. Any one of the following may be done to generate the
frequency table:
<<>>> from SubsMat import *
>>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
>>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
>>> ftab = FreqTable.read_count(open('myCountFile'))
>>> ftab = FreqTable.read_frequency(open('myFrequencyFile'))
11628
Chapter 17 Where to go from here -- contributing to Biopython
****************************************************************
17.1 Bug Reports + Feature Requests
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Getting feedback on the Biopython modules is very important to us.
Open-source projects like this benefit greatly from feedback,
bug-reports (and patches!) from a wide variety of contributors.
The main forums for discussing feature requests and potential bugs are
the Biopython mailing lists (1):
- biopython@biopython.org -- An unmoderated list for discussion of
anything to do with Biopython.
- biopython-dev@biopython.org -- A more development oriented list
that is mainly used by developers (but anyone is free to
contribute!).
Additionally, if you think you've found a bug, you can submit it to
our bug-tracking page at http://bugzilla.open-bio.org/. This way, it
won't get buried in anyone's Inbox and forgotten about.
11656
17.2 Mailing lists and helping newcomers
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We encourage all our users to sign up to the main Biopython mailing
list. Once you've got the hang of an area of Biopython, we'd encourage
you to help answer questions from beginners. After all, you were a
beginner once.
17.3 Contributing Documentation
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
We're happy to take feedback or contributions - either via a
bug-report or on the Mailing List. While reading this tutorial, perhaps
you noticed some topics you were interested in which were missing, or
not clearly explained. There is also Biopython's built-in documentation
(the docstrings, these are also online (2)), where again, you may be
able to help fill in any blanks.
11678
17.4 Contributing cookbook examples
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
As explained in Chapter 14, Biopython now has a wiki collection of
user contributed "cookbook" examples,
http://biopython.org/wiki/Category:Cookbook -- maybe you can add to
this?
11687
17.5 Maintaining a distribution for a platform
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
We currently provide source code archives (suitable for any OS, if you
have the right build tools installed), and Windows Installers which are
just click and run. This covers all the major operating systems.
Most major Linux distributions have volunteers who take these source
code releases, and compile them into packages for Linux users to easily
install (taking care of dependencies etc). This is really great and we
are of course very grateful. If you would like to contribute to this
work, please find out more about how your Linux distribution handles
this.
Below are some tips for certain platforms to maybe get people started
with helping out:
11705
Windows -- Windows products typically have a nice graphical installer
that installs all of the essential components in the right place. We
use Distutils to create an installer of this type fairly easily.
You must first make sure you have a C compiler on your Windows
computer, and that you can compile and install things (this is the
hard bit - see the Biopython installation instructions for info on
how to do this).
Once you are set up with a C compiler, making the installer just
requires doing:
<<python setup.py bdist_wininst
Now you've got a Windows installer. Congrats! At the moment we have no
trouble shipping installers built on 32 bit Windows. If anyone would
like to look into supporting 64 bit Windows that would be great.
11721
RPMs -- RPMs are pretty popular package systems on some Linux
platforms. There is lots of documentation on RPMs available at
http://www.rpm.org to help you get started with them. To create an
RPM for your platform is really easy. You just need to be able to
build the package from source (having a C compiler that works is thus
essential) -- see the Biopython installation instructions for more
info.
To make the RPM, you just need to do:
<<python setup.py bdist_rpm
This will create an RPM for your specific platform and a source RPM in
the directory 'dist'. This RPM should be good and ready to go, so
this is all you need to do! Nice and easy.
11736
Macintosh -- Since Apple moved to Mac OS X, things have become much
easier on the Mac. We generally treat it as just another Unix
variant, and installing Biopython from source is just as easy as on
Linux. The easiest way to get all the GCC compilers etc installed is
to install Apple's X-Code. We might be able to provide click and run
installers for Mac OS X, but to date there hasn't been any demand.
Once you've got a package, please test it on your system to make sure
it installs everything in a good way and seems to work properly. Once
you feel good about it, send it off to one of the Biopython developers
(write to our main mailing list at biopython@biopython.org if you're not
sure who to send it to) and you've done it. Thanks!
11750
17.6 Contributing Unit Tests
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Even if you don't have any new functionality to add to Biopython, but
you want to write some code, please consider extending our unit test
coverage. We've devoted all of Chapter 15 to this topic.
11759
17.7 Contributing Code
*=*=*=*=*=*=*=*=*=*=*=*
There are no barriers to joining Biopython code development other than
an interest in creating biology-related code in Python. The best place
to express an interest is on the Biopython mailing lists -- just let us
know you are interested in coding and what kind of stuff you want to
work on. Normally, we try to have some discussion on modules before
coding them, since that helps generate good ideas -- then just feel free
to jump right in and start coding!
The main Biopython release tries to be fairly uniform and
interworkable, to make it easier for users. You can read about some of
the (fairly informal) coding style guidelines we try to use in Biopython
in the contributing documentation at
http://biopython.org/wiki/Contributing. We also try to add code to the
distribution along with tests (see Chapter 15 for more info on the
regression testing framework) and documentation, so that everything can
stay as workable and well documented as possible (including docstrings).
This is, of course, the most ideal situation; in many situations
you'll be able to find other people on the list who will be willing to
help add documentation or more tests for your code once you make it
available. So, to end this paragraph like the last, feel free to start
working!
11783
Please note that to make a code contribution you must have the legal
11784
right to contribute it and license it under the Biopython license. If
11785
you wrote it all yourself, and it is not based on any other code, this
11786
shouldn't be a problem. However, there are issues if you want to
11787
contribute a derivative work - for example something based on GPL or
11788
LPGL licenced code would not be compatible with our license. If you have
11789
any queries on this, please discuss the issue on the biopython-dev
11791
Another point of concern for any addition to Biopython is any
build-time or run-time dependency it introduces. Generally speaking,
writing code to interact with a standalone tool (like BLAST, EMBOSS or
ClustalW) doesn't present a big problem. However, any dependency on
another library - even a Python library (especially one needed in order
to compile and install Biopython, like NumPy) - would need further
discussion.
Additionally, if you have code that you don't think fits in the
distribution, but that you want to make available, we maintain Script
Central (http://biopython.org/wiki/Scriptcentral) which has pointers to
freely available code in Python for bioinformatics.
Hopefully this documentation has got you excited enough about
Biopython to try it out (and most importantly, contribute!). Thanks for
reading all the way through!
-----------------------------------
(1) http://biopython.org/wiki/Mailing_lists
(2) http://biopython.org/DIST/docs/api
Chapter 18 Appendix: Useful stuff about Python
**********************************************
If you haven't spent a lot of time programming in Python, many
questions and problems that come up in using Biopython are often
related to Python itself. This section tries to present some ideas and
code that come up often (at least for us!) while using the Biopython
libraries. If you have any suggestions for useful pointers that could
go here, please contribute!
18.1 What the heck is a handle?
*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Handles are mentioned quite frequently throughout this documentation,
and are also fairly confusing (at least to me!). Basically, you can
think of a handle as being a "wrapper" around text information.

Handles provide (at least) two benefits over plain text information:
1. They provide a standard way to deal with information stored in
different ways. The text information can be in a file, or in a string
stored in memory, or the output from a command line program, or at
some remote website, but the handle provides a common way of dealing
with information in all of these formats.

2. They allow text information to be read incrementally, instead of
all at once. This is really important when you are dealing with huge
text files which would use up all of your memory if you had to load
it all at once.
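The first benefit can be sketched with a small example: the same
function works whether its handle wraps a file on disk or a string in
memory. This sketch uses io.StringIO (the modern standard library
equivalent of the cStringIO module used later in this chapter), and the
count_fasta_records helper is purely illustrative, not part of
Biopython:

```python
import io

def count_fasta_records(handle):
    # Count FASTA records by counting header lines starting with ">".
    # The function neither knows nor cares where the handle's text
    # comes from.
    return sum(1 for line in handle if line.startswith(">"))

fasta_text = ">seq1\nACGT\n>seq2\nGGCC\n"

# A handle wrapping an in-memory string:
print(count_fasta_records(io.StringIO(fasta_text)))  # prints 2

# A handle wrapping a file on disk:
with open("two_records.fasta", "w") as out:
    out.write(fasta_text)
with open("two_records.fasta") as handle:
    print(count_fasta_records(handle))  # prints 2
```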
Handles can deal with text information that is being read
(e.g. reading from a file) or written (e.g. writing information to a
file). In the case of a "read" handle, commonly used functions are
'read()', which reads the entire text information from the handle, and
'readline()', which reads information one line at a time. For "write"
handles, the function 'write()' is regularly used.
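For completeness, here is a minimal sketch of a "write" handle in
action (the file name example_output.txt is made up for this
illustration):

```python
# Open a write handle, write two lines, then close it so the data
# is flushed to disk.
handle = open("example_output.txt", "w")
handle.write("first line\n")
handle.write("second line\n")
handle.close()

# Read the same file back through a read handle.
handle = open("example_output.txt", "r")
print(handle.read())
handle.close()
```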
The most common usage for handles is reading information from a file,
which is done using the built-in Python function 'open'. Here, we open
a handle to the file m_cold.fasta (1) (also available online here (2)):

>>> handle = open("m_cold.fasta", "r")
>>> handle.readline()
">gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum ...\n"
Handles are regularly used in Biopython for passing information to
parsers.
18.1.1 Creating a handle from a string
=======================================
One useful thing is to be able to turn information contained in a
string into a handle. The following example shows how to do this using
'cStringIO' from the Python standard library:

>>> my_info = 'A string\n with multiple lines.'
>>> print my_info
A string
 with multiple lines.
>>> import cStringIO
>>> my_info_handle = cStringIO.StringIO(my_info)
>>> first_line = my_info_handle.readline()
>>> print first_line
A string

>>> second_line = my_info_handle.readline()
>>> print second_line
 with multiple lines.
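Note that the 'cStringIO' module belongs to the Python 2 series that
was current when this tutorial was written; on Python 3 the equivalent
handle class is io.StringIO, which behaves the same way for this
example:

```python
import io

my_info = 'A string\n with multiple lines.'
my_info_handle = io.StringIO(my_info)

# readline() returns one line at a time, just as with cStringIO.
print(my_info_handle.readline())  # "A string" plus its newline
print(my_info_handle.readline())  # " with multiple lines."
```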
-----------------------------------------------------------------------
This document was translated from LaTeX by HeVeA (3).
-----------------------------------
(1) examples/m_cold.fasta
(2) http://biopython.org/DIST/docs/tutorial/examples/m_cold.fasta
(3) http://hevea.inria.fr/index.html