7
.. authors, Gavin Huttley, Tom Elliott
7
.. authors, Gavin Huttley, Tom Elliott, Jeremy Widmann
12
12
In order to run BLAST locally (from a program running on your computer) you will need to do three things:
14
- download the BLAST "executables" from NCBI
15
- make sure these programs are available on your ``PATH``
16
- construct and format a database to search against
14
- Download the BLAST "executables" from NCBI
15
- Make sure these programs are available on your ``PATH``
16
- Construct and format a database to search against
18
18
NCBI has recently changed the BLAST programs, and as yet PyCogent does not support the new versions. The "legacy" versions are available from `here <http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download>`_ (login as guest).
20
Detailed instructions are beyond the scope of this example, but after downloading the programs and setting up your ``PATH``, you should test BLAST by doing this from the command line:
21
Detailed installation instructions are beyond the scope of this example, but are available at `NCBI's website <http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html>`_ .
22
After downloading the programs and setting up your ``PATH``, you should test BLAST by doing this from the command line:
209
214
>>> remove_files(['data/blast_test.txt', 'data/blast_test.xml'],
210
215
... error_on_missing=False)
217
BLAST with protein sequences
218
----------------------------
220
In this example, we load a protein sequence from a file in the data directory and BLAST against a new protein database we will create. Since we want to BLAST protein sequences instead of DNA, we will have to construct a new BLAST database.
222
The file ``refseqs_protein.fasta`` contains some short sequences for use in the following examples.
226
>>> from cogent.app.formatdb import build_blast_db_from_fasta_path
227
>>> result = build_blast_db_from_fasta_path('data/refseqs_protein.fasta', is_protein=True)
229
'data/refseqs_protein.fasta'...
231
Notice that we set the parameter ``is_protein`` to ``True`` since our database consists of protein sequences this time. This was not necessary in the previous example, because ``is_protein`` is set to ``False`` by default.
233
Now that we have built our protein BLAST database, we can load our sequence and BLAST against this database.
237
>>> from cogent import LoadSeqs, PROTEIN
238
>>> from cogent.app.blast import blast_seqs, Blastall
239
>>> seqs = LoadSeqs('data/inseqs_protein.fasta', moltype=PROTEIN, aligned=False)
240
>>> seq = seqs.getSeq('1091044_fragment')
242
ProteinSequence(IPLDFDK... 26)
244
Notice we need to use '-p':'blastp' in the parameters dictionary, since ``blastp`` is used for protein.
248
>>> params={'-p':'blastp','-m':'9'}
249
>>> result = blast_seqs([seq],
251
... blast_db = 'data/refseqs_protein.fasta',
253
>>> data = result['StdOut'].read()
254
>>> print data.split('\n')[:1]
257
We save the results for further processing
261
>>> outfile = open('data/blast_protein_test.txt','w')
262
>>> outfile.write(data)
265
Now we will explore some of the convenience methods of the ``BlastResult`` object.
269
>>> from cogent.parse.blast import BlastResult
270
>>> blast_results = BlastResult(open('data/blast_protein_test.txt','r'))
272
Suppose we want to filter our results based on various criteria. In many cases you may want to only keep the top '3' matches with the longest 'ALIGNMENT LENGTH' for the query sequence to the target.
276
>>> best_hits = dict(blast_results.bestHitsByQuery(field='ALIGNMENT LENGTH', n=3))
277
>>> query_1_best_hits = best_hits['1']
278
>>> for hit in query_1_best_hits:
279
... for key, value in hit.items():
280
... print key.ljust(20), value
323
The fist of the top 3 hits for alignment length has 0 MISMATCHES and a % IDENTITY of 100.00. The next 2 hits have many MISMATCHES and a much lower % IDENTITY. Lets filter the results again, but by E-VALUE this time:
327
>>> best_hits = dict(blast_results.bestHitsByQuery(field='E-VALUE', n=3))
328
>>> query_1_best_hits = best_hits['1']
329
>>> for hit in query_1_best_hits:
330
... for key, value in hit.items():
331
... print key.ljust(20), value
374
You can filter the BLAST results by any of the fields you like. You can also use the ``BlastResult`` object to do a quick assessment of your BLAST results looking only at the fields you like:
378
>>> fields = ['SUBJECT ID', 'BIT SCORE', 'E-VALUE']
379
>>> for query, results in blast_results.items():
380
... print ''.join([f.ljust(20) for f in fields])
381
... for result in results[-1]:
382
... print ''.join(map(str,[result[field].ljust(20) for field in fields]))
383
SUBJECT ID BIT SCORE E-VALUE
404
>>> from cogent.util.misc import remove_files
405
>>> remove_files(['data/blast_protein_test.txt'],
406
... error_on_missing=False)