11
The development of BEDTools was motivated by a need for fast, flexible tools with which to compare large sets of genomic
12
features. Answering fundamental research questions with existing tools was either too slow or required modifications to the
13
way they reported or computed their results. We were aware of the utilities on the UCSC Genome Browser and Galaxy websites, as
14
well as the elegant tools available as part of Jim Kent’s monolithic suite of tools (“Kent source”). However, we found that
15
the web-based tools were too cumbersome when working with large datasets generated by current sequencing technologies.
16
Similarly, we found that the Kent source command line tools often required a local installation of the UCSC Genome Browser.
17
These limitations, combined with the fact that we often wanted an extra option here or there that wasn’t available with
18
existing tools, led us to develop our own from scratch. The initial version of BEDTools was publicly released in the spring of
19
2009. The current version has evolved from our research experiences and those of the scientists using the suite over the last
20
year. The BEDTools suite enables one to answer common questions of genomic data in a fast and reliable manner. The fact that
21
almost all the utilities accept input from “stdin” allows one to “stream / pipe” several commands together to facilitate more
22
complicated analyses. Also, the tools allow fine control over how output is reported. The initial version of BEDTools
23
supported solely 6-column `BED <http://genome.ucsc.edu/FAQ/FAQformat#format1>`_ files. *However, we have subsequently added support for sequence alignments in* `BAM <http://samtools.sourceforge.net/>`_
24
*format, as well as for features in* `GFF <http://genome.ucsc.edu/FAQ/FAQformat#format3>`_ , *“blocked” BED format, and*
25
`VCF <http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41>`_ *format*.
26
The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets. This manual seeks to describe the behavior and
27
available functionality for each BEDTool. Usage examples are scattered throughout the text, and formal examples are
28
provided in the last two sections, we hope that this document will give you a sense of the flexibility of
29
the toolkit and the types of analyses that are possible with BEDTools. If you have further questions, please join the BEDTools
30
discussion group, visit the Usage Examples on the Google Code site (usage, advanced usage), or take a look at the nascent
31
“Usage From the Wild” page.
33
===========================
34
1.2 Summary of available tools.
35
===========================
37
BEDTools support a wide range of operations for interrogating and manipulating genomic features. The table below summarizes
38
the tools available in the suite.
40
=========================== =========================================================================================================================================================
42
=========================== =========================================================================================================================================================
43
**intersectBed** Returns overlaps between two BED/GFF/VCF files.
44
**pairToBed** Returns overlaps between a paired-end BED file and a regular BED/VCF/GFF file.
45
**bamToBed** Converts BAM alignments to BED6, BED12, or BEDPE format.
46
**bedToBam** Converts BED/GFF/VCF features to BAM format.
47
**bed12ToBed6** Converts "blocked" BED12 features to discrete BED6 features.
48
**bedToIgv** Creates IGV batch scripts for taking multiple snapshots from BED/GFF/VCF features.
49
**coverageBed** Summarizes the depth and breadth of coverage of features in one BED versus features (e.g, windows, exons, etc.) defined in another BED/GFF/VCF file.
50
**multiBamCov** Counts sequence coverage for multiple position-sorted bams at specific loci defined in a BED/GFF/VCF file
51
**tagBam** Annotates a BAM file with custom tag fields based on overlaps with BED/GFF/VCF files
52
**nuclBed** Profiles the nucleotide content of intervals in a fasta file
53
**genomeCoverageBed** Creates either a histogram, BEDGRAPH, or a "per base" report of genome coverage.
54
**unionBedGraphs** Combines multiple BedGraph? files into a single file, allowing coverage/other comparisons between them.
55
**annotateBed** Annotates one BED/VCF/GFF file with overlaps from many others.
56
**groupBy** Deprecated. Now in the filo package.
57
**overlap** Returns the number of bases pairs of overlap b/w two features on the same line.
58
**pairToPair** Returns overlaps between two paired-end BED files.
59
**closestBed** Returns the closest feature to each entry in a BED/GFF/VCF file.
60
**subtractBed** Removes the portion of an interval that is overlapped by another feature.
61
**windowBed** Returns overlaps between two BED/VCF/GFF files based on a user-defined window.
62
**mergeBed** Merges overlapping features into a single feature.
63
**complementBed** Returns all intervals not spanned by the features in a BED/GFF/VCF file.
64
**fastaFromBed** Creates FASTA sequences based on intervals in a BED/GFF/VCF file.
65
**maskFastaFromBed** Masks a FASTA file based on BED coordinates.
66
**shuffleBed** Randomly permutes the locations of a BED file among a genome.
67
**slopBed** Adjusts each BED entry by a requested number of base pairs.
68
**flankBed** Creates flanking intervals for each feature in a BED/GFF/VCF file.
69
**sortBed** Sorts a BED file by chrom, then start position. Other ways as well.
70
**linksBed** Creates an HTML file of links to the UCSC or a custom browser.
71
=========================== =========================================================================================================================================================
78
===========================
79
1.3 Fundamental concepts.
80
===========================
81
------------------------------------------------------
82
1.3.1 What are genome features and how are they represented?
83
------------------------------------------------------
84
Throughout this manual, we will discuss how to use BEDTools to manipulate, compare and ask questions of genome “features”. Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g.
85
SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that
86
an individual lab or researcher defines (e.g., my novel gene or variant).
88
The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the
89
feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable.
91
The two most widely used formats for representing genome features are the BED (Browser Extensible Data) and GFF (General Feature Format) formats. BEDTools was originally written to work exclusively with genome features
92
described using the BED format, but it has been recently extended to seamlessly work with BED, GFF and VCF files.
94
Existing annotations for the genomes of many species can be easily downloaded in BED and GFF
95
format from the UCSC Genome Browser’s “Table Browser” (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) or from the “Bulk Downloads” page (http://hgdownload.cse.ucsc.edu/downloads.html). In addition, the
96
Ensemble Genome Browser contains annotations in GFF/GTF format for many species (http://www.ensembl.org/info/data/ftp/index.html)
98
-------------------------------------
99
1.3.2 Overlapping / intersecting features.
100
-------------------------------------
101
Two genome features (henceforth referred to as “features”) are said to overlap or intersect if they share at least one base in common.
102
In the figure below, Feature A intersects/overlaps Feature B, but it does not intersect/overlap Feature C.
104
**TODO: place figure here**
106
--------------------------------------------
107
1.3.3 Comparing features in file “A” and file “B”.
108
--------------------------------------------
109
The previous section briefly introduced a fundamental naming convention used in BEDTools. Specifically, all BEDTools that compare features contained in two distinct files refer to one file as feature set “A” and the other file as feature set “B”. This is mainly in the interest of brevity, but it also has its roots in set theory.
110
As an example, if one wanted to look for SNPs (file A) that overlap with exons (file B), one would use intersectBed in the following manner::
112
intersectBed –a snps.bed –b exons.bed
114
There are two exceptions to this rule: 1) When the “A” file is in BAM format, the “-abam” option must bed used. For example::
116
intersectBed –abam alignedReads.bam –b exons.bed
118
And 2) For tools where only one input feature file is needed, the “-i” option is used. For example::
120
mergeBed –i repeats.bed
122
-----------------------------------------------------
123
1.3.4 BED starts are zero-based and BED ends are one-based.
124
-----------------------------------------------------
125
BEDTools users are sometimes confused by the way the start and end of BED features are represented. Specifically, BEDTools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1)
126
In other words, BEDTools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base::
130
Why, you might ask? The advantage of storing features this way is that when computing the length of a feature, one must simply subtract the start from the end. Were the start position 1-based,
131
the calculation would be (slightly) more complex (i.e. (end-start)+1). Thus, storing BED features this way reduces the computational burden.
133
-----------------------------------------------------
134
1.3.5 GFF starts and ends are one-based.
135
-----------------------------------------------------
136
In contrast, the GFF format uses 1-based coordinates for both the start and the end positions. BEDTools is aware of this and adjusts the positions accordingly.
137
In other words, you don’t need to subtract 1 from the start positions of your GFF features for them to work correctly with BEDTools.
139
-----------------------------------------------------
140
1.3.6 VCF coordinates are one-based.
141
-----------------------------------------------------
142
The VCF format uses 1-based coordinates. As in GFF, BEDTools is aware of this and adjusts the positions accordingly.
143
In other words, you don’t need to subtract 1 from the start positions of your VCF features for them to work correctly with BEDTools.
145
-----------------------------------------------------
146
1.3.7 File B is loaded into memory (most of the time).
147
-----------------------------------------------------
148
Whenever a BEDTool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B.
149
Therefore to minimize memory usage, one should set the smaller of the two files as the B file. One salient example is the comparison of aligned sequence reads from a
150
current DNA sequencer to gene annotations. In this case, the aligned sequence file (in BED format) may have tens of millions of features (the sequence alignments),
151
while the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.
153
-----------------------------------------------------
154
1.3.8 Feature files *must* be tab-delimited.
155
-----------------------------------------------------
156
This is rather self-explanatory. While it is possible to allow BED files to be space-delimited, we have decided to require tab delimiters for three reasons:
158
1. By requiring one delimiter type, the processing time is minimized.
159
2. Tab-delimited files are more amenable to other UNIX utilities.
160
3. GFF files can contain spaces within attribute columns. This complicates the use of space-delimited files as spaces must therefore be treated specially depending on the context.
162
-------------------------------------------------------------
163
1.3.9 All BEDTools allow features to be “piped” via standard input.
164
-------------------------------------------------------------
166
In an effort to allow one to combine multiple BEDTools and other UNIX utilities into more complicated “pipelines”, all BEDTools allow features
167
to be passed to them via standard input. Only one feature file may be passed to a BEDTool via standard input.
168
The convention used by all BEDTools is to set either file A or file B to “stdin” or "-". For example::
170
cat snps.bed | intersectBed –a stdin –b exons.bed
171
cat snps.bed | intersectBed –a - –b exons.bed
173
In addition, all BEDTools that simply require one main input file (the -i file) will assume that input is
174
coming from standard input if the -i parameter is ignored. For example, the following are equivalent::
176
cat snps.bed | sortBed –i stdin
177
cat snps.bed | sortBed
179
------------------------------------------------------
180
1.3.10 Most BEDTools write their results to standard output.
181
------------------------------------------------------
182
To allow one to combine multiple BEDTools and other UNIX utilities into more complicated “pipelines”,
183
most BEDTools report their output to standard output, rather than to a named file. If one wants to write the output to a named file, one can use the UNIX “file redirection” symbol “>” to do so.
184
Writing to standard output (the default)::
186
intersectBed –a snps.bed –b exons.bed
187
chr1 100100 100101 rs233454
188
chr1 200100 200101 rs446788
189
chr1 300100 300101 rs645678
193
intersectBed –a snps.bed –b exons.bed > snps.in.exons.bed
195
cat snps.in.exons.bed
196
chr1 100100 100101 rs233454
197
chr1 200100 200101 rs446788
198
chr1 300100 300101 rs645678
200
------------------------
201
1.3.11 What is a “genome” file?
202
------------------------
203
Some of the BEDTools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of
204
the chromosomes for the organism for which your BED files are based. When using the UCSC Genome
205
Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working.
206
The way you do this for BEDTools is to create a “genome” file, which simply lists the names of the
207
chromosomes (or scaffolds, etc.) and their size (in basepairs).
208
Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans)::
216
BEDTools includes predefined genome files for human and mouse in the /genomes directory included
217
in the BEDTools distribution. Additionally, the “chromInfo” files/tables available from the UCSC
218
Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here:
219
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz
221
------------------------------------
222
1.3.12 Paired-end BED files (BEDPE files).
223
------------------------------------
224
We have defined a new file format (BEDPE) to concisely describe disjoint genome features, such as
225
structural variations or paired-end sequence alignments. We chose to define a new format because the
226
existing BED block format (i.e. BED12) does not allow inter-chromosomal feature definitions. Moreover,
227
the BED12 format feels rather bloated when one want to describe events with only two blocks.
229
------------------------------------------
230
1.3.13 Use “-h” for help with any BEDTool.
231
------------------------------------------
232
Rather straightforward. If you use the “-h” option with any BEDTool, a full menu of example usage
233
and available options (when applicable) will be reported.
235
--------------------------------------------------
236
1.3.14 BED features must not contain negative positions.
237
--------------------------------------------------
238
BEDTools will typically reject BED features that contain negative positions. In special cases, however,
239
BEDPE positions may be set to -1 to indicate that one or more ends of a BEDPE feature is unaligned.
241
---------------------------------------------------
242
1.3.15 The start position must be <= to the end position.
243
---------------------------------------------------
244
BEDTools will reject BED features where the start position is greater than the end position.
246
-----------------------------------------
247
1.3.16 Headers are allowed in GFF and BED files
248
-----------------------------------------
249
BEDTools will ignore headers at the beginning of BED and GFF files. Valid header lines begin with a
250
“#” symbol, the work “track”, or the word “browser”. For example, the following examples are valid
251
headers for BED or GFF files::
253
track name=aligned_read description="Illumina aligned reads”
254
chr5 100000 500000 read1 50 +
255
chr5 2380000 2386000 read2 60 -
257
#This is a fascinating dataset
258
chr5 100000 500000 read1 50 +
259
chr5 2380000 2386000 read2 60 -
261
browser position chr22:1-20000
262
chr5 100000 500000 read1 50 +
263
chr5 2380000 2386000 read2 60 -
265
-------------------------------------------------------------
266
1.3.17 GZIP support: BED, GFF, VCF, and BEDPE file can be “gzipped”
267
-------------------------------------------------------------
268
BEDTools will process gzipped BED, GFF, VCF and BEDPE files in the same manner as
269
uncompressed files. Gzipped files are auto-detected thanks to a helpful contribution from Gordon
272
----------------------------------------------------------------------------
273
1.3.18 Support for “split” or “spliced” BAM alignments and “blocked” BED features
274
----------------------------------------------------------------------------
275
As of Version 2.8.0, five BEDTools (``intersectBed``, ``coverageBed``, ``genomeCoverageBed``,
276
``bamToBed``, and ``bed12ToBed6``) can properly handle “split”/”spliced” BAM alignments (i.e., having an
277
“N” CIGAR operation) and/or “blocked” BED (aka BED12) features.
279
``intersectBed``, ``coverageBed``, and ``genomeCoverageBed`` will optionally handle “split” BAM and/or
280
“blocked” BED by using the ``-split`` option. This will cause intersects or coverage to be computed only
281
for the alignment or feature blocks. In contrast, without this option, the intersects/coverage would be
282
computed for the entire “span” of the alignment or feature, regardless of the size of the gaps between
283
each alignment or feature block. For example, imagine you have a RNA-seq read that originates from
284
the junction of two exons that were spliced together in a mRNA. In the genome, these two exons
285
happen to be 30Kb apart. Thus, when the read is aligned to the reference genome, one portion of the
286
read will align to the first exon, while another portion of the read will align ca. 30Kb downstream to the
287
other exon. The corresponding CIGAR string would be something like (assuming a 76bp read):
288
30M*3000N*46M. In the genome, this alignment “spans” 3076 bp, yet the nucleotides in the sequencing
289
read only align “cover” 76bp. Without the ``-split`` option, coverage or overlaps would be reported for the
290
entire 3076bp span of the alignment. However, with the ``-split`` option, coverage or overlaps will only
291
be reported for the portions of the read that overlap the exons (i.e. 30bp on one exon, and
295
Using the -split option with bamToBed causes “spliced/split” alignments to be reported in BED12
296
format. Using the -split option with bed12ToBed6 causes “blocked” BED12 features to be reported in
299
--------------------------------
300
1.3.19 Writing uncompressed BAM output.
301
--------------------------------
302
When working with a large BAM file using a complex set of tools in a pipe/stream, it is advantageous
303
to pass uncompressed BAM output to each downstream program. This minimizes the amount of time
304
spent compressing and decompressing output from one program to the next. All BEDTools that create
305
BAM output (e.g. ``intersectBed``, ``windowBed``) will now optionally create uncompressed BAM output
306
using the ``-ubam`` option.
310
=====================================
311
1.4 Implementation and algorithmic notes.
312
=====================================
313
BEDTools was implemented in C++ and makes extensive use of data structures and fundamental
314
algorithms from the Standard Template Library (STL). Many of the core algorithms are based upon the
315
genome binning algorithm described in the original UCSC Genome Browser paper (Kent et al, 2002).
316
The tools have been designed to inherit core data structures from central source files, thus allowing
317
rapid tool development and deployment of improvements and corrections. Support for BAM files is
318
made possible through Derek Barnett’s elegant C++ API called BamTools.
322
=====================================
323
1.5 License and availability.
324
=====================================
325
BEDTools is freely available under a GNU Public License (Version 2) at:
326
http://bedtools.googlecode.com
330
=====================================
332
=====================================
333
A discussion group for reporting bugs, asking questions of the developer and of the user community, as
334
well as for requesting new features is available at:
335
http://groups.google.com/group/bedtools-discuss
339
=====================================
341
=====================================
342
As open-source software, BEDTools greatly benefits from contributions made by other developers and
343
users of the tools. We encourage and welcome suggestions, contributions and complaints. This is how
344
software matures, improves and stays on top of the needs of its user community. The Google Code
345
(GC) site maintains a list of individuals who have contributed either source code or useful ideas for
346
improving the tools. In the near future, we hope to maintain a source repository on the GC site in
347
order to facilitate further contributions. We are currently unable to do so because we use Git for
348
version control, which is not yet supported by GC.
b'\\ No newline at end of file'