1
%\VignetteIndexEntry{An Introduction to GenomicRanges}
2
%\VignetteDepends{Rsamtools}
3
%\VignetteKeywords{sequence, sequencing, alignments}
4
%\VignettePackage{GenomicRanges}
5
\documentclass[10pt]{article}
17
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
18
\newcommand{\Robject}[1]{{\texttt{#1}}}
19
\newcommand{\Rpackage}[1]{{\textit{#1}}}
20
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
21
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
22
\newcommand{\Rclass}[1]{{\textit{#1}}}
23
\newcommand{\Rcode}[1]{{\texttt{#1}}}
25
\newcommand{\software}[1]{\textsf{#1}}
26
\newcommand{\R}{\software{R}}
27
\newcommand{\Bioconductor}{\software{Bioconductor}}
28
\newcommand{\GenomicRanges}{\Rpackage{GenomicRanges}}
31
\title{An Introduction to Genomic Ranges Classes}
32
\author{Marc Carlson \and Patrick Aboyoun \and Herv\'{e} Pag\`{e}s}
39
<<options,echo=FALSE>>=
44
\section{Introduction}
46
The \Rpackage{GenomicRanges} package serves as the foundation for
47
representing genomic locations within the \software{Bioconductor}
48
project. In the \software{Bioconductor} package hierarchy, it builds
49
upon the \Rpackage{IRanges} (infrastructure) package and provides
50
support for the \Rpackage{BSgenome} (infrastructure),
51
\Rpackage{Rsamtools} (I/O), \Rpackage{ShortRead} (I/O \& QA),
52
\Rpackage{rtracklayer} (I/O), and \Rpackage{GenomicFeatures}
53
(infrastructure) packages.
55
This package lays a foundation for genomic analysis by introducing
56
three classes (\Rclass{GRanges}, \Rclass{GRangesList}, and
57
\Rclass{GAlignments}), which are used to represent single
58
interval range features, multiple interval range features, and genomic
59
alignments respectively. This vignette focuses on these classes and
60
their associated methods.
62
The \Rpackage{GenomicRanges} package is available at bioconductor.org
63
and can be downloaded via \Rfunction{biocLite}:
65
<<biocLite, eval=FALSE>>=
66
source("http://bioconductor.org/biocLite.R")
67
biocLite("GenomicRanges")
69
<<initialize, results=hide>>=
70
library(GenomicRanges)
74
\section{\Rclass{GRanges}: Single Interval Range Features}
76
The \Rclass{GRanges} class represents a collection of genomic features
77
that each have a single start and end location on the genome. This
78
includes features such as contiguous binding sites, transcripts, and
79
exons. These objects can be created by using the \Rfunction{GRanges}
80
constructor function. For example,
85
Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
87
IRanges(1:10, end = 7:16, names = head(letters, 10)),
89
Rle(strand(c("-", "+", "*", "+", "-")),
92
GC = seq(1, 0, length=10))
97
creates a \Rclass{GRanges} object with 10 single interval features.
98
The output of the \Rclass{GRanges} \Rmethod{show} method separates the
99
information into a left and right hand region that are separated by
100
\Rcode{|} symbols. The genomic coordinates (seqnames, ranges, and strand)
101
are located on the left-hand side and the metadata columns (annotation)
102
are located on the right. For this example, the metadata is
103
comprised of \Rcode{score} and \Rcode{GC} information, but almost
104
anything can be stored in the metadata portion of a \Rclass{GRanges}
107
The components of the genomic coordinates within a \Rclass{GRanges}
108
object can be extracted using the \Rmethod{seqnames}, \Rmethod{ranges},
109
and \Rmethod{strand} accessor functions.
111
<<GRanges-location-accessors>>=
117
Stored annotations for these coordinates can be extracted as a
118
\Rclass{DataFrame} object using the \Rmethod{mcols} accessor.
125
Finally, the total lengths of the various sequences that the ranges
126
are aligned to can also be stored in the \Rclass{GRanges} object. So
127
if this is data from \textit{Homo sapiens}, we can set the values as:
130
seqlengths(gr) <- c(249250621,243199373,198022430)
133
And then retrieves as:
138
Methods for accessing the \Rmethod{length} and \Rmethod{names} have
147
\subsection{Splitting and combining \Rclass{GRanges} objects}
149
\Rclass{GRanges} objects can be devided into groups using the
150
\Rmethod{split} method. This produces a \Rclass{GRangesList} object,
151
a class that will be discussed in detail in the next section.
153
<<splitAppendGRanges>>=
154
sp <- split(gr, rep(1:2, each=5))
158
If you then grab the components of this list, they can also be merged
159
by using the \Rmethod{c} and \Rmethod{append} methods.
165
\subsection{Subsetting \Rclass{GRanges} objects}
167
The expected subsetting operations are also available for
168
\Rclass{GRanges} objects.
174
A second argument to the \Rmethod{[} subset operator can be used
175
to specify which metadata columns to extract from the
176
\Rclass{GRanges} object. For example,
182
You can also assign into elements of the \Rclass{GRanges} object.
183
Here is an example where the 2nd row of a \Rclass{GRanges} object
184
is replaced with the 1st row of \Robject{gr}.
187
singles <- split(gr, names(gr))
189
grMod[2] <- singles[[1]]
193
Here is a second example where the metadata for score from the 3rd
194
element is replaced with the score from the 2nd row etc.
197
grMod[2,1] <- singles[[3]][,1]
201
There are also methods to repeat, reverse, or select specific portions
202
of \Rclass{GRanges} objects.
205
rep(singles[[2]], times = 3)
209
window(gr, start=2,end=4)
210
gr[IRanges(start=c(2,7), end=c(3,9))]
214
\subsection{Basic interval operations for \Rclass{GRanges} objects}
216
Basic interval characteristics of \Rclass{GRanges} objects can
217
be extracted using the \Rmethod{start}, \Rmethod{end}, \Rmethod{width},
218
and \Rmethod{range} methods.
222
g <- append(g, singles[[10]])
229
The \Rclass{GRanges} class also has many methods for manipulating the
230
intervals. For example, the \Rmethod{flank} method can be used to recover
231
regions flanking the set of ranges represented by the \Rclass{GRanges}
232
object. So to get a \Rclass{GRanges} object containing the ranges that
233
include the 10 bases upstream of the ranges:
239
And to include the downstream bases:
242
flank(g, 10, start=FALSE)
245
Similar to \Rmethod{flank}, there are also operations to
246
\Rmethod{resize} and \Rmethod{shift} our \Rclass{GRanges} object. The
247
\Rmethod{shift} method will move the ranges by a specific number of
248
base pairs, and the \Rmethod{resize} method will extend the ranges by
256
The \Rmethod{reduce} will align the ranges and merge overlapping
257
ranges to produce a simplified set.
263
Sometimes you may be interested in the spaces or the qualities of the
264
spaces between the ranges represented by your \Rclass{GRanges} object.
265
The \Rmethod{gaps} method will help you calculate the spaces between a
266
reduced version of your ranges:
272
And sometimes you also may want to know how many quantitatively unique
273
fragments your ranges could possibly represent. For this task there
274
is the \Rmethod{disjoin} method.
280
One of the most powerful methods for looking at \Rclass{GRanges}
281
objects is the \Rmethod{coverage} method. The \Rmethod{coverage}
282
method quantifies the degree of overlap for all the ranges in a
283
\Rclass{GRanges} object.
290
\subsection{Interval set operations for \Rclass{GRanges} objects}
292
There are also operations for calculating relationships between
293
different \Rclass{GRanges} objects. Here are a some examples for
294
how you can calculate the \Rmethod{union}, the \Rmethod{intersect}
295
and the asymmetric difference (using \Rmethod{setdiff}).
304
In addition, there is similar set of operations that act at the level
305
of the individual ranges within each \Rclass{GRanges}. These
306
operations all begin with a ``p", which is short for parallel. A
307
requirement for this set of operations is that the number of elements
308
in each \Rclass{GRanges} object has to be the same, and that both of
309
the objects have to have the same seqnames and strand assignments
314
ranges(g3[1]) <- IRanges(start=5, end=12)
320
For even more information on the \Rcode{GRanges} classes be sure to
321
consult the manual page.
323
<<manPage, eval=FALSE>>=
328
\section{\Rclass{GRangesList}: Multiple Interval Range Features}
330
Some important genomic features, such as spliced transcripts that are
331
are comprised of exons, are inherently compound structures. Such a
332
feature makes much more sense when expressed as a compound object
333
such as a \Rclass{GRangesList}. Whenever genomic features consist of
334
multiple ranges that are grouped by a parent feature, they can be
335
represented as \Rclass{GRangesList} object. Consider the simple
336
example of the two transcript \Rfunction{GRangesList} below created
337
using the \Rfunction{GRangesList} constructor.
339
<<example-GRangesList>>=
341
GRanges(seqnames = "chr2", ranges = IRanges(3, 6),
342
strand = "+", score = 5L, GC = 0.45)
344
GRanges(seqnames = c("chr1", "chr1"),
345
ranges = IRanges(c(7,13), width = 3),
346
strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
347
grl <- GRangesList("txA" = gr1, "txB" = gr2)
351
The \Rmethod{show} method for a \Rclass{GRangesList} object displays
352
it as a named list of \Rclass{GRanges} objects, where the names of
353
this list are considered to be the names of the grouping feature. In
354
the example above, the groups of individual exon ranges are represented
355
as separate \Rclass{GRanges} objects which are further organized into a
356
list structure where each element name is a transcript name. Many
357
other combinations of grouped and labeled \Rclass{GRanges} objects are
358
possible of course, but this example is expected to be a common
362
\subsection{Basic \Rclass{GRangesList} accessors}
364
Just as with \Rclass{GRanges} object, the components of the genomic
365
coordinates within a \Rclass{GRangesList} object can be extracted
366
using simple accessor methods. Not surprisingly, the
367
\Rclass{GRangesList} objects have many of the same accessors as
368
\Rclass{GRanges} objects. The difference is that many of these
369
methods return a list since the input is now essentially a list of
370
\Rclass{GRanges} objects. Here are a few examples:
372
<<basicGRLAccessors>>=
378
The \Rmethod{length} and \Rmethod{names} methods will return the length
379
or names of the list and the \Rmethod{seqlengths} method will return the
380
set of sequence lengths.
388
The \Rmethod{elementLengths} method returns a list of integers
389
corresponding to the result of calling \Rmethod{length} on each
390
individual \Rclass{GRanges} object contained by the
391
\Rclass{GRangesList}. This is a faster alternative to calling
392
\Rmethod{lapply} on the \Rclass{GRangesList}.
398
You can also use \Rmethod{isEmpty} to test if a \Rclass{GRangesList}
399
object contains anything.
405
Finally, in the context of a \Rclass{GRangesList} object, the
406
\Rmethod{mcols} method performs a similar operation to what it
407
does on a \Rclass{GRanges} object. However, this metadata now
408
refers to information at the list level instead of the level
409
of the individual \Rclass{GRanges} objects.
412
mcols(grl) <- c("Transcript A","Transcript B")
417
\subsection{Combining \Rclass{GRangesList} objects}
419
\Rclass{GRangesList} objects can be unlisted to combine the separate
420
\Rclass{GRanges} objects that they contain as an expanded
427
You can also append values together useing \Rmethod{append} or
431
% grl2 <- shift(grl,10)
432
% names(grl2) <- c("shiftTxA","shiftTxB")
439
\subsection{Basic interval operations for \Rclass{GRangesList} objects}
441
For interval operations, many of the same methods exist for
442
\Rclass{GRangesList} objects that exist for \Rclass{GRanges} objects.
450
And as with \Rclass{GRanges} objects, you can also shift all the
451
\Rclass{GRanges} objects in a \Rclass{GRangesList} object, or
452
calculate the coverage. Both of these operations are also carried out
453
across each \Rclass{GRanges} list member.
461
\subsection{Subsetting \Rclass{GRangesList} objects}
463
As you might guess, the subsetting of a \Rclass{GRangesList} object is
464
quite different from subsetting on a \Rclass{GRanges} object in that
465
it acts as if you are subseting a list. If you try out the following
466
you will notice that the standard conventions have been followed.
468
<<subsetGRL, eval=FALSE>>=
475
But in addition to this, when subsetting a \Rclass{GRangesList}, you
476
can also pass in a second parameter (as with a \Rclass{GRanges} object)
477
to again specify which of the metadata columns you wish to select.
484
The \Rmethod{head}, \Rmethod{tail}, \Rmethod{rep}, \Rmethod{rev},
485
and \Rmethod{window} methods all behave as you would expect them
486
to for a list object. For example, the elements
487
referred to by \Rmethod{window} are now list
488
elements instead of \Rclass{GRanges} elements.
491
rep(grl[[1]], times = 3)
495
window(grl, start=1, end=1)
496
grl[IRanges(start=2, end=2)]
500
\subsection{Looping over \Rclass{GRangesList} objects}
502
For \Rclass{GRangesList} objects there is also a family of
503
\Rmethod{apply} methods. These include \Rmethod{lapply},
504
\Rmethod{sapply}, \Rmethod{mapply}, \Rmethod{endoapply},
505
\Rmethod{mendoapply}, \Rmethod{Map}, and \Rmethod{Reduce}.
507
The different looping methods defined for \Rclass{GRangesList} objects
508
are useful for returning different kinds of results. The standard
509
\Rmethod{lapply} and \Rmethod{sapply} behave according to convention,
510
with the \Rmethod{lapply} method returning a list and \Rmethod{sapply}
511
returning a more simplified output.
518
As with \Rclass{IRanges} objects, there is also a multivariate version
519
of \Rmethod{sapply}, called \Rmethod{mapply}, defined for
520
\Rclass{GRangesList} objects. And, if you don't want the results
521
simplified, you can call the \Rmethod{Map} method, which does the same
522
things as \Rmethod{mapply} but without simplifying the output.
525
grl2 <- shift(grl, 10)
526
names(grl2) <- c("shiftTxA", "shiftTxB")
532
Sometimes, you may not want to get back a simplified output or a list.
533
Sometimes you will want to get back a modified version of the
534
\Rclass{GRangesList} that you originally passed in. This is
535
conceptually similar to the mathematical notion of an endomorphism.
536
This is achieved using the \Rmethod{endoapply} method, which will return
537
the results as a \Rclass{GRangesList} object.
543
And, there is also a multivariate version of the \Rmethod{endoapply}
544
method in the form of the \Rmethod{mendoapply} method.
547
mendoapply(c,grl,grl2)
550
Finally, the \Rmethod{Reduce} method will allow the \Rclass{GRanges}
551
objects to be collapsed across the whole of the \Rclass{GRangesList}
554
% Again, this seems like a sub-optimal example to me.
560
For even more information on the \Rcode{GRangesList} classes be sure to
561
consult the manual page.
563
<<manPage2, eval=FALSE>>=
568
\section{Interval overlaps involving \Rclass{GRanges} and \Rclass{GRangesList} objects}
569
Interval overlapping is the process of comparing the ranges in two
570
objects to determine if and when they overlap. As such, it is perhaps
571
the most common operation performed on \Rclass{GRanges} and
572
\Rclass{GRangesList} objects. To this end, the \Rpackage{GenomicRanges}
573
package provides a family of interval overlap functions. The most general
574
of these functions is \Rfunction{findOverlaps}, which takes a query and a
575
subject as inputs and returns a \Rclass{Hits} object containing
576
the index pairings for the overlapping elements.
579
mtch <- findOverlaps(gr, grl)
584
As suggested in the sections discussing the nature of the
585
\Rclass{GRanges} and \Rclass{GRangesList} classes, the index in the
586
above matrix of hits for a \Rclass{GRanges} object is a single range
587
while for a \Rclass{GRangesList} object it is the set of ranges that
590
Another function in the overlaps family is \Rfunction{countOverlaps},
591
which tabulates the number of overlaps for each element in the query.
594
countOverlaps(gr, grl)
597
A third function in this family is \Rfunction{subsetByOverlaps},
598
which extracts the elements in the query that overlap at least one
599
element in the subject.
601
<<subsetByOverlaps>>=
602
subsetByOverlaps(gr,grl)
605
Finally, you can use the \Rcode{select} argument to get the index
606
of the first overlapping element in the subject for each element
610
findOverlaps(gr, grl, select="first")
611
findOverlaps(grl, gr, select="first")
615
\section{Genomic Alignments}
617
In addition to \Rclass{GRanges} and \Rclass{GRangesList} classes, the
618
\GenomicRanges{} package defines the \Rclass{GAlignments} class,
619
which is a more specialized container for storing a set of genomic
620
alignments. The class is intended to support alignments in general,
621
not only those coming from a 'Binary Alignment Map' or 'BAM' files.
622
Also alignments with gaps in the reference sequence (a.k.a.
623
\emph{gapped alignments}) are supported which, for example, makes
624
the class suited for storing junction reads from an RNA-seq experiment.
626
More precisely, a \Rclass{GAlignments} object is a vector-like
627
object where each element describes an \emph{alignment}, that is,
628
how a given sequence (called \emph{query} or \emph{read}, typically
629
short) aligns to a reference sequence (typically long).
631
As shown later in this document, a \Rclass{GAlignments} object
632
can be created from a 'BAM' file. In that case, each element in the
633
resulting object will correspond to a record in the file.
634
One important thing to note though is that not all the information
635
present in the BAM/SAM records is stored in the object. In particular,
636
for now, we discard the query sequences (SEQ field), the query ids
637
(QNAME field), the query qualities (QUAL), the mapping qualities (MAPQ)
638
and any other information that is not needed in order to support the
639
basic set of operations described in this document.
640
This also means that multi-reads (i.e. reads with multiple hits in the
641
reference) don't receive any special treatment i.e. the various SAM/BAM
642
records corresponding to a multi-read will show up in the \Rclass{GAlignments}
643
object as if they were coming from different/unrelated queries.
644
Also paired-end reads will be treated as single-end reads and the
645
pairing information will be lost. This might change in the future.
648
\subsection{Load a `BAM' file into a \Rclass{GAlignments} object}
650
First we use the \Rfunction{readGAlignments} function from the
651
\Rpackage{GenomicAlignments} package to load a toy `BAM' file into
652
a \Rclass{GAlignments} object:
654
library(GenomicAlignments)
655
aln1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
656
aln1 <- readGAlignments(aln1_file)
661
3271 `BAM' records were loaded into the object.
663
Note that \Rfunction{readGAlignments} would have discarded
664
any `BAM' record describing an unaligned query (see description
665
of the <flag> field in the SAM Format Specification
666
\footnote{\url{http://samtools.sourceforge.net/SAM1.pdf}}
667
for more information).
668
The reader interested in tracking down these events can always
669
use the \Rfunction{scanBam} function but this goes beyond the scope
672
\subsection{Simple accessor methods}
674
There is one accessor per field displayed by the \Rmethod{show} method
675
and it has the same name as the field. All of them return a vector or
676
factor of the same length as the object:
689
\subsection{More accessor methods}