1
\name{GAlignments-class}
5
\alias{class:GAlignments}
6
\alias{GAlignments-class}
10
\alias{updateObject,GAlignments-method}
11
\alias{readGAlignments}
14
\alias{length,GAlignments-method}
15
\alias{names,GAlignments-method}
16
\alias{seqnames,GAlignments-method}
18
\alias{rname,GAlignments-method}
19
\alias{strand,GAlignments-method}
20
\alias{names<-,GAlignments-method}
21
\alias{seqnames<-,GAlignments-method}
23
\alias{rname<-,GAlignments-method}
24
\alias{strand<-,GAlignments-method}
26
\alias{cigar,GAlignments-method}
28
\alias{qwidth,GAlignments-method}
29
\alias{start,GAlignments-method}
30
\alias{end,GAlignments-method}
31
\alias{width,GAlignments-method}
32
\alias{ngap,GAlignments-method}
33
\alias{elementMetadata<-,GAlignments-method}
34
\alias{seqinfo,GAlignments-method}
35
\alias{seqinfo<-,GAlignments-method}
42
\alias{grglist,GAlignments-method}
43
\alias{granges,GAlignments-method}
44
\alias{introns,GAlignments-method}
45
\alias{rglist,GAlignments-method}
46
\alias{ranges,GAlignments-method}
47
\alias{coerce,GAlignments,GRangesList-method}
48
\alias{coerce,GAlignments,GRanges-method}
49
\alias{coerce,GAlignments,RangesList-method}
50
\alias{coerce,GAlignments,Ranges-method}
51
\alias{as.data.frame,GAlignments-method}
54
\alias{c,GAlignments-method}
57
\alias{show,GAlignments-method}
61
\alias{qnarrow,GAlignments-method}
62
\alias{splitAsListReturnedClass,GAlignments-method}
65
\alias{GappedAlignments}
66
\alias{class:GappedAlignments}
67
\alias{GappedAlignments-class}
68
\alias{show,GappedAlignments-method}
69
\alias{readGappedAlignments}
72
\title{GAlignments objects}
75
The GAlignments class is a simple container which purpose is
76
to store a set of genomic alignments that will hold just enough
77
information for supporting the operations described below.
81
A GAlignments object is a vector-like object where each element
82
describes a genomic alignment i.e. how a given sequence (called "query"
83
or "read", typically short) aligns to a reference sequence (typically
86
Typically, a GAlignments object will be created by loading
87
records from a BAM (or SAM) file and each element in the resulting
88
object will correspond to a record. BAM/SAM records generally contain
89
a lot of information but only part of that information is loaded
90
in the GAlignments object. In particular, we discard the query
91
sequences (SEQ field), the query qualities (QUAL), the mapping qualities
92
(MAPQ) and any other information that is not needed in order to support
93
the operations or methods described below.
95
This means that multi-reads (i.e. reads with multiple hits in the
96
reference) won't receive any special treatment i.e. the various SAM/BAM
97
records corresponding to a multi-read will show up in the GAlignments
98
object as if they were coming from different/unrelated queries.
99
Also paired-end reads will be treated as single-end reads and the
100
pairing information will be lost (see \code{?\link{GAlignmentPairs}}
101
for how to handle aligned paired-end reads).
103
Each element of a GAlignments object consists of:
105
\item The name of the reference sequence. (This is the RNAME field
106
in a SAM/BAM record.)
107
\item The strand in the reference sequence to which the query is
108
aligned. (This information is stored in the FLAG field in a
110
\item The CIGAR string in the "Extended CIGAR format" (see the SAM
111
Format Specifications for the details).
112
\item The 1-based leftmost position/coordinate of the clipped query
113
relative to the reference sequence. We will refer to it as
114
the "start" of the query. (This is the POS field in a SAM/BAM
116
\item The 1-based rightmost position/coordinate of the clipped query
117
relative to the reference sequence. We will refer to it as
118
the "end" of the query. (This is NOT explicitly stored in a
119
SAM/BAM record but can be inferred from the POS and CIGAR fields.)
120
Note that all positions/coordinates are always relative to
121
the first base at the 5' end of the plus strand of the reference
122
sequence, even when the query is aligned to the minus strand.
123
\item The genomic intervals between the "start" and "end" of the query
124
that are "covered" by the alignment. Saying that the full
125
[start,end] interval is covered is the same as saying that the
126
alignment has no gap (no N in the CIGAR). It is then considered
127
a simple alignment. Note that a simple alignment can have
128
mismatches or deletions (in the reference). In other words, a
129
deletion, encoded with a D, is NOT considered a gap.
132
Note that the last 2 items are not expicitly stored in the GAlignments
133
object: they are inferred on-the-fly from the CIGAR and the "start".
135
Optionally, a GAlignments object can have names (accessed thru the
136
\code{\link[base]{names}} generic function) which will be coming from
137
the QNAME field of the SAM/BAM records.
139
The rest of this man page will focus on describing how to:
141
\item Access the information stored in a GAlignments object
142
in a way that is independent from how the data are actually
144
\item How to create and manipulate a GAlignments object.
148
\section{Constructors}{
151
\code{readGAlignments(file, format="BAM", use.names=FALSE, ...)}:
152
Read a file containing aligned reads as a GAlignments object.
153
By default (i.e. \code{use.names=FALSE}), the resulting object has no
154
names. If \code{use.names} is \code{TRUE}, then the names are
155
constructed from the query template names (QNAME field in a SAM/BAM
158
Note that this function is just a front-end that delegates to the
159
format-specific back-end function specified via the \code{format}
160
argument. The \code{use.names} argument and any extra argument are
161
passed to the back-end function.
162
Only the BAM format is supported for now. Its back-end is the
163
\code{\link[Rsamtools]{readGAlignmentsFromBam}} function defined
164
in the Rsamtools package.
165
See \code{?\link[Rsamtools]{readGAlignmentsFromBam}} for
166
more information (you might need to install and load the Rsamtools
170
\code{GAlignments(seqnames=Rle(factor()), pos=integer(0),
172
strand=NULL, names=NULL, seqlengths=NULL, ...)}:
173
Low-level GAlignments constructor. Generally not used directly.
174
Named arguments in \code{...} are used as metadata columns.
180
In the code snippets below, \code{x} is a GAlignments object.
185
Return the number of alignments in \code{x}.
188
\code{names(x)}, \code{names(x) <- value}:
189
Get or set the names of \code{x}.
190
See \code{readGAlignments} above for how to automatically extract
191
and set the names from the file to read.
194
\code{seqnames(x)}, \code{seqnames(x) <- value}:
195
Get or set the name of the reference sequence for each alignment
196
in \code{x} (see Details section above for more information about
197
the RNAME field of a SAM/BAM file).
198
\code{value} can be a factor, or a 'factor' \link[IRanges]{Rle},
199
or a character vector.
202
\code{rname(x)}, \code{rname(x) <- value}:
203
Same as \code{seqnames(x)} and \code{seqnames(x) <- value}.
206
\code{strand(x)}, \code{strand(x) <- value}:
207
Get or set the strand for each alignment in \code{x} (see Details
208
section above for more information about the strand of an alignment).
209
\code{value} can be a factor (with levels +, - and *), or a 'factor'
210
\link[IRanges]{Rle}, or a character vector.
214
Returns a character vector of length \code{length(x)}
215
containing the CIGAR string for each alignment.
219
Returns an integer vector of length \code{length(x)}
220
containing the length of the query *after* hard clipping
221
(i.e. the length of the query sequence that is stored in
222
the corresponding SAM/BAM record).
225
\code{start(x)}, \code{end(x)}:
226
Returns an integer vector of length \code{length(x)}
227
containing the "start" and "end" (respectively) of the query
228
for each alignment. See Details section above for the exact
229
definitions of the "start" and "end" of a query.
230
Note that \code{start(x)} and \code{end(x)} are equivalent
231
to \code{start(granges(x))} and \code{end(granges(x))},
232
respectively (or, alternatively, to \code{min(rglist(x))} and
233
\code{max(rglist(x))}, respectively).
237
Equivalent to \code{width(granges(x))} (or, alternatively, to
238
\code{end(x) - start(x) + 1L}).
239
Note that this is generally different from \code{qwidth(x)}
240
except for alignments with a trivial CIGAR string (i.e. a
241
string of the form \code{"<n>M"} where <n> is a number).
245
Returns an integer vector of the same length as \code{x} containing
246
the number of gaps (i.e. N operations in the CIGAR) for each alignment.
247
Equivalent to \code{unname(elementLengths(rglist(x))) - 1L}.
250
\code{seqinfo(x)}, \code{seqinfo(x) <- value}:
251
Get or set the information about the underlying sequences.
252
\code{value} must be a \link{Seqinfo} object.
255
\code{seqlevels(x)}, \code{seqlevels(x) <- value}:
256
Get or set the sequence levels.
257
\code{seqlevels(x)} is equivalent to \code{seqlevels(seqinfo(x))}
258
or to \code{levels(seqnames(x))}, those 2 expressions being
259
guaranteed to return identical character vectors on a GAlignments
260
object. \code{value} must be a character vector with no NAs.
261
See \code{?\link{seqlevels}} for more information.
264
\code{seqlengths(x)}, \code{seqlengths(x) <- value}:
265
Get or set the sequence lengths.
266
\code{seqlengths(x)} is equivalent to \code{seqlengths(seqinfo(x))}.
267
\code{value} can be a named non-negative integer or numeric vector
271
\code{isCircular(x)}, \code{isCircular(x) <- value}:
272
Get or set the circularity flags.
273
\code{isCircular(x)} is equivalent to \code{isCircular(seqinfo(x))}.
274
\code{value} must be a named logical vector eventually with NAs.
277
\code{genome(x)}, \code{genome(x) <- value}:
278
Get or set the genome identifier or assembly name for each sequence.
279
\code{genome(x)} is equivalent to \code{genome(seqinfo(x))}.
280
\code{value} must be a named character vector eventually with NAs.
283
\code{seqnameStyle(x)}:
284
Get or set the seqname style for \code{x}.
285
Note that this information is not stored in \code{x} but inferred
286
by looking up \code{seqnames(x)} against a seqname style database
287
stored in the seqnames.db metadata package (required).
288
\code{seqnameStyle(x)} is equivalent to \code{seqnameStyle(seqinfo(x))}
289
and can return more than 1 seqname style (with a warning)
290
in case the style cannot be determined unambiguously.
296
In the code snippets below, \code{x} is a GAlignments object.
300
\code{grglist(x, order.as.in.query=FALSE,
301
drop.D.ranges=FALSE)},
302
\code{rglist(x, order.as.in.query=FALSE,
303
drop.D.ranges=FALSE)}:
305
Return either a \link{GRangesList} or a \link[IRanges]{RangesList}
306
object of length \code{length(x)} where the i-th element represents
307
the ranges (with respect to the reference) of the i-th alignment in
310
More precisely, the \link[IRanges]{RangesList} object returned by
311
\code{rglist(x)} is a \link[IRanges]{CompressedIRangesList} object.
313
The \code{order.as.in.query} toggle affects the order of the ranges
314
\emph{within} each top-level element of the returned object.
316
If \code{FALSE} (the default), then the ranges are ordered from 5' to 3'
317
in elements associated with the plus strand (i.e. corresponding to
318
alignments located on the plus strand), and from 3' to 5' in elements
319
associated with the minus strand. So, whatever the strand is, the ranges
320
are in ascending order (i.e. left-to-right).
322
If \code{TRUE}, then the order of the ranges in elements associated
323
with the \emph{minus} strand is reversed. So they end up being
324
ordered from 5' to 3' too, which means that they are now in decending
325
order (i.e. right-to-left). It also means that, when
326
\code{order.as.in.query=TRUE} is used, the ranges are
327
\emph{always} ordered consistently with the original "query template",
328
that is, in the order defined by walking the "query template" from the
329
beginning to the end.
331
If \code{drop.D.ranges} is \code{TRUE}, then deletions (D operations
332
in the CIGAR) are treated like gaps (N operations in the CIGAR),
333
that is, the ranges corresponding to deletions are dropped.
335
See Details section above for more information.
338
\code{granges(x)}, \code{ranges(x)}:
339
Return either a \link{GRanges} or a \link[IRanges]{Ranges}
340
object of length \code{length(x)} where each element represents the
341
regions in the reference to which a query is aligned.
343
More precisely, the \link[IRanges]{Ranges} object returned by
344
\code{ranges(x)} is an \link[IRanges]{IRanges} object.
347
\code{introns(x)}: Extract the gaps (i.e. N operations in the CIGAR)
348
as a \link{GRangesList} object of the same length as \code{x}.
351
psetdiff(granges(x), grglist(x, order.as.in.query=TRUE))
355
\code{as(x, "GRangesList")}, \code{as(x, "GRanges")},
356
\code{as(x, "RangesList")}, \code{as(x, "Ranges")}:
357
Alternate ways of doing \code{grglist(x)}, \code{granges(x)},
358
\code{rglist(x)}, \code{ranges(x)}, respectively.
363
\section{Subsetting and related operations}{
364
In the code snippets below, \code{x} is a GAlignments object.
369
Return a new GAlignments object made of the selected
370
alignments. \code{i} can be a numeric or logical vector.
379
Concatenates the GAlignments objects in \code{...}.
384
\section{Other methods}{
388
\code{qnarrow(x, start=NA, end=NA, width=NA)}:
389
\code{x} is a GAlignments object.
390
Return a new GAlignments object of the same length as \code{x}
391
describing how the narrowed query sequences align to the reference.
392
The \code{start}/\code{end}/\code{width} arguments describe how
393
to narrow the query sequences. They must be vectors of integers.
394
NAs and negative values are accepted and "solved" according to the
395
rules of the SEW (Start/End/Width) interface (see
396
\code{?\link[IRanges]{solveUserSEW}} for the details).
400
By default the \code{show} method displays 5 head and 5 tail
401
elements. This can be changed by setting the global options
402
\code{showHeadLines} and \code{showTailLines}. If the object
403
length is less than (or equal to) the sum of these 2 options
404
plus 1, then the full object is displayed.
405
Note that these options also affect the display of \link{GRanges}
406
and \link{GAlignmentPairs} objects, as well as other objects defined
407
in the IRanges and Biostrings packages (e.g. \link[IRanges]{Ranges}
408
and \link[Biostrings]{XStringSet} objects).
414
\url{http://samtools.sourceforge.net/}
418
H. Pages and P. Aboyoun
423
\item \link{GAlignmentPairs-class}.
424
\item \code{\link[Rsamtools]{readGAlignmentsFromBam}}.
425
\item \link{GRangesList-class}.
426
\item \link{GRanges-class}.
427
\item \link{findOverlaps-methods}.
428
\item \link{coverage-methods}.
429
\item \code{\link{seqinfo}}.
430
\item \link[IRanges]{CompressedIRangesList-class}.
431
\item \link{setops-methods}.
436
library(Rsamtools) # for ScanBamParam() and the ex1.bam file
437
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
438
gal <- readGAlignments(ex1_file, param=ScanBamParam(what="flag"))
441
## ---------------------------------------------------------------------
442
## A. BASIC MANIPULATION
443
## ---------------------------------------------------------------------
446
names(gal) # no names by default
458
## Rename the reference sequences:
459
seqlevels(gal) <- sub("seq", "chr", seqlevels(gal))
462
grglist(gal) # a GRangesList object
463
stopifnot(identical(unname(elementLengths(grglist(gal))), ngap(gal) + 1L))
464
granges(gal) # a GRanges object
465
rglist(gal) # a CompressedIRangesList object
466
stopifnot(identical(unname(elementLengths(rglist(gal))), ngap(gal) + 1L))
467
ranges(gal) # an IRanges object
468
introns(gal) # a GRangesList object
469
stopifnot(identical(unname(elementLengths(introns(gal))), ngap(gal)))
471
## Modify the number of lines in 'show'
472
options("showHeadLines"=3)
473
options("showTailLines"=2)
477
options("showHeadLines"=NULL)
478
options("showTailLines"=NULL)
480
## ---------------------------------------------------------------------
482
## ---------------------------------------------------------------------
483
gal[strand(gal) == "-"]
484
gal[grep("I", cigar(gal), fixed=TRUE)]
485
gal[grep("N", cigar(gal), fixed=TRUE)] # no gaps
487
## A confirmation that all the queries map to the reference with no
489
stopifnot(all(ngap(gal) == 0))
491
## Different ways to subset:
492
gal[6] # a GAlignments object of length 1
493
grglist(gal)[[6]] # a GRanges object of length 1
494
rglist(gal)[[6]] # a NormalIRanges object of length 1
496
## D operations are NOT gaps:
497
ii <- grep("D", cigar(gal), fixed=TRUE)
502
## qwidth() vs width():
503
gal[qwidth(gal) != width(gal)]
505
## This MUST return an empty object:
506
gal[cigar(gal) == "35M" & qwidth(gal) != 35]
507
## but this doesn't have too:
508
gal[cigar(gal) != "35M" & qwidth(gal) == 35]
510
## ---------------------------------------------------------------------
511
## C. qnarrow()/narrow()
512
## ---------------------------------------------------------------------
513
## Note that there is no difference between qnarrow() and narrow() when
514
## all the alignments are simple and with no indels.
516
## This trims 3 nucleotides on the left and 5 nucleotides on the right
517
## of each alignment:
518
qnarrow(gal, start=4, end=-6)
519
## Note that the 'start' and 'end' arguments specify what part of each
520
## query sequence should be kept (negative values being relative to the
521
## right end of the query sequence), not what part should be trimmed.
523
## Trimming on the left doesn't change the "end" of the queries.
524
qnarrow(gal, start=21)
525
stopifnot(identical(end(qnarrow(gal, start=21)), end(gal)))