12
\alias{cigarToIRanges}
13
\alias{cigarToIRangesListByAlignment}
14
\alias{cigarToIRangesListByRName}
16
\alias{queryLoc2refLoc}
17
\alias{queryLocs2refLocs}
20
\alias{cigarToRleList}
21
\alias{cigarToCigarTable}
22
\alias{summarizeCigarTable}
25
CIGAR utility functions
28
Utility functions for low-level CIGAR manipulation.
33
cigarToQWidth(cigar, before.hard.clipping=FALSE)
36
cigarQNarrow(cigar, start=NA, end=NA, width=NA)
37
cigarNarrow(cigar, start=NA, end=NA, width=NA)
40
drop.D.ranges=FALSE, drop.empty.ranges=FALSE,
43
cigarToIRangesListByAlignment(cigar, pos, flag=NULL,
44
drop.D.ranges=FALSE, drop.empty.ranges=FALSE,
47
cigarToIRangesListByRName(cigar, rname, pos, flag=NULL,
48
drop.D.ranges=FALSE, drop.empty.ranges=FALSE,
51
queryLoc2refLoc(qloc, cigar, pos=1)
52
queryLocs2refLocs(qlocs, cigar, pos, flag=NULL)
56
cigarToCigarTable(cigar)
57
summarizeCigarTable(x)
62
A character vector/factor containing the extended CIGAR
64
For \code{cigarToIRanges} and \code{queryLoc2refLoc},
65
this must be a single string (i.e. a character vector/factor
68
\item{before.hard.clipping}{
69
Should the returned widths be the lengths of the reads before
70
or after "hard clipping"? Hard clipping of a read is
71
encoded with an H in the CIGAR.
72
If NO (\code{before.hard.clipping=FALSE}, the default),
73
then the returned widths are the lengths of the query
74
sequences stored in the SAM/BAM file.
75
If YES (\code{before.hard.clipping=TRUE}), then the returned
76
widths are the lengths of the original reads.
78
\item{start,end,width}{
79
Vectors of integers. NAs and negative values are accepted and
80
"solved" according to the rules of the SEW (Start/End/Width)
81
interface (see \code{?\link[IRanges]{solveUserSEW}} for the details).
84
Should the ranges corresponding to a deletion from the
85
reference (encoded with a D in the CIGAR) be dropped?
86
By default we keep them to be consistent with the pileup tool
88
Note that, when \code{drop.D.ranges} is \code{TRUE}, then Ds
89
and Ns in the CIGAR are equivalent.
91
\item{drop.empty.ranges}{
92
Should empty ranges be dropped?
95
Should adjacent ranges coming from the same cigar be merged or not?
96
Using \code{TRUE} (the default) can significantly reduce the
97
size of the returned object.
100
An integer vector containing the 1-based leftmost
101
position/coordinate for each (eventually clipped) read
105
\code{NULL} or an integer vector containing the SAM flag for
107
According to the SAM specs, flag bit 0x004 has the following
108
meaning: when bit 0x004 is ON then "the query sequence itself
110
When \code{flag} is provided,
111
\code{cigarToIRangesListByAlignment} and
112
\code{cigarToIRangesListByRName} ignore these reads.
115
A character vector/factor containing the name of the
116
reference sequence associated with each read (i.e. the
117
name of the sequence the read has been aligned to).
120
An integer vector containing "query-based locations" i.e.
121
1-based locations relative to the query sequence
122
stored in the SAM/BAM file.
125
A list of the same length as \code{cigar} where each
126
element is an integer vector containing "query-based
127
locations" i.e. 1-based locations relative to the corresponding
128
query sequence stored in the SAM/BAM file.
131
A \link[IRanges]{DataFrame} produced by \code{cigarToCigarTable}.
136
For \code{cigarOpTable}: An integer matrix with number of rows equal
137
to the length of \code{cigar} and seven columns, one for each extended
140
For \code{cigarToQWidth}: An integer vector of the same length
141
as \code{cigar} where each element is the width of the query (i.e.
142
the length of the query sequence) as inferred from the corresponding
143
element in \code{cigar} (NAs in \code{cigar} will produce NAs in the
146
For \code{cigarQNarrow} and \code{cigarNarrow}: A character vector
147
of the same length as \code{cigar} containing the narrowed cigars.
148
In addition the vector has an "rshift" attribute which is an integer
149
vector of the same length as \code{cigar}. It contains the values that
150
would need to be added to the POS field of a SAM/BAM file as a
151
consequence of this cigar narrowing.
153
For \code{cigarToWidth}: An integer vector of the same length
154
as \code{cigar} where each element is the width of the alignment
155
(i.e. its total length on the reference, gaps included)
156
as inferred from the corresponding element in \code{cigar}
157
(NAs in \code{cigar} will produce NAs in the returned vector).
159
For \code{cigarToIRanges}: An \link[IRanges]{IRanges} object
160
describing where the bases in the read align with respect to
161
an imaginary reference sequence assuming that the leftmost
162
aligned base is at position 1 in the reference (i.e. at the
165
For \code{cigarToIRangesListByAlignment}: A
166
\link[IRanges]{CompressedIRangesList} object of the same length
169
For \code{cigarToIRangesListByRName}: A named \link[IRanges]{IRangesList}
170
object with one element (\link[IRanges]{IRanges}) per unique reference
173
For \code{queryLoc2refLoc}: An integer vector of the same length as
174
\code{qloc} containing the "reference-based locations" (i.e. the
175
1-based locations relative to the reference sequence) corresponding
176
to the "query-based locations" passed in \code{qloc}.
178
For \code{queryLocs2refLocs}: A list of the same length as
179
\code{qlocs} where each element is an integer vector containing
180
the "reference-based locations" corresponding to the "query-based
181
locations" passed in the corresponding element in \code{qlocs}.
183
For \code{splitCigar}: A list of the same length as \code{cigar}
184
where each element is itself a list with 2 elements of the same
185
lengths, the 1st one being a raw vector containing the CIGAR
186
operations and the 2nd one being an integer vector containing
187
the lengths of the CIGAR operations.
189
For \code{cigarToRleList}: A \link[IRanges]{CompressedRleList} object.
191
For \code{cigarToCigarTable}: A frequency table of the CIGARs
192
in the form of a \link[IRanges]{DataFrame} with two columns:
193
\code{cigar} (a \link[IRanges]{CompressedRleList}) and \code{count}
196
For \code{summarizeCigarTable}: A list with two elements:
197
\code{AlignedCharacters} (integer) and \code{Indels} (matrix)
201
\url{http://samtools.sourceforge.net/}
205
H. Pages and P. Aboyoun
209
\link[IRanges]{IRanges-class},
210
\link[IRanges]{IRangesList-class},
211
\code{\link[IRanges]{coverage}},
212
\link[IRanges]{RleList-class}
216
## ---------------------------------------------------------------------
217
## A. SIMPLE EXAMPLES
218
## ---------------------------------------------------------------------
220
## With a cigar vector of length 1:
221
cigar1 <- "3H15M55N4M2I6M2D5M6S"
223
## cigarToQWidth()/cigarToWidth():
224
cigarToQWidth(cigar1)
225
cigarToQWidth(cigar1, before.hard.clipping=TRUE)
229
cigarQNarrow(cigar1, start=4, end=-3)
230
cigarQNarrow(cigar1, start=10)
231
cigarQNarrow(cigar1, start=19)
232
cigarQNarrow(cigar1, start=24)
235
cigarNarrow(cigar1) # only drops the soft/hard clipping
236
cigarNarrow(cigar1, start=10)
237
cigarNarrow(cigar1, start=15)
238
cigarNarrow(cigar1, start=15, width=57)
239
cigarNarrow(cigar1, start=16)
240
#cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar)
241
cigarNarrow(cigar1, start=71)
242
cigarNarrow(cigar1, start=72)
243
cigarNarrow(cigar1, start=75)
246
cigarToIRanges(cigar1)
247
cigarToIRanges(cigar1, reduce.ranges=FALSE)
248
cigarToIRanges(cigar1, drop.D.ranges=TRUE)
250
## With a cigar vector of length 4:
251
cigar2 <- c("40M", cigar1, "2S10M2000N15M", "3H25M5H")
252
pos <- c(1, 1001, 1, 351)
253
cigarToIRangesListByAlignment(cigar2, pos)
254
rname <- c("chr6", "chr6", "chr2", "chr6")
255
cigarToIRangesListByRName(cigar2, rname, pos)
260
cigarToRleList(cigar2)
262
cigarToCigarTable(cigar2)
263
cigarToCigarTable(cigar2)[,"cigar"]
264
cigarToCigarTable(cigar2)[,"count"]
266
summarizeCigarTable(cigarToCigarTable(cigar2))
268
## ---------------------------------------------------------------------
270
## ---------------------------------------------------------------------
273
## We simulate 20 millions aligned reads, all 40-mers. 95% of them
274
## align with no indels. 5% align with a big deletion in the
275
## reference. In the context of an RNAseq experiment, those 5% would
276
## be suspected to be "junction reads".
279
njunctionreads <- nreads * 5L / 100L
280
cigar3 <- character(nreads)
282
junctioncigars <- paste(
283
paste(10:30, "M", sep=""),
284
paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
285
paste(30:10, "M", sep=""), sep="")
286
cigar3[sample(nreads, njunctionreads)] <- junctioncigars
287
some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
288
rname <- sample(some_fake_rnames, nreads, replace=TRUE)
289
pos <- sample(80000000L, nreads, replace=TRUE)
291
## The following takes < 5 sec. to complete:
292
system.time(rglist <- cigarToIRangesListByAlignment(cigar3, pos))
294
## The following takes < 10 sec. to complete:
295
system.time(irl <- cigarToIRangesListByRName(cigar3, rname, pos))
297
## Internally, cigarToIRangesListByRName() turns 'rname' into a factor
298
## before starting the calculation. Hence it will run sligthly
299
## faster if 'rname' is already a factor.
300
rname2 <- as.factor(rname)
301
system.time(irl2 <- cigarToIRangesListByRName(cigar3, rname2, pos))
303
## The sizes of the resulting objects are about 240M and 160M,
309
## ---------------------------------------------------------------------
310
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
311
## ---------------------------------------------------------------------
312
## The information stored in a BAM file can be used to compute the
313
## "coverage" of the mapped reads i.e. the number of reads that hit any
314
## given position in the reference genome.
315
## The following function takes the path to a BAM file and returns an
316
## object representing the coverage of the mapped reads that are stored
317
## in the file. The returned object is an RleList object named with the
318
## names of the reference sequences that actually receive some coverage.
320
extractCoverageFromBAM <- function(file)
322
## This ScanBamParam object allows us to load only the necessary
323
## information from the file.
324
param <- ScanBamParam(flag=scanBamFlag(isUnmappedQuery=FALSE,
326
what=c("rname", "pos", "cigar"))
327
bam <- scanBam(file, param=param)[[1]]
328
## Note that unmapped reads and reads that are PCR/optical duplicates
329
## have already been filtered out by using the ScanBamParam object above.
330
irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos)
331
irl <- irl[elementLengths(irl) != 0] # drop empty elements
336
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
337
extractCoverageFromBAM(f1)