1
%\VignetteIndexEntry{GenomicRanges HOWTOs}
2
%\VignetteDepends{GenomicRanges, Rsamtools, pasillaBamSubset, TxDb.Dmelanogaster.UCSC.dm3.ensGene, AnnotationHub, DESeq, edgeR, TxDb.Hsapiens.UCSC.hg19.knownGene, GenomicFeatures, Biostrings, BSgenome.Hsapiens.UCSC.hg19, KEGG.db, KEGGgraph, BSgenome.Scerevisiae.UCSC.sacCer2}
3
%\VignetteKeywords{sequence, sequencing, alignments}
4
%\VignettePackage{GenomicRanges}
6
\documentclass{article}
8
\usepackage[authoryear,round]{natbib}
10
<<style, eval=TRUE, echo=FALSE, results=tex>>=
11
BiocStyle::latex(use.unsrturl=FALSE)
14
\title{\Biocpkg{GenomicRanges} HOWTOs}
15
\author{Bioconductor Team}
16
\date{Edited: October 2013; Compiled: \today}
24
<<options, echo=FALSE>>=
26
options("showHeadLines" = 3)
27
options("showTailLines" = 3)
31
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
32
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
34
\section{Introduction}
36
This vignette is a collection of {\it HOWTOs}. Each {\it HOWTO} is
37
a short section that demonstrates how to use the containers and
38
operations implemented in the \Biocpkg{GenomicRanges} and related
39
packages (\Biocpkg{IRanges}, \Biocpkg{GenomicFeatures},
40
\Biocpkg{Rsamtools}, and \Biocpkg{Biostrings}) to perform a task
41
typically found in the context of a high throughput sequence analysis.
43
The HOWTOs are self contained, independent of each other, and can be
44
studied and reproduced in any order.
46
We assume the reader has some previous experience with \R{} and
47
with basic manipulation of \Rcode{GRanges}, \Rcode{GRangesList}, \Rcode{Rle},
48
\Rcode{RleList}, and \Rcode{DataFrame} objects. See the ``An Introduction
49
to Genomic Ranges Classes'' vignette located in the \Biocpkg{GenomicRanges}
50
package (in the same folder as this document) for an introduction to these
53
Additional recommended readings after this vignette are the ``Software for
54
Computing and Annotating Genomic Ranges'' paper[\citet{Lawrence2013ranges}]
55
and the ``Counting reads with \Rfunction{summarizeOverlaps}'' vignette
56
located in the \Biocpkg{GenomicAlignments} package.
58
To display the list of vignettes available in the \Biocpkg{GenomicRanges},
59
use \Rcode{browseVignettes("GenomicRanges")}.
62
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
63
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
65
\section{How to read BAM files into \R{}}
67
As sample data we use the \Biocpkg{pasillaBamSubset} data package which
68
contains both a BAM file with single-end reads (untreated1\_chr4) and
69
a BAM file with paired-end reads (untreated3\_chr4). Each file is a subset
70
of chr4 from the "Pasilla" experiment. See ?\Biocpkg{pasillaBamSubset}
73
<<load, results=hide>>=
74
library(pasillaBamSubset)
75
un1 <- untreated1_chr4() ## single-end reads
78
Several functions are available for reading BAM files into \R{}:
87
\Rfunction{scanBam} is a low-level function that returns a list of lists
88
and is not discussed further here. For details see ?\Rfunction{scanBam}
89
in the \Rpackage{Rsamtools} package.
91
\subsection{Single-end reads}
92
Single-end reads can be loaded with the \Rfunction{readGAlignments} function.
95
library(GenomicAlignments)
96
gal <- readGAlignments(un1)
99
Data subsets can be specified by genomic position, field names, or flag
100
criteria in the \Rcode{ScanBamParam}. Here we input records that overlap
101
position 1 to 5000 on the negative strand with \Rcode{flag} and
102
\Rcode{cigar} as metadata columns.
105
what <- c("flag", "cigar")
106
which <- GRanges("chr4", IRanges(1, 5000))
107
flag <- scanBamFlag(isMinusStrand = TRUE)
108
param <- ScanBamParam(which=which, what=what, flag=flag)
109
neg <- readGAlignments(un1, param=param)
113
Another approach to subsetting the data is to use \Rfunction{filterBam}.
114
This function creates a new BAM file of records passing user-defined
115
criteria. See ?\Rfunction{filterBam} for details.
117
\subsection{Paired-end reads}
119
Paired-end reads can be loaded with \Rfunction{readGAlignmentPairs}
120
or \Rfunction{readGAlignmentsList}. These functions use the same
121
mate paring algorithm but output different objects.
123
Let's start with \Rfunction{readGAlignmentPairs}:
125
<<readGAlignmentPairs>>=
126
un3 <- untreated3_chr4()
127
gapairs <- readGAlignmentPairs(un3)
130
The \Robject{GAlignmentPairs} class holds only pairs; reads with no
131
mate or with ambiguous pairing are discarded.
132
Each list element holds exactly 2 records (a mated pair). Records
133
can be accessed as the \Rcode{first} and\Rcode{last} segments in
134
a template or as \Rcode{left} and \Rcode{right} alignments.
135
See ?\Rfunction{GAlignmentPairs} for details.
141
For \Rcode{readGAlignmentsList}, mate pairing is performed when \Rcode{asMates}
142
is set to \Rcode{TRUE} on the \Rcode{BamFile} object, otherwise records are
143
treated as single-end.
145
<<readGAlignmentsList>>=
146
galist <- readGAlignmentsList(BamFile(un3, asMates=TRUE))
149
\Robject{GAlignmentsList} is a more general `list-like' structure
150
that holds mate pairs as well as non-mates (i.e., singletons, records
151
with unmapped mates etc.) A \Rcode{mates} metadata column (accessed
152
with \Rfunction{mcols}) indicates which records were paired and is set
153
on both the individual \Robject{GAlignments} and the outer list elements.
159
Non-mated reads are returned as groups by QNAME and contain any number
160
of records. Here the non-mate groups range in size from 1 to 9.
163
non_mates <- galist[unlist(mcols(galist)$mates) == FALSE]
164
table(elementLengths(non_mates))
167
\subsection{Iterating with \Rcode{yieldSize}}
169
Large files can be iterated through in chunks by setting a \Rcode{yieldSize}
170
on the \Rcode{BamFile}.
173
bf <- BamFile(un1, yieldSize=100000)
176
Iteration through a BAM file requires that the file be opened, repeatedly
177
queried inside a loop, then closed. Repeated calls to
178
\Rfunction{readGAlignments} without opening the file first result
179
in the same 100000 records returned each time.
181
<<readGAlignments_by_chunk>>=
185
chunk <- readGAlignments(bf)
186
if (length(chunk) == 0L)
188
chunk_cvg <- coverage(chunk)
192
cvg <- cvg + chunk_cvg
200
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
201
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
203
\section{How to prepare a table of read counts for RNA-Seq differential
206
Methods for RNA-Seq gene expression analysis generally require a table of
207
counts that summarize the number of reads that overlap or `hit' a
208
particular gene. In this section we count with \Rcode{summarizeOverlaps}
209
and create a count table from the results.
211
Other packages that provide read counting are \Biocpkg{Rsubread} and
212
\Biocpkg{easyRNASeq}. The \Biocpkg{parathyroidSE} package vignette
213
contains a workflow on counting and other common operations required for
214
differential expression analysis.
216
\subsection{Counting with \Rfunction{summarizeOverlaps}}
218
As sample data we use \Biocpkg{pasillaBamSubset} which contains both
219
a single-end BAM (untreated1\_chr4) and a paired-end BAM
220
(untreated3\_chr4). Each file is a subset of chr4 from the "Pasilla"
221
experiment. See ?\Biocpkg{pasillaBamSubset} for details.
223
<<load, results=hide>>=
224
library(pasillaBamSubset)
225
un1 <- untreated1_chr4() ## single-end records
228
\Rcode{summarizeOverlaps} requires the name of a BAM file(s) and an
229
annotation to count against. The annotation must match the genome
230
build the BAM records were aligned to. For the pasilla data this is
231
dm3 Dmelanogaster which is available as a \Bioconductor{} package.
232
Load the package and extract the exon ranges by gene.
235
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
236
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
239
\Rcode{summarizeOverlaps} automatically sets a \Rcode{yieldSize}
240
on large BAM files and iterates over them in chunks. When reading
241
paired-end data set the \Rcode{singleEnd} argument to FALSE.
242
See ?\Rfunction{summarizeOverlaps} for details reguarding the
243
count \Rcode{modes} and additional arguments.
246
library(GenomicAlignments)
247
se <- summarizeOverlaps(exbygene, un1, mode="IntersectionNotEmpty")
250
The return object is a \Rcode{SummarizedExperiment} with counts in
251
the \Rcode{assays} slot.
255
head(table(assays(se)$counts))
258
The count vector is the same length as the annotation.
261
identical(length(exbygene), length(assays(se)$counts))
264
The annotation is stored in the \Rcode{rowData} slot.
270
\subsection{Retrieving annotations from \Biocpkg{AnnotationHub}}
272
When the annotation is not available as a \Rcode{GRanges} or a
273
\Bioconductor{} package it may be available in \Rcode{AnnotationHub}.
274
Create a `hub' and filter on Drosophila melanogaster.
277
library(AnnotationHub)
278
hub <- AnnotationHub()
279
filters(hub) <- list(Species="Drosophila melanogaster")
282
There are 87 files that match Drosophila melanogaster.
289
Retrieve a dm3 file as a \Rcode{GRanges}.
292
gr <- hub$goldenpath.dm3.database.ensGene_0.0.1.RData
296
The metadata fields contain the details of file origin and content.
299
names(metadata(gr)[[2]])
300
metadata(gr)[[2]]$Tags
303
Split the GRanges by gene name to get a \Robject{GRangesList} of
310
Before performing overlap operations confirm that the seqlevels
311
(chromosome names) in the annotation match those in the BAM file. See
312
?\Rfunction{renameSeqlevels}, ?\Rfunction{keepSeqlevels} and
313
?\Rfunction{seqlevels} for examples of renaming seqlevels.
315
\subsection{Count tables}
317
Two popular packages for gene expression are \Biocpkg{DESeq} and
318
\Biocpkg{edgeR}. Tables of counts per gene are required for both and can be
319
easily created with a vector of counts. Here we use the counts from the
320
\Robject{SummarizedExperiment}.
324
deseq <- newCountDataSet(assays(se)$counts, rownames(colData(se)))
326
edger <- DGEList(assays(se)$counts, group=rownames(colData(se)))
330
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
331
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
333
\section{How to extract DNA sequences of gene regions}
335
\subsection{DNA sequences for intron and exon regions of a single gene}
337
DNA sequences for the introns and exons of a gene are essentially
338
the sequences for the introns and exons for all known transcripts of
339
a gene. The first task is to identify all transcripts associated
340
with the gene of interest. Our sample gene is the human TRAK2
341
which is involved in regulation of endosome-to-lysosome trafficking
342
of membrane cargo. The Entrez gene id is `66008'.
348
Load the UCSC `Known Gene' table annotation available as a
349
\Bioconductor{} package.
352
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
353
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
356
To get the transcripts associated with the trak2 gene we use the
357
\Rfunction{transcriptsBy} function from the \Biocpkg{GenomicFeatures}
358
package. This returns a \Robject{GRangesList} of all transcripts
359
grouped by gene. We are only interested in trak2 so we subset the
360
list on the trak2 gene id.
363
library(GenomicFeatures)
364
txbygene <- transcriptsBy(txdb, by="gene")[trak2]
368
The transcript names corresponding to the trak2 gene will be used to
369
subset the extracted intron and exon regions. The \Rcode{txbygene}
370
object is a \Robject{GRangesList} and the transcript names are a
371
metadata column on the individual \Robject{GRanges}. To extract the
372
names we must first `flatten' or unlist \Rcode{txbygene}.
375
tx_names <- mcols(unlist(txbygene))$tx_name
379
Intron and exon regions are extracted with \Rfunction{intronsByTranscript}
380
and \Rfunction{exonsBy}. The resulting \Robject{GRangesLists} are
381
subset on the trak2 transcript names.
383
Extract the intron regions ...
386
intronsbytx <- intronsByTranscript(txdb, use.names=TRUE)[tx_names]
387
elementLengths(intronsbytx)
390
and the exon regions.
393
exonsbytx <- exonsBy(txdb, "tx", use.names=TRUE)[tx_names]
394
elementLengths(exonsbytx)
397
Next we want the DNA sequences for these intron and exon regions.
398
The \Rfunction{extractTranscriptSeqs} function in the
399
\Biocpkg{Biostrings} package will query a \Biocpkg{BSGenome} package
400
with a set of genomic positions and retrieve the DNA sequences.
404
library(BSgenome.Hsapiens.UCSC.hg19)
407
Extract the intron sequences ...
410
intron_seqs <- extractTranscriptSeqs(Hsapiens, intronsbytx)
414
and the exon sequences.
417
exon_seqs <- extractTranscriptSeqs(Hsapiens, exonsbytx)
422
\subsection{DNA sequences for coding and UTR regions of genes associated
423
with colorectal cancer}
425
In this section we extract the coding and UTR sequences of genes involved
426
in colorectal cancer. The workflow extends the ideas presented in the single
427
gene example and suggests an approach to identify disease-related genes.
429
\subsubsection{Build a gene list}
431
We start with a list of gene or transcript ids. If you do not have
432
pre-defined list one can be created with the \Biocpkg{KEGG.db} and
433
\Biocpkg{KEGGgraph} packages. Updates to the data in the \Biocpkg{KEGG.db}
434
package are no longer available, however, the resource is still useful for
435
identifying pathway names and ids.
437
Create a table of KEGG pathways and ids and search on the term `cancer'.
441
pathways <- toTable(KEGGPATHNAME2ID)
442
pathways[grepl("cancer", pathways$path_name, fixed=TRUE),]
445
Use the "05210" id to query the KEGG web resource (accesses the currently
451
retrieveKGML("05200", "hsa", dest, "internal")
454
The suffix of the KEGG id is the Entrez gene id. The
455
\Rfunction{translateKEGGID2GeneID} simply removes the prefix leaving
456
just the Entrez gene ids.
459
crids <- as.character(parseKGML2DataFrame(dest)[,1])
460
crgenes <- unique(translateKEGGID2GeneID(crids))
464
\subsubsection{Identify genomic coordinates}
466
The list of gene ids is used to extract genomic positions of the regions
467
of interest. The Known Gene table from UCSC will be the annotation and
468
is available as a \Bioconductor{} package.
471
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
472
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
475
If an annotation is not available as a \Bioconductor{} annotation package
476
it may be available in \Biocpkg{AnnotationHub}. Additionally, there are
477
functions in \Biocpkg{GenomicFeatures} which can retrieve data from UCSC and
478
Ensembl to create a \Robject{TranscriptDb}. See
479
?\Rcode{makeTranscriptDbFromUCSC} for details.
481
As in the single gene example we need to identify the transcripts
482
corresponding to each gene. The transcript id (or name) is used
483
to isolate the UTR and coding regions of interest. This grouping of
484
transcript by gene is also used to re-group the final sequence results.
486
The \Rcode{transcriptsBy} function outputs both the gene and transcript
487
identifiers which we use to create a map between the two. The
488
\Rcode{map} is a \Robject{CharacterList} with gene ids as names and
489
transcript ids as the list elements.
492
txbygene <- transcriptsBy(txdb, "gene")[crgenes] ## subset on colorectal genes
493
map <- relist(unlist(txbygene, use.names=FALSE)$tx_id, txbygene)
497
Extract the UTR and coding regions.
500
cds <- cdsBy(txdb, "tx")
501
threeUTR <- threeUTRsByTranscript(txdb)
502
fiveUTR <- fiveUTRsByTranscript(txdb)
505
Coding and UTR regions may not be present for all transcripts specified
506
in \Rcode{map}. Consequently, the subset results will not be the same
507
length. This length discrepancy must be taken into account when re-listing
508
the final results by gene.
511
txid <- unlist(map, use.names=FALSE)
512
cds <- cds[names(cds) %in% txid]
513
threeUTR <- threeUTR[names(threeUTR) %in% txid]
514
fiveUTR <- fiveUTR[names(fiveUTR) %in% txid]
517
Note the different lengths of the subset regions.
520
length(txid) ## all possible transcripts
526
These objects are \Robject{GRangesList}s with the transcript id as the
533
\subsubsection{Extract sequences from BSgenome}
535
The \Rcode{BSgenome} packages contain complete genome sequences
536
for a given organism.
538
Load the \Rcode{BSgenome} package for homo sapiens.
541
library(BSgenome.Hsapiens.UCSC.hg19)
542
genome <- BSgenome.Hsapiens.UCSC.hg19
545
Use \Rfunction{extractTranscriptSeqs} to extract the UTR and coding
546
regions from the \Rcode{BSGenome}. This function retrieves the sequences
547
for an any \Robject{GRanges} or \Robject{GRangesList} (i.e., not just
548
transcripts like the name implies).
551
threeUTR_seqs <- extractTranscriptSeqs(genome, threeUTR)
552
fiveUTR_seqs <- extractTranscriptSeqs(genome, fiveUTR)
553
cds_seqs <- extractTranscriptSeqs(genome, cds)
556
The return values are \Robject{DNAStringSet} objects.
562
Our final step is to collect the coding and UTR regions (currently
563
organzied by transcript) into groups by gene id. The \Rfunction{relist}
564
function groups the sequences of a \Robject{DNAStringSet} object into
565
a \Robject{DNAStringSetList} object, based on the specified \Rcode{skeleton}
566
argument. The \Rcode{skeleton} must be a list-like object and only its shape
567
(i.e. its element lengths) matters (its exact content is ignored). A simple
568
form of \Rcode{skeleton} is to use a partitioning object that we make by
569
specifying the size of each partition. The partitioning objects are different
570
for each type of region because not all transcripts had a coding or 3' or 5'
574
lst3 <- relist(threeUTR_seqs, PartitioningByWidth(sum(map %in% names(threeUTR))))
575
lst5 <- relist(fiveUTR_seqs, PartitioningByWidth(sum(map %in% names(fiveUTR))))
576
lstc <- relist(cds_seqs, PartitioningByWidth(sum(map %in% names(cds))))
579
There are 239 genes in \Rcode{map} each of which have 1 or more transcripts.
580
The table of element lengths shows how many genes have each number of
581
transcripts. For example, 47 genes have 1 transcript, 48 genes have 2 etc.
585
table(elementLengths(map))
588
The lists of DNA sequences all have the same length as \Rcode{map} but one or
589
more of the element lengths may be zero. This would indicate that data were
590
not available for that gene. The tables below show that there was at least
591
1 coding region available for all genes (i.e., none of the element lengths
592
are 0). However, both the 3' and 5' UTR results have element lengths of 0
593
which indicates no UTR data were available for that gene.
596
table(elementLengths(lstc))
597
table(elementLengths(lst3))
598
names(lst3)[elementLengths(lst3) == 0L] ## genes with no 3' UTR data
599
table(elementLengths(lst5))
600
names(lst5)[elementLengths(lst5) == 0L] ## genes with no 5' UTR data
604
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
605
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
607
\section{How to create DNA consensus sequences for read group `families'}
609
The motivation for this HOWTO comes from a study which explored the
610
dynamics of point mutations. The mutations of interest exist with a range
611
of frequencies in the control group (e.g., 0.1\% - 50\%). PCR and sequencing
612
error rates make it difficult to identify low frequency events
615
When a library is prepared with Nextera, random fragments are generated
616
followed by a few rounds of PCR. When the genome is large enough, reads
617
aligning to the same start position are likely descendant from the same
618
template fragment and should have identical sequences.
620
The goal is to elimininate noise by grouping the reads by common start
621
position and discarding those that do not exceed a certain threshold within
622
each family. A new consensus sequence will be created for each read group
625
\subsection{Sort reads into groups by start position}
627
Load the BAM file into a GAlignments object.
631
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
632
param <- ScanBamParam(what=c("seq", "qual"))
633
gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param)
636
Use the \Rfunction{sequenceLayer} function to {\it lay} the query sequences
637
and quality strings on the reference.
640
qseq <- setNames(mcols(gal)$seq, names(gal))
641
qual <- setNames(mcols(gal)$qual, names(gal))
642
qseq_on_ref <- sequenceLayer(qseq, cigar(gal),
643
from="query", to="reference")
644
qual_on_ref <- sequenceLayer(qual, cigar(gal),
645
from="query", to="reference")
651
qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
652
qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal))
653
pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
656
For each chromosome generate one GRanges object that contains
657
unique alignment start positions and attach 3 metadata columns
658
to it: the number of reads, the query sequences, and the quality
662
gr_by_chrom <- lapply(seqlevels(gal),
665
qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]]
666
qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]]
667
pos2 <- pos_by_chrom[[seqname]]
668
qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2)
669
qual_on_ref_per_pos <- split(qual_on_ref2, pos2)
670
nread <- elementLengths(qseq_on_ref_per_pos)
671
gr_mcols <- DataFrame(nread=unname(nread),
672
qseq_on_ref=unname(qseq_on_ref_per_pos),
673
qual_on_ref=unname(qual_on_ref_per_pos))
674
gr <- GRanges(Rle(seqname, nrow(gr_mcols)),
675
IRanges(as.integer(names(nread)), width=1))
676
mcols(gr) <- gr_mcols
677
seqlevels(gr) <- seqlevels(gal)
682
Combine all the GRanges objects obtained in (4) in 1 big GRanges
686
gr <- do.call(c, gr_by_chrom)
687
seqinfo(gr) <- seqinfo(gal)
690
`gr' is a GRanges object that contains unique alignment start positions:
696
Look at qseq\_on\_ref and qual\_on\_ref.
703
2 reads align to start position 13. Let's have a close look at their
707
mcols(gr)$qseq_on_ref[[6]]
713
mcols(gr)$qual_on_ref[[6]]
716
Note that the sequence and quality strings are those projected to the
717
reference so the first letter in those strings are on top of start
718
position 13, the 2nd letter on top of position 14, etc...
720
\subsection{Remove low frequency reads}
722
For each start position, remove reads with and under-represented sequence
723
(e.g. threshold = 20\% for the data used here which is low coverage).
724
A unique number is assigned to each unique sequence. This will make
725
future calculations easier and a little bit faster.
728
qseq_on_ref <- mcols(gr)$qseq_on_ref
729
tmp <- unlist(qseq_on_ref, use.names=FALSE)
730
qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref)
733
Quick look at `qseq\_on\_ref\_id':
734
It's an IntegerList object with the same length and "shape"
741
Remove the under represented ids from each list element of `qseq\_on\_ref\_id':
744
qseq_on_ref_id2 <- endoapply(qseq_on_ref_id,
745
function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)])
748
Remove corresponding sequences from `qseq\_on\_ref':
751
tmp <- unlist(qseq_on_ref_id2, use.names=FALSE)
752
qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp],
756
\subsection{Create a consensus sequence for each read group family}
758
Compute 1 consensus matrix per chromosome:
761
split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2))
762
qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE)
763
qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor)
764
qseq_pos_by_chrom <- splitAsList(start(gr), split_factor)
766
cm_by_chrom <- lapply(names(qseq_pos_by_chrom),
768
consensusMatrix(qseq_on_ref2_by_chrom[[seqname]],
770
shift=qseq_pos_by_chrom[[seqname]]-1,
771
width=seqlengths(gr)[[seqname]]))
772
names(cm_by_chrom) <- names(qseq_pos_by_chrom)
775
'cm\_by\_chrom' is a list of consensus matrices. Each matrix has 17 rows
776
(1 per letter in the DNA alphabet) and 1 column per chromosome position.
779
lapply(cm_by_chrom, dim)
782
Compute the consensus string from each consensus matrix. We'll put "+"
783
in the strings wherever there is no coverage for that position, and "N"
784
where there is coverage but no consensus.
787
cs_by_chrom <- lapply(cm_by_chrom,
789
## need to "fix" 'cm' because consensusString()
790
## doesn't like consensus matrices with columns
791
## that contain only zeroes (e.g., chromosome
792
## positions with no coverage)
793
idx <- colSums(cm) == 0L
795
DNAString(consensusString(cm, ambiguityMap="N"))
799
The new consensus strings.
806
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
807
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
809
\section{How to compute binned averages along a genome}
811
In some applications, there is the need to compute the average of a
812
variable along a genome for a set of predefined fixed-width regions
813
(sometimes called "bins"). One such example is coverage.
814
Coverage is an \Robject{RleList} with one list element per
815
chromosome. Here we simulate a coverage list.
818
library(BSgenome.Scerevisiae.UCSC.sacCer2)
821
lapply(seqlengths(Scerevisiae),
822
function(len) Rle(sample(-10:10, len, replace=TRUE))),
827
Use the \Rfunction{tileGenome} function to create a set of bins along
831
bins1 <- tileGenome(seqinfo(Scerevisiae), tilewidth=100,
832
cut.last.tile.in.chrom=TRUE)
835
We define the following function to compute the binned average of a
836
numerical variable defined along a genome.
840
'bins': a GRanges object representing the genomic bins.
841
Typically obtained by calling tileGenome() with
842
'cut.last.tile.in.chrom=TRUE'.
843
'numvar': a named RleList object representing a numerical
844
variable defined along the genome covered by 'bins', which
845
is the genome described by 'seqinfo(bins)'.
846
'mcolname': the name to give to the metadata column that will
847
contain the binned average in the returned object.
850
The function returns `bins' with an additional metadata column named
851
`mcolname' containing the binned average.
854
binnedAverage <- function(bins, numvar, mcolname)
856
stopifnot(is(bins, "GRanges"))
857
stopifnot(is(numvar, "RleList"))
858
stopifnot(identical(seqlevels(bins), names(numvar)))
859
bins_per_chrom <- split(ranges(bins), seqnames(bins))
860
means_list <- lapply(names(numvar),
862
views <- Views(numvar[[seqname]],
863
bins_per_chrom[[seqname]])
866
new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))
867
mcols(bins)[[mcolname]] <- new_mcol
872
Compute the binned average for `cov':
875
bins1 <- binnedAverage(bins1, cov, "binned_cov")
879
The bin size can be modified with the \Rcode{tilewidth} argument
880
to \Rfunction{tileGenome}. For additional examples see
881
?\Rfunction{tileGenome}.
883
\section{Session Information}
885
<<SessionInfo, echo=FALSE>>=
889
\bibliography{GenomicRanges}
890
\bibliographystyle{plainnat}