~ubuntu-branches/ubuntu/wily/r-bioc-genomicranges/wily-proposed

%\VignetteDepends{GenomicRanges, Rsamtools, pasillaBamSubset, TxDb.Dmelanogaster.UCSC.dm3.ensGene, AnnotationHub, DESeq, edgeR, TxDb.Hsapiens.UCSC.hg19.knownGene, GenomicFeatures, Biostrings, BSgenome.Hsapiens.UCSC.hg19, KEGG.db, KEGGgraph, BSgenome.Scerevisiae.UCSC.sacCer2}

%\VignetteKeywords{sequence, sequencing, alignments}

%\VignettePackage{GenomicRanges}

\documentclass{article}

\usepackage[authoryear,round]{natbib}

<<style, eval=TRUE, echo=FALSE, results=tex>>=

BiocStyle::latex(use.unsrturl=FALSE)

\title{\Biocpkg{GenomicRanges} HOWTOs}

\author{Bioconductor Team}

\date{Edited: October 2013; Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

<<options, echo=FALSE>>=

options(width=72)

options("showHeadLines" = 3)

options("showTailLines" = 3)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

This vignette is a collection of {\it HOWTOs}. Each {\it HOWTO} is

a short section that demonstrates how to use the containers and

operations implemented in the \Biocpkg{GenomicRanges} and related

packages (\Biocpkg{IRanges}, \Biocpkg{GenomicFeatures},

\Biocpkg{Rsamtools}, and \Biocpkg{Biostrings}) to perform a task

typically found in the context of a high throughput sequence analysis.

The HOWTOs are self contained, independent of each other, and can be

studied and reproduced in any order.

We assume the reader has some previous experience with \R{} and

with basic manipulation of \Rcode{GRanges}, \Rcode{GRangesList}, \Rcode{Rle},

\Rcode{RleList}, and \Rcode{DataFrame} objects. See the ``An Introduction

to Genomic Ranges Classes'' vignette located in the \Biocpkg{GenomicRanges}

package (in the same folder as this document) for an introduction to these

containers.

Additional recommended readings after this vignette are the ``Software for

Computing and Annotating Genomic Ranges'' paper[\citet{Lawrence2013ranges}]

and the ``Counting reads with \Rfunction{summarizeOverlaps}'' vignette

located in the \Biocpkg{GenomicAlignments} package.

To display the list of vignettes available in the \Biocpkg{GenomicRanges},

use \Rcode{browseVignettes("GenomicRanges")}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{How to read BAM files into \R{}}

As sample data we use the \Biocpkg{pasillaBamSubset} data package which

contains both a BAM file with single-end reads (untreated1\_chr4) and

a BAM file with paired-end reads (untreated3\_chr4). Each file is a subset

of chr4 from the "Pasilla" experiment. See ?\Biocpkg{pasillaBamSubset}

for details.

<<load, results=hide>>=

library(pasillaBamSubset)

un1 <- untreated1_chr4() ## single-end reads

Several functions are available for reading BAM files into \R{}:

\begin{verbatim}

scanBam()

readGAlignments()

readGAlignmentPairs()

readGAlignmentsList()

\end{verbatim}

\Rfunction{scanBam} is a low-level function that returns a list of lists

and is not discussed further here. For details see ?\Rfunction{scanBam}

in the \Rpackage{Rsamtools} package.

\subsection{Single-end reads}

Single-end reads can be loaded with the \Rfunction{readGAlignments} function.

<<readGAlignments>>=

library(GenomicAlignments)

gal <- readGAlignments(un1)

Data subsets can be specified by genomic position, field names, or flag

100

criteria in the \Rcode{ScanBamParam}. Here we input records that overlap

101

position 1 to 5000 on the negative strand with \Rcode{flag} and

102

\Rcode{cigar} as metadata columns.

103

104

<<ScanBamParam>>=

105

what <- c("flag", "cigar")

106

which <- GRanges("chr4", IRanges(1, 5000))

107

flag <- scanBamFlag(isMinusStrand = TRUE)

108

param <- ScanBamParam(which=which, what=what, flag=flag)

109

neg <- readGAlignments(un1, param=param)

110

neg

111

112

113

Another approach to subsetting the data is to use \Rfunction{filterBam}.

114

This function creates a new BAM file of records passing user-defined

115

criteria. See ?\Rfunction{filterBam} for details.

116

117

\subsection{Paired-end reads}

118

119

Paired-end reads can be loaded with \Rfunction{readGAlignmentPairs}

120

or \Rfunction{readGAlignmentsList}. These functions use the same

121

mate paring algorithm but output different objects.

122

123

Let's start with \Rfunction{readGAlignmentPairs}:

124

125

<<readGAlignmentPairs>>=

126

un3 <- untreated3_chr4()

127

gapairs <- readGAlignmentPairs(un3)

128

129

130

The \Robject{GAlignmentPairs} class holds only pairs; reads with no

131

mate or with ambiguous pairing are discarded.

132

Each list element holds exactly 2 records (a mated pair). Records

133

can be accessed as the \Rcode{first} and\Rcode{last} segments in

134

a template or as \Rcode{left} and \Rcode{right} alignments.

135

See ?\Rfunction{GAlignmentPairs} for details.

136

137

<<gapairs>>=

138

gapairs

139

140

141

For \Rcode{readGAlignmentsList}, mate pairing is performed when \Rcode{asMates}

142

is set to \Rcode{TRUE} on the \Rcode{BamFile} object, otherwise records are

143

treated as single-end.

144

145

<<readGAlignmentsList>>=

146

galist <- readGAlignmentsList(BamFile(un3, asMates=TRUE))

147

148

149

\Robject{GAlignmentsList} is a more general `list-like' structure

150

that holds mate pairs as well as non-mates (i.e., singletons, records

151

with unmapped mates etc.) A \Rcode{mates} metadata column (accessed

152

with \Rfunction{mcols}) indicates which records were paired and is set

153

on both the individual \Robject{GAlignments} and the outer list elements.

154

155

<<galist>>=

156

galist

157

158

159

Non-mated reads are returned as groups by QNAME and contain any number

160

of records. Here the non-mate groups range in size from 1 to 9.

161

162

<<non_mates>>=

163

non_mates <- galist[unlist(mcols(galist)$mates) == FALSE]

164

table(elementLengths(non_mates))

165

166

167

\subsection{Iterating with \Rcode{yieldSize}}

168

169

Large files can be iterated through in chunks by setting a \Rcode{yieldSize}

170

on the \Rcode{BamFile}.

171

172

<<yieldSize>>=

173

bf <- BamFile(un1, yieldSize=100000)

174

175

176

Iteration through a BAM file requires that the file be opened, repeatedly

177

queried inside a loop, then closed. Repeated calls to

178

\Rfunction{readGAlignments} without opening the file first result

179

in the same 100000 records returned each time.

180

181

<<readGAlignments_by_chunk>>=

182

open(bf)

183

cvg <- NULL

184

repeat {

185

chunk <- readGAlignments(bf)

186

if (length(chunk) == 0L)

187

break

188

chunk_cvg <- coverage(chunk)

189

if (is.null(cvg)) {

190

cvg <- chunk_cvg

191

} else {

192

cvg <- cvg + chunk_cvg

193

}

194

}

195

close(bf)

196

cvg

197

198

199

200

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

201

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

202

203

\section{How to prepare a table of read counts for RNA-Seq differential

204

gene expression}

205

206

Methods for RNA-Seq gene expression analysis generally require a table of

207

counts that summarize the number of reads that overlap or `hit' a

208

particular gene. In this section we count with \Rcode{summarizeOverlaps}

209

and create a count table from the results.

210

211

Other packages that provide read counting are \Biocpkg{Rsubread} and

212

\Biocpkg{easyRNASeq}. The \Biocpkg{parathyroidSE} package vignette

213

contains a workflow on counting and other common operations required for

214

differential expression analysis.

215

216

\subsection{Counting with \Rfunction{summarizeOverlaps}}

217

218

As sample data we use \Biocpkg{pasillaBamSubset} which contains both

219

a single-end BAM (untreated1\_chr4) and a paired-end BAM

220

(untreated3\_chr4). Each file is a subset of chr4 from the "Pasilla"

221

experiment. See ?\Biocpkg{pasillaBamSubset} for details.

222

223

<<load, results=hide>>=

224

library(pasillaBamSubset)

225

un1 <- untreated1_chr4() ## single-end records

226

227

228

\Rcode{summarizeOverlaps} requires the name of a BAM file(s) and an

229

annotation to count against. The annotation must match the genome

230

build the BAM records were aligned to. For the pasilla data this is

231

dm3 Dmelanogaster which is available as a \Bioconductor{} package.

232

Load the package and extract the exon ranges by gene.

233

234

<<count_1>>=

235

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

236

exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")

237

238

239

\Rcode{summarizeOverlaps} automatically sets a \Rcode{yieldSize}

240

on large BAM files and iterates over them in chunks. When reading

241

paired-end data set the \Rcode{singleEnd} argument to FALSE.

242

See ?\Rfunction{summarizeOverlaps} for details reguarding the

243

count \Rcode{modes} and additional arguments.

244

245

<<count_2>>=

246

library(GenomicAlignments)

247

se <- summarizeOverlaps(exbygene, un1, mode="IntersectionNotEmpty")

248

249

250

The return object is a \Rcode{SummarizedExperiment} with counts in

251

the \Rcode{assays} slot.

252

253

<<count_3>>=

254

class(se)

255

head(table(assays(se)$counts))

256

257

258

The count vector is the same length as the annotation.

259

260

<<count_4>>=

261

identical(length(exbygene), length(assays(se)$counts))

262

263

264

The annotation is stored in the \Rcode{rowData} slot.

265

266

<<count_5>>=

267

rowData(se)

268

269

270

\subsection{Retrieving annotations from \Biocpkg{AnnotationHub}}

271

272

When the annotation is not available as a \Rcode{GRanges} or a

273

\Bioconductor{} package it may be available in \Rcode{AnnotationHub}.

274

Create a `hub' and filter on Drosophila melanogaster.

275

276

<<hub_1>>=

277

library(AnnotationHub)

278

hub <- AnnotationHub()

279

filters(hub) <- list(Species="Drosophila melanogaster")

280

281

282

There are 87 files that match Drosophila melanogaster.

283

284

<<hub_2>>=

285

length(hub)

286

head(names(hub))

287

288

289

Retrieve a dm3 file as a \Rcode{GRanges}.

290

291

<<hub_3>>=

292

gr <- hub$goldenpath.dm3.database.ensGene_0.0.1.RData

293

summary(gr)

294

295

296

The metadata fields contain the details of file origin and content.

297

298

<<hub_4>>=

299

names(metadata(gr)[[2]])

300

metadata(gr)[[2]]$Tags

301

302

303

Split the GRanges by gene name to get a \Robject{GRangesList} of

304

transcripts by gene.

305

306

<<hub_5>>=

307

split(gr, gr$name)

308

309

310

Before performing overlap operations confirm that the seqlevels

311

(chromosome names) in the annotation match those in the BAM file. See

312

?\Rfunction{renameSeqlevels}, ?\Rfunction{keepSeqlevels} and

313

?\Rfunction{seqlevels} for examples of renaming seqlevels.

314

315

\subsection{Count tables}

316

317

Two popular packages for gene expression are \Biocpkg{DESeq} and

318

\Biocpkg{edgeR}. Tables of counts per gene are required for both and can be

319

easily created with a vector of counts. Here we use the counts from the

320

\Robject{SummarizedExperiment}.

321

322

<<count_table>>=

323

library(DESeq)

324

deseq <- newCountDataSet(assays(se)$counts, rownames(colData(se)))

325

library(edgeR)

326

edger <- DGEList(assays(se)$counts, group=rownames(colData(se)))

327

328

329

330

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

331

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

332

333

\section{How to extract DNA sequences of gene regions}

334

335

\subsection{DNA sequences for intron and exon regions of a single gene}

336

337

DNA sequences for the introns and exons of a gene are essentially

338

the sequences for the introns and exons for all known transcripts of

339

a gene. The first task is to identify all transcripts associated

340

with the gene of interest. Our sample gene is the human TRAK2

341

which is involved in regulation of endosome-to-lysosome trafficking

342

of membrane cargo. The Entrez gene id is `66008'.

343

344

<<trak_1>>=

345

trak2 <- "66008"

346

347

348

Load the UCSC `Known Gene' table annotation available as a

349

\Bioconductor{} package.

350

351

<<trak_2>>=

352

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

353

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

354

355

356

To get the transcripts associated with the trak2 gene we use the

357

\Rfunction{transcriptsBy} function from the \Biocpkg{GenomicFeatures}

358

package. This returns a \Robject{GRangesList} of all transcripts

359

grouped by gene. We are only interested in trak2 so we subset the

360

list on the trak2 gene id.

361

362

<<trak_3>>=

363

library(GenomicFeatures)

364

txbygene <- transcriptsBy(txdb, by="gene")[trak2]

365

txbygene

366

367

368

The transcript names corresponding to the trak2 gene will be used to

369

subset the extracted intron and exon regions. The \Rcode{txbygene}

370

object is a \Robject{GRangesList} and the transcript names are a

371

metadata column on the individual \Robject{GRanges}. To extract the

372

names we must first `flatten' or unlist \Rcode{txbygene}.

373

374

<<trak_4>>=

375

tx_names <- mcols(unlist(txbygene))$tx_name

376

tx_names

377

378

379

Intron and exon regions are extracted with \Rfunction{intronsByTranscript}

380

and \Rfunction{exonsBy}. The resulting \Robject{GRangesLists} are

381

subset on the trak2 transcript names.

382

383

Extract the intron regions ...

384

385

<<trak_5>>=

386

intronsbytx <- intronsByTranscript(txdb, use.names=TRUE)[tx_names]

387

elementLengths(intronsbytx)

388

389

390

and the exon regions.

391

392

<<trak_7>>=

393

exonsbytx <- exonsBy(txdb, "tx", use.names=TRUE)[tx_names]

394

elementLengths(exonsbytx)

395

396

397

Next we want the DNA sequences for these intron and exon regions.

398

The \Rfunction{extractTranscriptSeqs} function in the

399

\Biocpkg{Biostrings} package will query a \Biocpkg{BSGenome} package

400

with a set of genomic positions and retrieve the DNA sequences.

401

402

<<trak_8>>=

403

library(Biostrings)

404

library(BSgenome.Hsapiens.UCSC.hg19)

405

406

407

Extract the intron sequences ...

408

409

<<trak_9>>=

410

intron_seqs <- extractTranscriptSeqs(Hsapiens, intronsbytx)

411

intron_seqs

412

413

414

and the exon sequences.

415

416

<<trak_10>>=

417

exon_seqs <- extractTranscriptSeqs(Hsapiens, exonsbytx)

418

exon_seqs

419

420

421

422

\subsection{DNA sequences for coding and UTR regions of genes associated

423

with colorectal cancer}

424

425

In this section we extract the coding and UTR sequences of genes involved

426

in colorectal cancer. The workflow extends the ideas presented in the single

427

gene example and suggests an approach to identify disease-related genes.

428

429

\subsubsection{Build a gene list}

430

431

We start with a list of gene or transcript ids. If you do not have

432

pre-defined list one can be created with the \Biocpkg{KEGG.db} and

433

\Biocpkg{KEGGgraph} packages. Updates to the data in the \Biocpkg{KEGG.db}

434

package are no longer available, however, the resource is still useful for

435

identifying pathway names and ids.

436

437

Create a table of KEGG pathways and ids and search on the term `cancer'.

438

439

<<cancer_1>>=

440

library(KEGG.db)

441

pathways <- toTable(KEGGPATHNAME2ID)

442

pathways[grepl("cancer", pathways$path_name, fixed=TRUE),]

443

444

445

Use the "05210" id to query the KEGG web resource (accesses the currently

446

maintained data).

447

448

<<cancer_2>>=

449

library(KEGGgraph)

450

dest <- tempfile()

451

retrieveKGML("05200", "hsa", dest, "internal")

452

453

454

The suffix of the KEGG id is the Entrez gene id. The

455

\Rfunction{translateKEGGID2GeneID} simply removes the prefix leaving

456

just the Entrez gene ids.

457

458

<<cancer_3>>=

459

crids <- as.character(parseKGML2DataFrame(dest)[,1])

460

crgenes <- unique(translateKEGGID2GeneID(crids))

461

head(crgenes)

462

463

464

\subsubsection{Identify genomic coordinates}

465

466

The list of gene ids is used to extract genomic positions of the regions

467

of interest. The Known Gene table from UCSC will be the annotation and

468

is available as a \Bioconductor{} package.

469

470

<<cancer_4>>=

471

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

472

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

473

474

475

If an annotation is not available as a \Bioconductor{} annotation package

476

it may be available in \Biocpkg{AnnotationHub}. Additionally, there are

477

functions in \Biocpkg{GenomicFeatures} which can retrieve data from UCSC and

478

Ensembl to create a \Robject{TranscriptDb}. See

479

?\Rcode{makeTranscriptDbFromUCSC} for details.

480

481

As in the single gene example we need to identify the transcripts

482

corresponding to each gene. The transcript id (or name) is used

483

to isolate the UTR and coding regions of interest. This grouping of

484

transcript by gene is also used to re-group the final sequence results.

485

486

The \Rcode{transcriptsBy} function outputs both the gene and transcript

487

identifiers which we use to create a map between the two. The

488

\Rcode{map} is a \Robject{CharacterList} with gene ids as names and

489

transcript ids as the list elements.

490

491

<<cancer_5>>=

492

txbygene <- transcriptsBy(txdb, "gene")[crgenes] ## subset on colorectal genes

493

map <- relist(unlist(txbygene, use.names=FALSE)$tx_id, txbygene)

494

map

495

496

497

Extract the UTR and coding regions.

498

499

<<cancer_6>>=

500

cds <- cdsBy(txdb, "tx")

501

threeUTR <- threeUTRsByTranscript(txdb)

502

fiveUTR <- fiveUTRsByTranscript(txdb)

503

504

505

Coding and UTR regions may not be present for all transcripts specified

506

in \Rcode{map}. Consequently, the subset results will not be the same

507

length. This length discrepancy must be taken into account when re-listing

508

the final results by gene.

509

510

<<cancer_7>>=

511

txid <- unlist(map, use.names=FALSE)

512

cds <- cds[names(cds) %in% txid]

513

threeUTR <- threeUTR[names(threeUTR) %in% txid]

514

fiveUTR <- fiveUTR[names(fiveUTR) %in% txid]

515

516

517

Note the different lengths of the subset regions.

518

519

<<cancer_8>>=

520

length(txid) ## all possible transcripts

521

length(cds)

522

length(threeUTR)

523

length(fiveUTR)

524

525

526

These objects are \Robject{GRangesList}s with the transcript id as the

527

outer list element.

528

529

<<cancer_9>>=

530

cds

531

532

533

\subsubsection{Extract sequences from BSgenome}

534

535

The \Rcode{BSgenome} packages contain complete genome sequences

536

for a given organism.

537

538

Load the \Rcode{BSgenome} package for homo sapiens.

539

540

<<cancer_10>>=

541

library(BSgenome.Hsapiens.UCSC.hg19)

542

genome <- BSgenome.Hsapiens.UCSC.hg19

543

544

545

Use \Rfunction{extractTranscriptSeqs} to extract the UTR and coding

546

regions from the \Rcode{BSGenome}. This function retrieves the sequences

547

for an any \Robject{GRanges} or \Robject{GRangesList} (i.e., not just

548

transcripts like the name implies).

549

550

<<cancer_11>>=

551

threeUTR_seqs <- extractTranscriptSeqs(genome, threeUTR)

552

fiveUTR_seqs <- extractTranscriptSeqs(genome, fiveUTR)

553

cds_seqs <- extractTranscriptSeqs(genome, cds)

554

555

556

The return values are \Robject{DNAStringSet} objects.

557

558

<<cancer_12>>=

559

cds_seqs

560

561

562

Our final step is to collect the coding and UTR regions (currently

563

organzied by transcript) into groups by gene id. The \Rfunction{relist}

564

function groups the sequences of a \Robject{DNAStringSet} object into

565

a \Robject{DNAStringSetList} object, based on the specified \Rcode{skeleton}

566

argument. The \Rcode{skeleton} must be a list-like object and only its shape

567

(i.e. its element lengths) matters (its exact content is ignored). A simple

568

form of \Rcode{skeleton} is to use a partitioning object that we make by

569

specifying the size of each partition. The partitioning objects are different

570

for each type of region because not all transcripts had a coding or 3' or 5'

571

UTR region defined.

572

573

<<cancer_13>>=

574

lst3 <- relist(threeUTR_seqs, PartitioningByWidth(sum(map %in% names(threeUTR))))

575

lst5 <- relist(fiveUTR_seqs, PartitioningByWidth(sum(map %in% names(fiveUTR))))

576

lstc <- relist(cds_seqs, PartitioningByWidth(sum(map %in% names(cds))))

577

578

579

There are 239 genes in \Rcode{map} each of which have 1 or more transcripts.

580

The table of element lengths shows how many genes have each number of

581

transcripts. For example, 47 genes have 1 transcript, 48 genes have 2 etc.

582

583

<<cancer_14>>=

584

length(map)

585

table(elementLengths(map))

586

587

588

The lists of DNA sequences all have the same length as \Rcode{map} but one or

589

more of the element lengths may be zero. This would indicate that data were

590

not available for that gene. The tables below show that there was at least

591

1 coding region available for all genes (i.e., none of the element lengths

592

are 0). However, both the 3' and 5' UTR results have element lengths of 0

593

which indicates no UTR data were available for that gene.

594

595

<<cancer_15>>=

596

table(elementLengths(lstc))

597

table(elementLengths(lst3))

598

names(lst3)[elementLengths(lst3) == 0L] ## genes with no 3' UTR data

599

table(elementLengths(lst5))

600

names(lst5)[elementLengths(lst5) == 0L] ## genes with no 5' UTR data

601

602

603

604

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

605

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

606

607

\section{How to create DNA consensus sequences for read group `families'}

608

609

The motivation for this HOWTO comes from a study which explored the

610

dynamics of point mutations. The mutations of interest exist with a range

611

of frequencies in the control group (e.g., 0.1\% - 50\%). PCR and sequencing

612

error rates make it difficult to identify low frequency events

613

(e.g., < 20\%).

614

615

When a library is prepared with Nextera, random fragments are generated

616

followed by a few rounds of PCR. When the genome is large enough, reads

617

aligning to the same start position are likely descendant from the same

618

template fragment and should have identical sequences.

619

620

The goal is to elimininate noise by grouping the reads by common start

621

position and discarding those that do not exceed a certain threshold within

622

each family. A new consensus sequence will be created for each read group

623

family.

624

625

\subsection{Sort reads into groups by start position}

626

627

Load the BAM file into a GAlignments object.

628

629

<<cseq_1>>=

630

library(Rsamtools)

631

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")

632

param <- ScanBamParam(what=c("seq", "qual"))

633

gal <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param)

634

635

636

Use the \Rfunction{sequenceLayer} function to {\it lay} the query sequences

637

and quality strings on the reference.

638

639

<<cseq_2>>=

640

qseq <- setNames(mcols(gal)$seq, names(gal))

641

qual <- setNames(mcols(gal)$qual, names(gal))

642

qseq_on_ref <- sequenceLayer(qseq, cigar(gal),

643

from="query", to="reference")

644

qual_on_ref <- sequenceLayer(qual, cigar(gal),

645

from="query", to="reference")

646

647

648

Split by chromosome.

649

650

<<cseq_3>>=

651

qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))

652

qual_on_ref_by_chrom <- splitAsList(qual_on_ref, seqnames(gal))

653

pos_by_chrom <- splitAsList(start(gal), seqnames(gal))

654

655

656

For each chromosome generate one GRanges object that contains

657

unique alignment start positions and attach 3 metadata columns

658

to it: the number of reads, the query sequences, and the quality

659

strings.

660

661

<<cseq_4>>=

662

gr_by_chrom <- lapply(seqlevels(gal),

663

function(seqname)

664

{

665

qseq_on_ref2 <- qseq_on_ref_by_chrom[[seqname]]

666

qual_on_ref2 <- qual_on_ref_by_chrom[[seqname]]

667

pos2 <- pos_by_chrom[[seqname]]

668

qseq_on_ref_per_pos <- split(qseq_on_ref2, pos2)

669

qual_on_ref_per_pos <- split(qual_on_ref2, pos2)

670

nread <- elementLengths(qseq_on_ref_per_pos)

671

gr_mcols <- DataFrame(nread=unname(nread),

672

qseq_on_ref=unname(qseq_on_ref_per_pos),

673

qual_on_ref=unname(qual_on_ref_per_pos))

674

gr <- GRanges(Rle(seqname, nrow(gr_mcols)),

675

IRanges(as.integer(names(nread)), width=1))

676

mcols(gr) <- gr_mcols

677

seqlevels(gr) <- seqlevels(gal)

678

679

})

680

681

682

Combine all the GRanges objects obtained in (4) in 1 big GRanges

683

object:

684

685

<<cseq_5>>=

686

gr <- do.call(c, gr_by_chrom)

687

seqinfo(gr) <- seqinfo(gal)

688

689

690

`gr' is a GRanges object that contains unique alignment start positions:

691

692

<<cseq_6>>=

693

gr[1:6]

694

695

696

Look at qseq\_on\_ref and qual\_on\_ref.

697

698

<<cseq_7>>=

699

qseq_on_ref

700

qual_on_ref

701

702

703

2 reads align to start position 13. Let's have a close look at their

704

sequences:

705

706

<<cseq_8>>=

707

mcols(gr)$qseq_on_ref[[6]]

708

709

710

and their qualities:

711

712

<<cseq_9>>=

713

mcols(gr)$qual_on_ref[[6]]

714

715

716

Note that the sequence and quality strings are those projected to the

717

reference so the first letter in those strings are on top of start

718

position 13, the 2nd letter on top of position 14, etc...

719

720

\subsection{Remove low frequency reads}

721

722

For each start position, remove reads with and under-represented sequence

723

(e.g. threshold = 20\% for the data used here which is low coverage).

724

A unique number is assigned to each unique sequence. This will make

725

future calculations easier and a little bit faster.

726

727

<<cseq_10>>=

728

qseq_on_ref <- mcols(gr)$qseq_on_ref

729

tmp <- unlist(qseq_on_ref, use.names=FALSE)

730

qseq_on_ref_id <- relist(match(tmp, tmp), qseq_on_ref)

731

732

733

Quick look at `qseq\_on\_ref\_id':

734

It's an IntegerList object with the same length and "shape"

735

as `qseq\_on\_ref'.

736

737

<<cseq_11>>=

738

qseq_on_ref_id

739

740

741

Remove the under represented ids from each list element of `qseq\_on\_ref\_id':

742

743

<<cseq_12>>=

744

qseq_on_ref_id2 <- endoapply(qseq_on_ref_id,

745

function(ids) ids[countMatches(ids, ids) >= 0.2 * length(ids)])

746

747

748

Remove corresponding sequences from `qseq\_on\_ref':

749

750

<<cseq_13>>=

751

tmp <- unlist(qseq_on_ref_id2, use.names=FALSE)

752

qseq_on_ref2 <- relist(unlist(qseq_on_ref, use.names=FALSE)[tmp],

753

qseq_on_ref_id2)

754

755

756

\subsection{Create a consensus sequence for each read group family}

757

758

Compute 1 consensus matrix per chromosome:

759

760

<<cseq_14>>=

761

split_factor <- rep.int(seqnames(gr), elementLengths(qseq_on_ref2))

762

qseq_on_ref2 <- unlist(qseq_on_ref2, use.names=FALSE)

763

qseq_on_ref2_by_chrom <- splitAsList(qseq_on_ref2, split_factor)

764

qseq_pos_by_chrom <- splitAsList(start(gr), split_factor)

765

766

cm_by_chrom <- lapply(names(qseq_pos_by_chrom),

767

function(seqname)

768

consensusMatrix(qseq_on_ref2_by_chrom[[seqname]],

769

as.prob=TRUE,

770

shift=qseq_pos_by_chrom[[seqname]]-1,

771

width=seqlengths(gr)[[seqname]]))

772

names(cm_by_chrom) <- names(qseq_pos_by_chrom)

773

774

775

'cm\_by\_chrom' is a list of consensus matrices. Each matrix has 17 rows

776

(1 per letter in the DNA alphabet) and 1 column per chromosome position.

777

778

<<cseq_15>>=

779

lapply(cm_by_chrom, dim)

780

781

782

Compute the consensus string from each consensus matrix. We'll put "+"

783

in the strings wherever there is no coverage for that position, and "N"

784

where there is coverage but no consensus.

785

786

<<cseq_16>>=

787

cs_by_chrom <- lapply(cm_by_chrom,

788

function(cm) {

789

## need to "fix" 'cm' because consensusString()

790

## doesn't like consensus matrices with columns

791

## that contain only zeroes (e.g., chromosome

792

## positions with no coverage)

793

idx <- colSums(cm) == 0L

794

cm["+", idx] <- 1

795

DNAString(consensusString(cm, ambiguityMap="N"))

796

})

797

798

799

The new consensus strings.

800

801

<<cseq_17>>=

802

cs_by_chrom

803

804

805

806

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

807

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

808

809

\section{How to compute binned averages along a genome}

810

811

In some applications, there is the need to compute the average of a

812

variable along a genome for a set of predefined fixed-width regions

813

(sometimes called "bins"). One such example is coverage.

814

Coverage is an \Robject{RleList} with one list element per

815

chromosome. Here we simulate a coverage list.

816

817

<<bim_1>>=

818

library(BSgenome.Scerevisiae.UCSC.sacCer2)

819

set.seed(22)

820

cov <- RleList(

821

lapply(seqlengths(Scerevisiae),

822

function(len) Rle(sample(-10:10, len, replace=TRUE))),

823

compress=FALSE)

824

head(cov, 3)

825

826

827

Use the \Rfunction{tileGenome} function to create a set of bins along

828

the genome.

829

830

<<bin_2>>=

831

bins1 <- tileGenome(seqinfo(Scerevisiae), tilewidth=100,

832

cut.last.tile.in.chrom=TRUE)

833

834

835

We define the following function to compute the binned average of a

836

numerical variable defined along a genome.

837

838

\begin{verbatim}

839

Arguments:

840

'bins': a GRanges object representing the genomic bins.

841

Typically obtained by calling tileGenome() with

842

'cut.last.tile.in.chrom=TRUE'.

843

'numvar': a named RleList object representing a numerical

844

variable defined along the genome covered by 'bins', which

845

is the genome described by 'seqinfo(bins)'.

846

'mcolname': the name to give to the metadata column that will

847

contain the binned average in the returned object.

848

\end{verbatim}

849

850

The function returns `bins' with an additional metadata column named

851

`mcolname' containing the binned average.

852

853

<<bin_3>>=

854

binnedAverage <- function(bins, numvar, mcolname)

855

{

856

stopifnot(is(bins, "GRanges"))

857

stopifnot(is(numvar, "RleList"))

858

stopifnot(identical(seqlevels(bins), names(numvar)))

859

bins_per_chrom <- split(ranges(bins), seqnames(bins))

860

means_list <- lapply(names(numvar),

861

function(seqname) {

862

views <- Views(numvar[[seqname]],

863

bins_per_chrom[[seqname]])

864

viewMeans(views)

865

})

866

new_mcol <- unsplit(means_list, as.factor(seqnames(bins)))

867

mcols(bins)[[mcolname]] <- new_mcol

868

bins

869

}

870

871

872

Compute the binned average for `cov':

873

874

<<bin_4>>=

875

bins1 <- binnedAverage(bins1, cov, "binned_cov")

876

bins1

877

878

879

The bin size can be modified with the \Rcode{tilewidth} argument

880

to \Rfunction{tileGenome}. For additional examples see

881

?\Rfunction{tileGenome}.

882

883

\section{Session Information}

884

885

<<SessionInfo, echo=FALSE>>=

886

sessionInfo()

887

888

889

\bibliography{GenomicRanges}

890

\bibliographystyle{plainnat}

891

892

\end{document}

Older »