~ubuntu-branches/ubuntu/raring/samtools/raring

Viewing changes to samtools.txt

Committer: Bazaar Package Importer
Author(s): Charles Plessy
Date: 2009-11-17 21:38:24 UTC
Revision ID: james.westby@ubuntu.com-20091117213824-dfouynpy3r7ismpj

Tags: 0.1.7a~dfsg-1

* New upstream release: new script sam2vcf.pl, and many other changes.
* Package converted to the format ‘3.0 (quilt)’ (debian/source/format).
* Wrote a manual page for razip (debian/razip.1).
* Better clean the example directory to make the source package
buildable twice in a row (debian/rules).

files added:
.bzr-builddeb

.bzr-builddeb/default.conf

debian/razip.1

debian/source

debian/source/format

kaln.c

kaln.h

klist.h

misc/sam2vcf.pl

sam_header.c

sam_header.h

files modified:
ChangeLog

Makefile

Makefile.mingw

NEWS

bam.c

bam.h

bam_aux.c

bam_import.c

bam_index.c

bam_maqcns.c

bam_maqcns.h

bam_md.c

bam_plcmd.c

bam_rmdup.c

bam_rmdupse.c

bam_sort.c

bamtk.c

bgzf.c

bgzip.c

debian/changelog

debian/copyright

debian/rules

debian/samtools.manpages

examples/Makefile

faidx.c

faidx.h

knetfile.c

knetfile.h

misc/novo2sam.pl

misc/samtools.pl

razf.c

razf.h

razip.c

sam.c

sam_view.c

samtools.1

samtools.txt


samtools.txt

  otherwise only alignments overlapping the specified regions
  will be output. An alignment may be given multiple times if
  it is overlapping several regions. A region can be presented,
- for example, in the following format: `chr2', `chr2:1000000'
- or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.
+ for example, in the following format: `chr2' (the whole
+ chr2), `chr2:1000000' (region starting from 1,000,000bp) or
+ `chr2:1,000,000-2,000,000' (region between 1,000,000 and
+ 2,000,000bp including the end points). The coordinate is
+ 1-based.
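The three region forms described above can be illustrated with a small parser. This is only a sketch in Python: `parse_region` is a hypothetical helper, not a samtools function, and it assumes that a missing start means position 1 and a missing end means "to the end of the reference".

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration; samtools has its own parser.
# Accepts `chr2', `chr2:1000000' and `chr2:1,000,000-2,000,000'.
_REGION_RE = re.compile(r"^([^:]+)(?::([\d,]+)(?:-([\d,]+))?)?$")


def parse_region(region: str) -> Tuple[str, int, Optional[int]]:
    """Return (name, start, end) using 1-based, inclusive coordinates.

    `end` is None when the region runs to the end of the reference.
    """
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"unrecognised region: {region!r}")
    name, start, end = match.groups()
    # Thousands separators are allowed in the text form, so drop them.
    start_pos = int(start.replace(",", "")) if start else 1
    end_pos = int(end.replace(",", "")) if end else None
    if start_pos < 1 or (end_pos is not None and end_pos < start_pos):
        raise ValueError(f"invalid coordinates in region: {region!r}")
    return name, start_pos, end_pos
```

Because the coordinates are 1-based and the end point is included, a caller converting to 0-based half-open intervals would subtract 1 from the start only.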

OPTIONS:

-b      Output in the BAM format.

-u      Output uncompressed BAM. This option saves time spent
        on compression/decompression and is thus preferred
        when the output is piped to another samtools command.

-h      Include the header in the output.

-H      Output the header only.

-S      Input is in SAM. If @SQ header lines are absent, the
        `-t' option is required.

-t FILE This file is TAB-delimited. Each line must contain
        the reference name and the length of the reference,
        one line for each distinct reference; additional
        fields are ignored. This file also defines the order
        of the reference sequences in sorting. If you run
        `samtools faidx <ref.fa>', the resultant index file
        <ref.fa>.fai can be used as this <in.ref_list> file.

-o FILE Output file [stdout]

-f INT  Only output alignments with all bits in INT present
        in the FLAG field. INT can be in hex in the format of
        /^0x[0-9A-F]+/ [0]
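The `-f INT` test described above ("all bits in INT present in the FLAG field") is a bit-mask comparison. The sketch below shows that check in Python; `parse_flag_arg` and `has_all_bits` are hypothetical names chosen for illustration, and accepting a lowercase `0x` prefix only is an assumption.

```python
def parse_flag_arg(text: str) -> int:
    """Parse a -f style argument given in decimal or as 0x-prefixed hex."""
    if text.startswith("0x"):
        return int(text, 16)
    return int(text, 10)


def has_all_bits(flag: int, required: int) -> bool:
    """True when every bit set in `required` is also set in `flag`."""
    return flag & required == required
```

With the default value of 0 no bits are required, so every alignment passes the filter.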

[...]

faidx   samtools faidx <ref.fasta> [region1 [...]]

        Index reference sequence in the FASTA format or extract
        subsequence from indexed reference sequence. If no region is
        specified, faidx will index the file and create
        <ref.fasta>.fai on the disk. If regions are specified, the
        subsequences will be retrieved and printed to stdout in the
        FASTA format. The input file can be compressed in the RAZF
        format.
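The `-t FILE` description earlier states that a reference list is TAB-delimited, that only the first two fields (name and length) matter, and that a `.fai` index produced by `samtools faidx` can serve as such a list. A minimal reader along those lines might look like this; `read_ref_list` is a hypothetical helper, and skipping empty lines is an assumption.

```python
from typing import Iterable, List, Tuple


def read_ref_list(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Read (reference name, length) pairs from TAB-delimited lines.

    Fields after the second are ignored. The returned order is the
    file order, which is also the order used when sorting.
    """
    refs: List[Tuple[str, int]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: tolerate blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected name and length: {line!r}")
        refs.append((fields[0], int(fields[1])))
    return refs
```

In practice the lines would come from an open file, for example `read_ref_list(open("ref.fa.fai"))`, where the file name is only a placeholder.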

pileup  samtools pileup [-f in.ref.fasta] [-t in.ref_list]
        [-l in.site_list] [-iscgS2] [-T theta] [-N nHap]
        [-r pairDiffRate] <in.bam>|<in.sam>

        Print the alignment in the pileup format. In the pileup
        format, each line represents a genomic position, consisting of
        chromosome name, coordinate, reference base, read bases, read
        qualities and alignment mapping qualities. Information on
        match, mismatch, indel, strand, mapping quality and start and
        end of a read is encoded in the read base column. In
        this column, a dot stands for a match to the reference base
        on the forward strand, a comma for a match on the reverse
        strand, `ACGTN' for a mismatch on the forward strand and
        `acgtn' for a mismatch on the reverse strand. A pattern
        `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion
        between this reference position and the next reference
        position. The length of the insertion is given by the integer in
        the pattern, followed by the inserted sequence. Similarly, a
        pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
        reference. The deleted bases will be presented as `*' in the
        following lines. Also at the read base column, a symbol `^'
        marks the start of a read segment which is a contiguous
        subsequence on the read separated by `N/S/H' CIGAR operations.
        The ASCII of the character following `^' minus 33 gives the
        mapping quality. A symbol `$' marks the end of a read
        segment.
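The read base column described above can be split into tokens with a short scanner. The following Python sketch is illustrative only: `tokenize_read_bases` and the token names are hypothetical, and the scanner reads exactly as many indel bases as the integer announces rather than applying the regular expression greedily.

```python
from typing import List, Tuple, Union

Token = Tuple[str, Union[str, int]]


def tokenize_read_bases(column: str) -> List[Token]:
    """Split a pileup read base column into (kind, value) tokens."""
    tokens: List[Token] = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # The next character encodes the mapping quality as ASCII - 33.
            if i + 1 >= len(column):
                raise ValueError("'^' without a mapping quality character")
            tokens.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("end", ""))
            i += 1
        elif c in "+-":
            # Read the length, then exactly that many inserted/deleted bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            seq = column[j:j + length]
            if len(seq) < length:
                raise ValueError("truncated indel sequence")
            tokens.append(("insertion" if c == "+" else "deletion", seq))
            i = j + length
        elif c == ".":
            tokens.append(("match_fwd", c))
            i += 1
        elif c == ",":
            tokens.append(("match_rev", c))
            i += 1
        elif c == "*":
            tokens.append(("deleted_base", c))
            i += 1
        elif c in "ACGTN":
            tokens.append(("mismatch_fwd", c))
            i += 1
        elif c in "acgtn":
            tokens.append(("mismatch_rev", c))
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

Reading a fixed number of bases after the length matters because a mismatch base may directly follow an insertion, as in `+2AGt`, where only `AG` belongs to the insertion.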

- If option -c is applied, the consensus base, consensus quality,
- SNP quality and RMS mapping quality of the reads covering the
- site will be inserted between the `reference base' and the
- `read bases' columns.
+ If option -c is applied, the consensus base, Phred-scaled
+ consensus quality, SNP quality (i.e. the Phred-scaled probability
+ of the consensus being identical to the reference) and root mean
+ square (RMS) mapping quality of the reads covering the site will
+ be inserted between the `reference base' and the `read bases'
+ columns.
  An indel occupies an additional line. Each indel line consists of
  chromosome name, coordinate, a star, the genotype, consensus
  quality, SNP quality, RMS mapping quality, # covering reads, the
  first allele, the second allele, # reads supporting the first
  allele, # reads supporting the second allele and # reads
  containing indels different from the top two alleles.
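The paragraph above refers to Phred-scaled qualities and to the root mean square of mapping qualities. As a reminder of what those terms mean, the sketch below uses the standard Phred definition Q = -10 log10(P); the function names are hypothetical and nothing here reproduces the MAQ consensus model itself.

```python
import math
from typing import Sequence


def phred_to_prob(quality: float) -> float:
    """Convert a Phred-scaled value Q to the probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def prob_to_phred(prob: float) -> float:
    """Convert a probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    return -10 * math.log10(prob)


def rms(values: Sequence[float]) -> float:
    """Root mean square of a non-empty sequence of mapping qualities."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```

Under this definition a SNP quality of 30 corresponds to a probability of 0.001 that the consensus is identical to the reference.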

OPTIONS:

-s        Print the mapping quality as the last column. This
          option makes the output easier to parse, although
          this format is not space efficient.

[...]

-i        Only output pileup lines containing indels.

-f FILE   The reference sequence in the FASTA format. Index
          file FILE.fai will be created if absent.

-M INT    Cap mapping quality at INT [60]

-t FILE   List of reference names and sequence lengths, in
          the format described for the import command. If
          this option is present, samtools assumes the input
          <in.alignment> is in SAM format; otherwise it
          assumes the input is in BAM format.

-l FILE   List of sites at which pileup is output. This file
          is space delimited. The first two columns are
          required to be chromosome and 1-based coordinate.
          Additional columns are ignored. It is recommended
          to use option -s together with -l as in the default
          format we may not know the mapping quality.

-c        Call the consensus sequence using MAQ consensus
          model. Options -T, -N, -I and -r are only effective
          when -c or -g is in use.

-g        Generate genotype likelihood in the binary GLFv3
          format. This option suppresses -c, -i and -s.

-T FLOAT  The theta parameter (error dependency coefficient)
          in the maq consensus calling model [0.85]

-N INT    Number of haplotypes in the sample (>=2) [2]

-r FLOAT  Expected fraction of differences between a pair of
          haplotypes [0.001]

-I INT    Phred probability of an indel in sequencing/prep.
          [40]

tview   samtools tview <in.sorted.bam> [ref.fasta]

        Text alignment viewer (based on the ncurses library). In the
        viewer, press `?' for help and press `g' to check the
        alignment start from a region in the format like
        `chr10:10,000,000'.

fixmate samtools fixmate <in.nameSrt.bam> <out.bam>

        Fill in mate coordinates, ISIZE and mate related flags from a
        [...]

rmdup   samtools rmdup <input.srt.bam> <out.bam>

        Remove potential PCR duplicates: if multiple read pairs have
        identical external coordinates, only retain the pair with
        highest mapping quality. This command ONLY works with FR
        orientation and requires that ISIZE is correctly set.
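The duplicate rule stated above (same external coordinates, keep the pair with the highest mapping quality) can be sketched as a grouping step. This is not the rmdup implementation: `ReadPair` and `remove_duplicates` are hypothetical, the input is assumed to be already reduced to one record per pair, and ties are resolved by keeping the first pair seen.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical summary of a read pair; not a samtools structure."""
    name: str
    chrom: str
    left: int   # leftmost external coordinate of the pair
    right: int  # rightmost external coordinate of the pair
    mapq: int


def remove_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep, per (chrom, left, right), the pair with the highest mapq."""
    best: Dict[Tuple[str, int, int], ReadPair] = {}
    for pair in pairs:
        key = (pair.chrom, pair.left, pair.right)
        kept = best.get(key)
        # Strict comparison: on a tie the earlier pair is retained.
        if kept is None or pair.mapq > kept.mapq:
            best[key] = pair
    return list(best.values())
```

The sketch ignores orientation and library information, which is one reason it should be read as an illustration of the rule rather than a replacement for the command.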

rmdupse samtools rmdupse <input.srt.bam> <out.bam>

        Remove potential duplicates for single-ended reads. This
        command will treat all reads as single-ended even if they are
        paired in fact.

fillmd  samtools fillmd [-e] <aln.bam> <ref.fasta>

        Generate the MD tag. If the MD tag is already present, this
        command will give a warning if the MD tag generated is
        different from the existing tag.

        OPTIONS:

        -e      Convert the read base to = if it is identical to
                the aligned reference base. Indel caller does not
                support the = bases at the moment.
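The effect of `-e` on a single read can be pictured with a toy example. The sketch below assumes the read and the reference segment are already aligned base for base with no gaps, and that the comparison ignores case; both are simplifying assumptions, and `mask_matches` is a hypothetical name rather than anything in fillmd.

```python
def mask_matches(read: str, ref: str) -> str:
    """Replace each read base equal to the aligned reference base by '='.

    Assumes an ungapped alignment, so both strings have equal length.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference segment differ in length")
    return "".join(
        "=" if base.upper() == ref_base.upper() else base
        for base, ref_base in zip(read, ref)
    )
```

Only mismatching bases remain visible after masking, which is why tools that expect literal bases, such as the indel caller mentioned above, cannot use this output.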

SAM FORMAT
SAM is TAB-delimited. Apart from the header lines, which are started
with the `@' symbol, each alignment line consists of:

[...]
+-------+--------------------------------------------------+

LIMITATIONS
o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
  bam_aux.c.

o CIGAR operation P is not properly handled at the moment.

o In merging, the input files are required to have the same number of
  reference sequences. The requirement can be relaxed. In addition,
  merging does not reconstruct the header dictionaries automatically.
  End users have to provide the correct header. Picard is better at
  merging.

o Samtools' rmdup does not work for single-end data and does not remove
  [...]

AUTHOR
Heng Li from the Sanger Institute wrote the C version of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
Ruan from Beijing Genomics Institute wrote the RAZF library. Various
people in the 1000Genomes Project contributed to the SAM format
specification.

[...]

- samtools-0.1.6 2 September 2009 samtools(1)
+ samtools-0.1.7 10 November 2009 samtools(1)
