~ubuntu-branches/ubuntu/oneiric/libxml-tokeparser-perl/oneiric


Viewing changes to TokeParser.xml

Committer: Bazaar Package Importer
Author(s): Nathan Scott
Date: 2010-10-03 11:00:36 UTC
Revision ID: james.westby@ubuntu.com-20101003110036-s82cygco63wsfqp9

Tags: upstream-0.05

Import upstream version 0.05

files added:

Changes

MANIFEST

META.yml

Makefile.PL

README

TODO

TokeParser.pm

TokeParser.xml

t/1.normal.t

t/2.extended.t


TokeParser.xml

<?xml version='1.0' encoding='iso-8859-1'?>

<pod>

<head>

<title>XML::TokeParser - Simplified interface to XML::Parser</title>

</head>

<sect1>

<title>SYNOPSIS</title>

<verbatim><![CDATA[

use XML::TokeParser;

]]></verbatim>

<verbatim><![CDATA[

#parse from file

my $p=XML::TokeParser->new('file.xml');

]]></verbatim>

<verbatim><![CDATA[

#parse from open handle

open(IN,'file.xml') or die $!;

my $p=XML::TokeParser->new(\*IN,Noempty=>1);

]]></verbatim>

<verbatim><![CDATA[

#parse literal text

my $text='<tag xmlns="http://www.omsdev.com">text</tag>';

my $p=XML::TokeParser->new(\$text,Namespaces=>1);

]]></verbatim>

<verbatim><![CDATA[

#read next token

my $token=$p->get_token();

]]></verbatim>

<verbatim><![CDATA[

#skip to <title> and read text

$p->get_tag('title');

$p->get_text();

]]></verbatim>

<verbatim><![CDATA[

#read text of next <para>, ignoring any internal markup

$p->get_tag('para');

$p->get_trimmed_text('/para');

]]></verbatim>

</sect1>

<sect1>

<title>DESCRIPTION</title>

<para>

XML::TokeParser provides a procedural ("pull mode") interface to XML::Parser

in much the same way that Gisle Aas' HTML::TokeParser provides a procedural

interface to HTML::Parser. XML::TokeParser splits its XML input up into

"tokens," each corresponding to an XML::Parser event.

</para>

<para>

A token is a reference to an array whose first element is an event-type

string and whose last element is the literal text of the XML input that

generated the event, with intermediate elements varying according to the

event type:

</para>

<list>

<item><itemtext>Start tag</itemtext>

<para>

The token has five elements: 'S', the element's name, a reference to a hash

of attribute values keyed by attribute names, a reference to an array of

attribute names in the order in which they appeared in the tag, and the

literal text.

</para>

</item>

<item><itemtext>End tag</itemtext>

<para>

The token has three elements: 'E', the element's name, and the literal text.

</para>

</item>

<item><itemtext>Character data (text)</itemtext>

<para>

The token has three elements: 'T', the parsed text, and the literal text.

All contiguous runs of text are gathered into single tokens; there will

never be two 'T' tokens in a row.

</para>

</item>

<item><itemtext>Comment</itemtext>

<para>

The token has three elements: 'C', the parsed text of the comment, and the

literal text.

</para>

</item>

<item><itemtext>Processing instruction</itemtext>

<para>

The token has four elements: 'PI', the target, the data, and the literal

text.

</para>

</item>

</list>

<para>

The literal text includes any markup delimiters (pointy brackets,

<![CDATA[, etc.), entity references, and numeric character references and

is in the XML document's original character encoding. All other text is in

UTF-8 (unless the Latin option is set, in which case it's in ISO-8859-1)

regardless of the original encoding, and all entity and character

references are expanded.

</para>
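
<para>
As a rough illustration of the token layouts described above, the following
sketch reads every token from a small in-memory document and unpacks it
according to its event-type string. The sample document and the variable
names are assumptions made for this example; they are not part of the module.
</para>

<verbatim><![CDATA[
use strict;
use warnings;
use XML::TokeParser;

# Hypothetical sample input, chosen to produce one token of each type.
my $xml = '<doc id="1"><!-- note --><?render fast?>Hello &amp; bye</doc>';
my $p = XML::TokeParser->new(\$xml) or die "cannot create parser";

my @seen;    # event-type strings in the order they were read
while (defined(my $token = $p->get_token())) {
    my $type = $token->[0];
    my $raw  = $token->[-1];    # the literal text is always the last element
    push @seen, $type;
    if ($type eq 'S') {
        my (undef, $name, $attrs, $order) = @$token;
        print "start $name\n";
        print "  $_=$attrs->{$_}\n" for @$order;    # attributes in document order
    }
    elsif ($type eq 'E') {
        print "end $token->[1]\n";
    }
    elsif ($type eq 'T') {
        print "text '$token->[1]' (literal: '$raw')\n";
    }
    elsif ($type eq 'C') {
        print "comment '$token->[1]'\n";
    }
    elsif ($type eq 'PI') {
        print "processing instruction: target=$token->[1] data=$token->[2]\n";
    }
}
]]></verbatim>

<para>
For the text token, the parsed text should have the entity reference
expanded while the literal text keeps it as written in the input.
</para>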

<para>

If the Namespaces option is set, element and attribute names are prefixed

by their (possibly empty) namespace URIs enclosed in curly brackets and

xmlns:* attributes do not appear in 'S' tokens.


</para>
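
<para>
A minimal sketch of the Namespaces option, assuming the sample element from
the SYNOPSIS with a hypothetical "kind" attribute added for illustration. It
prints the element name and the attribute names as they appear in the 'S'
token, so the curly-bracket prefixes can be inspected directly.
</para>

<verbatim><![CDATA[
use strict;
use warnings;
use XML::TokeParser;

# Sample input based on the SYNOPSIS; the "kind" attribute is made up.
my $text = '<tag xmlns="http://www.omsdev.com" kind="demo">text</tag>';
my $p = XML::TokeParser->new(\$text, Namespaces => 1)
    or die "cannot create parser";

my $start = $p->get_token();    # the 'S' token for the root element
my $name  = $start->[1];        # element name, prefixed by its {URI}
print "element:   $name\n";

# Attribute names in document order, each with its own (possibly empty) prefix.
print "attribute: $_\n" for @{ $start->[3] };
]]></verbatim>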

</sect1>

<sect1>

<title>METHODS</title>

<list>

<item><itemtext>$p = XML::TokeParser->new($input, [options])</itemtext>
<para>
Creates a new parser, specifying the input source and any options. If
$input is a string, it is the name of the file to parse. If $input is a
reference to a string, that string is the actual text to parse. If $input
is a reference to a typeglob or an IO::Handle object corresponding to an
open file or socket, the text read from the handle will be parsed.
</para>
<para>
Options are name=>value pairs and can be any of the following:
</para>
</item>

<list>

<item><itemtext>Namespaces</itemtext>
<para>
If set to a true value, namespace processing is enabled.
</para>
</item>
<item><itemtext>ParseParamEnt</itemtext>
<para>
This option is passed on to the underlying XML::Parser object; see that
module's documentation for details.
</para>
</item>
<item><itemtext>Noempty</itemtext>
<para>
If set to a true value, text tokens consisting of only whitespace (such as
those created by indentation and line breaks in between tags) will be
ignored.
</para>
</item>
<item><itemtext>Latin</itemtext>
<para>
If set to a true value, all text other than the literal text elements of
tokens will be translated into the ISO 8859-1 (Latin-1) character encoding
rather than the normal UTF-8 encoding.
</para>
</item>
<item><itemtext>Catalog</itemtext>
<para>
The value is the URI of a catalog file used to resolve PUBLIC and SYSTEM
identifiers. See XML::Catalog for details.
</para>
</item>

</list>

<item><itemtext>$token = $p->get_token()</itemtext>
<para>
Returns the next token, as an array reference, from the input. Returns
undef if there are no remaining tokens.
</para>
</item>
<item><itemtext>$p->unget_token($token,...)</itemtext>
<para>
Pushes tokens back so they will be re-read. Useful if you've read one or
more tokens too far.
</para>
</item>

<item><itemtext>$token = $p->get_tag( [$token] )</itemtext>
<para>
If no argument is given, skips tokens until the next start tag or end tag
token. If an argument is given, skips tokens until the start tag or end tag
(if the argument begins with '/') for the named element. The returned
token does not include an event type code; its first element is the element
name, prefixed by a '/' if the token is for an end tag.
</para>
</item>

<item><itemtext>$text = $p->get_text( [$token] )</itemtext>
<para>
If no argument is given, returns the text at the current position, or an
empty string if the next token is not a 'T' token. If an argument is given,
gathers up all text between the current position and the specified start or
end tag, stripping out any intervening tags (much like the way a typical
Web browser deals with unknown tags).
</para>
</item>

<item><itemtext>$text = $p->get_trimmed_text( [$token] )</itemtext>
<para>
Like get_text(), but deletes any leading or trailing whitespace and
collapses each run of whitespace (including newlines) into a single space.
</para>
</item>

</list>
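
<para>
The methods above can be combined as in the following sketch, which peeks at
the first token, pushes it back, and then pulls the title and paragraph text
out of a small document. The sample document and variable names are
assumptions made for this example. The event type of the peeked token is
copied before the token is pushed back, on the cautious assumption that a
later call may modify the same array.
</para>

<verbatim><![CDATA[
use strict;
use warnings;
use XML::TokeParser;

# Hypothetical sample document for this sketch.
my $xml = <<'END_XML';
<article>
  <title>Pull parsing</title>
  <para>Tokens are   read
     <em>on demand</em>.</para>
</article>
END_XML

my $p = XML::TokeParser->new(\$xml, Noempty => 1)
    or die "cannot create parser";

# Peek at the next token, then push it back so it is read again later.
my $first      = $p->get_token();
my $first_type = $first->[0];
$p->unget_token($first);

# Skip to <title> and read its text.
$p->get_tag('title');
my $title = $p->get_text('/title');

# Skip to <para>; the returned token starts with the element name.
my $para = $p->get_tag('para');

# Gather the paragraph text, dropping the <em> markup and
# normalizing whitespace.
my $body = $p->get_trimmed_text('/para');

print "first token type: $first_type\n";
print "title: $title\n";
print "tag:   $para->[0]\n";
print "body:  $body\n";
]]></verbatim>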

</sect1>

<sect1>

<title>DIFFERENCES FROM HTML::TokeParser</title>
<para>
Uses a true XML parser rather than a modified HTML parser.
</para>
<para>
Text and comment tokens include extracted text as well as literal text.
</para>
<para>
PI tokens include target and data as well as literal text.
</para>
<para>
No tokens for declarations.
</para>
<para>
No "textify" hash.
</para>
</sect1>

<sect1>

<title>EXAMPLES</title>

<sect2>

<title>Print method signatures from the XML version of this POD page</title>

<verbatim><![CDATA[

#!/usr/bin/perl -w
use strict;
use XML::TokeParser;

my $t;
my $p=XML::TokeParser->new('TokeParser.xml',Noempty=>1) or die $!;

# skip ahead to the section whose title is METHODS
while ($p->get_tag('title') && $p->get_text('/title') ne 'METHODS') {
    ;
}
$p->get_tag('list');

# print the itemtext of each item in the method list
while (($t=$p->get_tag()->[0]) ne '/list') {
    if ($t eq 'item') {
        $p->get_tag('itemtext');
        print $p->get_text('/itemtext'),"\n";
        $p->get_tag('/item');
    }
    else {
        $p->get_tag('/list'); # assumes no nesting here!
    }
}

]]></verbatim>

</sect2>

</sect1>

<sect1>

<title>AUTHOR</title>
<para>
Eric Bohlman (ebohlman@omsdev.com)
</para>

<para>

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

</para>

</sect1>

<sect1>

<title>SEE ALSO</title>

<verbatim><![CDATA[

XML::Parser
XML::Catalog
HTML::TokeParser

]]></verbatim>

</sect1>

</pod>
