.\" This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.
.\"
.\" TagSoup is licensed under the Apache License,
.\" Version 2.0. You may obtain a copy of this license at
.\" http://www.apache.org/licenses/LICENSE-2.0 . You may also have
.\" additional legal rights not granted by this license.
.\"
.\" TagSoup is distributed in the hope that it will be useful, but
.\" unless required by applicable law or agreed to in writing, TagSoup
.\" is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
.\" OF ANY KIND, either express or implied; not even the implied warranty
.\" of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
.TH TAGSOUP "1" "January 2008" "TagSoup 1.2" "User Commands"
.SH NAME
tagsoup \- convert nasty, ugly HTML to clean XHTML
.SH SYNOPSIS
.B java -jar tagsoup-1.2
.SH DESCRIPTION
.\" Add any additional description here
Rectify arbitrary HTML into clean XHTML,
using a tailored description of HTML.
The output will be well-formed XML, but not necessarily
should be processed into corresponding output files
.TP
.BI --encoding= encoding
specifies the encoding of input files
.TP
.BI --output-encoding= encoding
specifies the encoding of the output
(if the encoding name begins with ``utf'',
the output will not contain character entities;
otherwise, all non-ASCII characters are
represented as entities)
output rectified HTML rather than XML,
omitting the XML declaration
and any namespace declarations
output rectified HTML rather than XML
(end-tags are omitted for empty elements,
and no character escaping is done in
script and style elements)
.TP
.B --omit-xml-declaration
omit the XML declaration
output lexical features (specifically comments and any DOCTYPE declaration)
suppress namespaces in output
suppress unknown non-HTML elements in output
suppress default attribute values
change explicit colons
in element and attribute names
don't restart any restartable elements
pass through ignorable whitespace
(whitespace in element-only content)
via the SAX ContentHandler method ignorableWhitespace
treat unknown non-HTML elements as allowing any content (default)
treat unknown non-HTML elements as empty elements
don't allow unknown non-HTML elements to be root elements
.TP
.BI --doctype-system= system-id
force DOCTYPE declaration to be output with specified system identifier
.TP
.BI --doctype-public= public-id
force DOCTYPE declaration to be output with specified public identifier
.TP
.B --standalone=[yes|no]
specify standalone pseudo-attribute in output XML declaration
.TP
.BI --version= version
specify version pseudo-attribute in output XML declaration
(does not affect actual version of XML output)
treat the CDATA-content elements
output PYX format rather than XML
input is PYX-format HTML
reuse the same Parser object internally
output version number
.B TagSoup
is a parser and reformatter for nasty, ugly HTML.
Its normal processing mode is to accept HTML files on the command line,
or from the standard input if none are given, and output them
to the standard output. The encoding is assumed to be the platform-local
encoding on input, and is always UTF-8 on output.
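.PP
When a different output encoding is requested with
.BR --output-encoding ,
the entity rule given for that option applies.
The following Java fragment is only a sketch of that rule, not the code
TagSoup itself uses: the class
.BR EntitySketch ,
the method
.BR encodeText ,
the case-insensitive match on ``utf'', and the choice of decimal numeric
character references are all assumptions made for illustration.
.nf
```java
import java.util.Locale;

class EntitySketch {
    // Hypothetical helper, not part of TagSoup.
    // Assumption: "begins with utf" is matched case-insensitively.
    // Assumption: entities are written as decimal numeric references.
    static String encodeText(String text, String encodingName) {
        if (encodingName.toLowerCase(Locale.ROOT).startsWith("utf")) {
            // UTF output: characters are written as themselves.
            return text;
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (cp < 128) {
                // ASCII passes through unchanged.
                out.appendCodePoint(cp);
            } else {
                // Non-ASCII becomes a numeric character reference.
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf" + (char) 0xE9;
        System.out.println(encodeText(sample, "ISO-8859-1"));
        System.out.println(encodeText(sample, "UTF-8"));
    }
}
```
.fi
.PP
Under these assumptions, a call with the encoding name ISO-8859-1 turns
U+00E9 into &#233; while a call with UTF-8 returns the text unchanged.
.PP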
option is given, each input file is processed into an output file of the
corresponding name, with the extension changed to
If the extension is already
.PP
TagSoup will repair, by whatever means necessary,
violations of XML well-formedness. In particular, it will fix up
malformed attribute names and supply missing attribute-value quotation marks.
More significantly, it supplies end-tags where HTML allows them
to be omitted, and sometimes where it doesn't. It will even supply
start-tags where necessary; for example, if a document begins with a
<li> tag, TagSoup will automatically prefix it with <html><body><ul>.
.PP
TagSoup can be fooled by missing close quotes after attribute values, and by
incorrect character encodings (it does not contain an encoding guesser).
.PP
TagSoup doesn't understand namespace declarations, which are not properly
part of HTML. Instead, any element or attribute name beginning
.IB foo :
will be put into the artificial namespace
.RI urn:x-prefix: foo .
.PP
For the same reasons, namespace-qualified attributes like
can't be returned as default values,
though an explicit attribute in the xml namespace
will be returned with the proper namespace URI.
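.PP
The prefix handling described above can be pictured with a small Java
fragment.
It is a sketch under stated assumptions, not TagSoup's implementation: the
class
.B PrefixNamespaceSketch
and the method
.B namespaceForPrefixedName
are hypothetical names, and names without a prefix are simply reported as
null because this sketch does not cover them.
.nf
```java
class PrefixNamespaceSketch {
    // URI bound to the reserved xml prefix by the Namespaces in XML spec.
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Hypothetical helper, not part of TagSoup.
    // Maps a prefixed element or attribute name to a namespace URI.
    // Returns null when the name has no prefix (not covered here).
    static String namespaceForPrefixedName(String name) {
        int colon = name.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        String prefix = name.substring(0, colon);
        if (prefix.equals("xml")) {
            // Explicit xml-prefixed names get the proper namespace URI.
            return XML_NS;
        }
        // Any other prefix lands in the artificial namespace.
        return "urn:x-prefix:" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(namespaceForPrefixedName("foo:bar"));
        System.out.println(namespaceForPrefixedName("xml:lang"));
    }
}
```
.fi
.PP
With these definitions, the name foo:bar maps to urn:x-prefix:foo, and
xml:lang maps to the standard XML namespace URI.
.PP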
.SH AUTHOR
John Cowan <cowan@ccil.org>
.SH COPYRIGHT
Copyright \(co 2002-2008 John Cowan
.PP
TagSoup is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.