<?xml version="1.0" standalone="no"?>
<!DOCTYPE s1 SYSTEM "../../style/dtd/document.dtd">
<!--
 * Copyright 2001-2004 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
-->
<!-- $Id: xsl_whitespace_design.xml,v 1.2 2009/12/10 03:18:38 matthewoliver Exp $ -->
<s1 title="&lt;xsl:strip/preserve-space&gt;">

  <ul>
    <li><link anchor="functionality">Functionality</link></li>
    <li><link anchor="identify">Identifying strippable whitespace nodes</link></li>
    <li><link anchor="which">Determining which nodes to strip</link></li>
    <li><link anchor="strip">Stripping nodes</link></li>
    <li><link anchor="filter">Filtering whitespace nodes</link></li>
  </ul>

<anchor name="functionality"/>
<s2 title="Functionality">

  <p>The <code>&lt;xsl:strip-space&gt;</code> and <code>&lt;xsl:preserve-space&gt;</code>
  elements are used to control the way whitespace nodes in the source XML
  document are handled. These elements have no impact on whitespace in the XSLT
  stylesheet. Both elements can occur only as top-level elements, possibly more
  than once, and the elements are always empty.</p>

  <p>Both elements take one attribute, "elements", which contains a
  whitespace-separated list of named nodes which should be stripped from or
  preserved in the source document. These names can be in one of these three
  formats (NameTest format):</p>

  <ul>
    <li>
      All whitespace nodes:
      <code>elements="*"</code>
    </li>
    <li>
      All whitespace nodes with a namespace:
      <code>elements="&lt;namespace&gt;:*"</code>
    </li>
    <li>
      Specific whitespace nodes: <code>elements="&lt;qname&gt;"</code>
    </li>
  </ul>

</s2><anchor name="identify"/>
<s2 title="Identifying strippable whitespace nodes">

  <p>The DOM detects all text nodes and assigns them the type <code>TEXT</code>.
  All text nodes are scanned to detect whitespace-only nodes. A text node is
  considered a whitespace node only if it consists entirely of characters from
  the set { 0x09, 0x0a, 0x0d, 0x20 }. The DOM implementation class has a static
  method used to test for such characters:</p>

<source>
    private static final boolean isWhitespaceChar(char c) {
        return c == 0x20 || c == 0x0A || c == 0x0D || c == 0x09;
    }
</source>

  <p>The characters are checked in order of probability, most likely first.</p>
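
  <p>As an illustration of how a whole text node could be classified with this
  character test, the following is a minimal sketch. The class name
  <code>WhitespaceScan</code> and the helper <code>isWhitespaceOnly()</code> are
  hypothetical and are not taken from the DOM implementation class.</p>

```java
// Sketch only: WhitespaceScan and isWhitespaceOnly() are hypothetical names.
final class WhitespaceScan {

    // Same test as in the text: space, line feed, carriage return, tab.
    static boolean isWhitespaceChar(char c) {
        return c == 0x20 || c == 0x0A || c == 0x0D || c == 0x09;
    }

    // Returns true only if every character passes isWhitespaceChar().
    // Note that an empty sequence is reported as whitespace-only here.
    static boolean isWhitespaceOnly(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isWhitespaceChar(text.charAt(i))) {
                return false;   // stop at the first non-whitespace character
            }
        }
        return true;
    }
}
```

  <p>Characters outside the four listed above, such as the non-breaking space
  (0xA0), do not count as whitespace in this test.</p>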

  <p>The DOM has a bit-array that is used to tag text nodes as strippable:</p>

<source>private int[] _whitespace;</source>

  <p>There are two methods in the DOM implementation class for accessing this
  bit-array: <code>markWhitespace(node)</code> and <code>isWhitespace(node)</code>.
  The array is resized together with all other arrays in the DOM by the
  <code>DOM.resizeArrays()</code> method. The bits in the array are set in the
  <code>DOM.maybeCreateTextNode()</code> method. This method must know whether
  the current node is located under an element with an
  <code>xml:space="&lt;value&gt;"</code> attribute in the DOM, in which
  case it is not a strippable whitespace node.</p>
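
  <p>A bit-array over an <code>int[]</code> could be accessed roughly as
  follows. This is a sketch under assumptions: the class name
  <code>WhitespaceBits</code> is hypothetical, and it grows its own array on
  demand, whereas the text says the real array is resized by
  <code>DOM.resizeArrays()</code>.</p>

```java
import java.util.Arrays;

// Sketch only: WhitespaceBits is a hypothetical stand-in for the DOM's
// _whitespace field and its two accessor methods.
final class WhitespaceBits {
    private int[] bits = new int[1];   // 32 node flags per int

    // Sets the bit for the given node index.
    void markWhitespace(int node) {
        int word = node >>> 5;                     // node / 32
        if (word >= bits.length) {
            // Assumption: grow locally; the real DOM resizes all arrays together.
            bits = Arrays.copyOf(bits, Math.max(word + 1, bits.length * 2));
        }
        bits[word] |= 1 << (node & 31);            // node % 32
    }

    // Reads the bit for the given node index; unknown nodes are not whitespace.
    boolean isWhitespace(int node) {
        int word = node >>> 5;
        return word < bits.length && (bits[word] & (1 << (node & 31))) != 0;
    }
}
```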

  <p>An auxiliary class, WhitespaceHandler, is used for this purpose. The class
  works somewhat like a stack, where you "push" a new strip/preserve setting
  together with the node in which this setting was determined. This means that
  every time the DOM builder encounters an <code>xml:space</code> attribute
  it will invoke a method on an instance of the WhitespaceHandler class to
  signal that a new preserve/strip setting has been encountered. This is done
  in the <code>makeAttributeNode()</code> method. The whitespace handler stores the
  new setting and pushes the current element node on its stack. When the
  DOM builder closes an element (in <code>endElement()</code>), it invokes
  another method of the whitespace handler to check if the strip/preserve
  setting is still valid. If the setting is now invalid (we are closing the
  element whose node id is on the top of the stack) the handler inverts the
  setting and pops the element node id off the stack. The previous
  strip/preserve setting is then valid again, and the id of the node where
  that setting was defined is on the top of the stack.</p>
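
  <p>The push/invert/pop behaviour described above can be sketched as follows.
  The class and method names are hypothetical and do not reproduce the actual
  WhitespaceHandler API. The sketch assumes that a node is pushed only when
  the setting actually changes, which is what makes inverting on pop correct.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: names are hypothetical, not the real WhitespaceHandler API.
final class WhitespaceHandlerSketch {
    private final Deque<Integer> nodes = new ArrayDeque<>();
    private boolean preserve = false;   // default: whitespace may be stripped

    // Called when an xml:space attribute is seen on elementNode.
    void xmlSpace(int elementNode, boolean preserveSpace) {
        if (preserveSpace != preserve) {    // push only on a real change
            nodes.push(elementNode);
            preserve = preserveSpace;
        }
    }

    // Called when an element is closed.
    void endElement(int elementNode) {
        if (!nodes.isEmpty() && nodes.peek() == elementNode) {
            nodes.pop();
            preserve = !preserve;           // the previous setting is valid again
        }
    }

    boolean preserveSpace() {
        return preserve;
    }
}
```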
</s2><anchor name="which"/>
<s2 title="Determining which nodes to strip">

  <p>A text node is never stripped unless it contains only whitespace
  characters (Unicode characters 0x09, 0x0A, 0x0D and 0x20). Stripping a text
  node means that the node disappears from the DOM, so that it is never
  included in the output and is ignored by all functions such as
  <code>count()</code>. A text node is preserved if any of the following apply:</p>

  <ul>
    <li>
      the element name of the parent of the text node is in the set of
      elements listed in <code>&lt;xsl:preserve-space&gt;</code>
    </li>
    <li>
      the text node contains at least one non-whitespace character
    </li>
    <li>
      an ancestor of the whitespace text node has an attribute of
      <code>xml:space="preserve"</code>, and no closer ancestor has an
      attribute of <code>xml:space="default"</code>.
    </li>
  </ul>

  <p>Otherwise, the text node is stripped. Initially the set of
  whitespace-preserving element names contains all element names, so the
  default behaviour is to preserve all whitespace text nodes.</p>

  <p>This seems simple enough, but resolving conflicts between matching
  <code>&lt;xsl:strip-space&gt;</code> and <code>&lt;xsl:preserve-space&gt;</code>
  elements requires a lot of thought. Our first consideration is import
  precedence; the match with the highest import precedence is always chosen.
  Import precedence is determined by the order in which the compared elements
  are visited. (In this case those elements are the top-level
  <code>&lt;xsl:strip-space&gt;</code> and <code>&lt;xsl:preserve-space&gt;</code>
  elements.) This example is taken from the XSLT recommendation:</p>

  <ul>
    <li>stylesheet A imports stylesheets B and C in that order;</li>
    <li>stylesheet B imports stylesheet D;</li>
    <li>stylesheet C imports stylesheet E.</li>
  </ul>

<p>Then the order of import precedence (lowest first) is D, B, E, C, A.</p>
<p>Our next consideration is the priority of NameTests (XPath spec):</p>

  <ul>
    <li>
      <code>elements="&lt;qname&gt;"</code> has priority 0
    </li>
    <li>
      <code>elements="&lt;namespace&gt;:*"</code> has priority -0.25
    </li>
    <li>
      <code>elements="*"</code> has priority -0.5
    </li>
  </ul>

  <p>It is considered an error if the decision is still ambiguous after this,
  and it is up to the implementors to decide what the appropriate action is.</p>
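
  <p>The two-step choice described above (import precedence first, then
  NameTest priority) can be sketched with a comparator. The
  <code>RuleChoice</code> and <code>Rule</code> names and fields are
  hypothetical. The sketch does not detect or report the ambiguous case;
  it returns whichever of the equally ranked rules the stream reduction
  yields.</p>

```java
import java.util.Comparator;
import java.util.List;

// Sketch only: RuleChoice and Rule are hypothetical illustration types.
final class RuleChoice {

    static final class Rule {
        final int importPrecedence;   // higher value wins
        final double priority;        // 0, -0.25 or -0.5 by NameTest form
        final boolean strip;          // true = strip, false = preserve

        Rule(int importPrecedence, double priority, boolean strip) {
            this.importPrecedence = importPrecedence;
            this.priority = priority;
            this.strip = strip;
        }
    }

    // Orders rules by import precedence, then by NameTest priority.
    static final Comparator<Rule> BY_STRENGTH =
        Comparator.comparingInt((Rule r) -> r.importPrecedence)
                  .thenComparingDouble(r -> r.priority);

    // Returns the strongest matching rule, or null if nothing matches
    // (in which case the default is to preserve the whitespace node).
    static Rule choose(List<Rule> matching) {
        return matching.stream().max(BY_STRENGTH).orElse(null);
    }
}
```

  <p>For example, a <code>*</code> rule from an importing stylesheet outranks a
  QName rule from an imported one, because import precedence is compared
  before priority.</p>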

  <p>Despite all this complexity, the normal usage of these elements is quite
  simple: either preserve all whitespace nodes but one type:</p>

<source>&lt;xsl:strip-space elements="foo"/&gt;</source>

<p>or strip all whitespace nodes but one type:</p>

<source>
&lt;xsl:strip-space elements="*"/&gt;
&lt;xsl:preserve-space elements="foo"/&gt;</source>

</s2><anchor name="strip"/>
<s2 title="Stripping nodes">

  <p>The ultimate goal of our design is to completely screen all stripped
  nodes from the translet: either to physically remove them from the DOM or to
  make it appear as if they are not there. The first approach will cause
  problems in cases where multiple translets access the same DOM. In the future
  we wish to let translets run within servlets / JSPs with a common DOM cache.
  This DOM cache will keep copies of DOMs in memory to prevent the same XML
  file from being downloaded and parsed several times. This is the scenario
  illustrated in Figure 1:</p>

<p><img src="DOMInterface.gif" alt="DOMInterface.gif"/></p>
<p><ref>Figure 1: Multiple translets accessing a common pool of DOMs</ref></p>

  <p>The three translets running on this host access a common pool of 4 DOMs.
  The DOMs are accessed through a common DOM interface. Translets accessing
  a single DOM will have a DOMAdapter and a single DOMImpl object behind this
  interface, while translets accessing several DOMs will be given a MultiDOM
  and a set of DOMImpl objects.</p>

  <p>The translet to the left may want to strip some nodes from the shared DOM
  in the cache, while the other translets may want to preserve all whitespace
  nodes. Our initial thought then is to keep the DOM as it is and somehow
  screen the left-hand translet from all the whitespace nodes it does not want
  to process. There are a few ways in which we can accomplish this:</p>

  <ol>
    <li>
      The translet can, prior to starting to traverse the DOM, send a reference
      to the tables containing information on which nodes we want stripped to
      the DOM interface. The DOM interface is then responsible for hiding all
      stripped whitespace nodes from the iterators and the translet. A problem
      with this approach is that we want to omit the DOM interface layer if
      the translet is only accessing a single DOM. The DOM interface layer will
      only be instantiated by the translet if the stylesheet contains a call
      to the <code>document()</code> function.<br/><br/>
    </li>
    <li>
      The translet can provide its iterators with information on which nodes it
      does not want to see. The translet is still shielded from unwanted
      whitespace nodes, but it has the hassle of passing extra information over
      to most iterators it ever instantiates. Note that not all iterators need
      to be aware of whitespace nodes in this case. If you have a look at
      the figure again you will see that only the first-level iterator (that is,
      the one closest to the DOM or DOM interface) will have to strip off
      whitespace nodes. However, there may be several iterators that operate
      directly on the DOM (invoked by code handling XSL functions such as
      <code>count()</code>), and every single one of those will need to be told
      which whitespace nodes the translet does not want to see.<br/><br/>
    </li>
    <li>
      The third approach takes advantage of the fact that not all
      translets will want to strip whitespace nodes. The most effective way of
      removing unwanted whitespace nodes is to do it once and for all, before
      the actual traversal of the DOM starts. This can be done by making a
      clone of the DOM with exclusive-access rights for this translet only. We
      still gain performance from the cache because we do not have to pay the
      cost of the delay caused by downloading and parsing the XML source file.
      The cost we have to pay is the time needed for the actual cloning and the
      extra memory we use.<br/><br/>
      Normally one would imagine that the translet (or the wrapper class that
      invokes the translet) calls the DOM cache with just a URL and receives
      a reference to an instantiated DOM. The cache will either have built
      this DOM on demand or just passed back a reference to an existing tree.
      In this case the DOM cache would need an extra call that a translet would
      use to clone a DOM, passing the existing DOM reference to the cache and
      receiving a new reference to the cloned DOM. The translet can then do
      whatever it wants with this DOM (the cache need not even keep a reference
      to it).
    </li>
  </ol>

  <p>We are lucky enough to be able to combine the first two approaches. All
  iterators that directly access the DOM (axis iterators) are instantiated by
  calls to the DOM interface layer (the DOM class). The actual iterators are
  created in the DOM implementation layer (the DOMImpl class). So, we can pass
  references to the preserve/strip whitespace tables to the DOM, and the DOM
  will make sure that all axis iterators return node sets with respect to these
  tables.</p>

</s2><anchor name="filter"/>
<s2 title="Filtering whitespace nodes">

  <p>For each axis iterator and for <code>DOM.makeStringValue()</code> and
  <code>DOM.stringValueAux()</code> we must apply a filter for eliminating all
  unwanted whitespace nodes. To achieve this we must build a very efficient
  predicate for determining if the current node should be stripped or not. This
  predicate is built by <code>Whitespace.compilePredicate()</code>. This method is
  static and builds a predicate for a vector of WhitespaceRule objects. (The
  WhitespaceRule class is defined within the Whitespace class.) Each
  WhitespaceRule object contains information for a single element listed
  in an <code>&lt;xsl:strip/preserve-space&gt;</code> element, and is broken down
  into the following information:</p>

  <ul>
    <li>the namespace (can be the default namespace)</li>
    <li>the element name or "<code>*</code>"</li>
    <li>the type of rule; NS:EL, NS:<code>*</code> or <code>*</code></li>
    <li>the priority of the rule (based on import precedence and type)</li>
    <li>the action; either strip or preserve</li>
  </ul>

  <p>The Vector of WhitespaceRules is arranged in order of priority and
  redundant rules are removed. A predicate method is then compiled into the
  translet as:</p>

<source>
    public boolean stripSpace(int node);
</source>

<p>Unfortunately this method cannot be declared static.</p>

  <p>When the Stylesheet object compiles the <code>topLevel()</code> method of the
  translet it checks for the existence of the <code>stripSpace()</code> method. If
  this method exists the <code>topLevel()</code> method will be compiled to pass
  the translet to the DOM as a StripWhitespaceFilter (the translet implements
  this interface when the <code>stripSpace()</code> method is compiled).</p>

  <p>All axis iterators and the <code>DOM.makeStringValue()</code> and
  <code>DOM.stringValueAux()</code> methods check for the existence of this filter
  (it is kept in a global variable in the DOM implementation class) and take
  the appropriate actions. The methods in the DOM for returning axis iterators
  will place a StrippingIterator on top of the axis iterator if the filter is
  present, and the two methods just mentioned will return empty strings for
  whitespace nodes that should be stripped.</p>
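
  <p>The idea of placing a stripping iterator on top of an axis iterator can be
  sketched as follows. This is an illustration under assumptions: it uses
  <code>java.util.Iterator</code> rather than the XSLTC iterator interfaces,
  and the names <code>StripFilterSketch</code> and
  <code>StrippingIteratorSketch</code> are hypothetical stand-ins for
  StripWhitespaceFilter and StrippingIterator.</p>

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch only: stand-in for the filter interface implemented by the translet.
interface StripFilterSketch {
    boolean stripSpace(int node);   // true if the node must be hidden
}

// Sketch only: wraps a node iterator and skips nodes the filter strips.
final class StrippingIteratorSketch implements Iterator<Integer> {
    private final Iterator<Integer> source;
    private final StripFilterSketch filter;
    private Integer next;           // next node to return, or null if exhausted

    StrippingIteratorSketch(Iterator<Integer> source, StripFilterSketch filter) {
        this.source = source;
        this.filter = filter;
        advance();
    }

    // Moves to the next node that the filter does not strip.
    private void advance() {
        next = null;
        while (source.hasNext()) {
            Integer candidate = source.next();
            if (!filter.stripSpace(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public Integer next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        Integer result = next;
        advance();
        return result;
    }
}
```

  <p>Because the wrapper sits directly on the underlying iterator, any
  iterators layered above it never see the stripped nodes.</p>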