<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<title>wvWare 2 Design Document</title>
<link rel="stylesheet" href="Steely" type="text/css" />
<h1>wvWare 2 Design Document</h1>
<li>The filter should depend on as few libraries as possible. Right now it depends
on <b>libgsf-1</b>, which in turn depends on <b>glib-2</b> and <b>libxml2</b>.
Additionally, wv2 needs a working <b>iconv</b> installation (which is present on
almost all systems).<br/>
Apart from that we need a working C++ compiler (e.g. gcc-2.95.2 or newer; egcs-2.91.66
seems to compile it, too, but I didn't test whether the result actually works).
Developers might also need Perl 5 or newer (to regenerate the code) and
reasonably new versions of automake, autoconf, and libtool. If you have
Doxygen installed you can generate the API documentation yourself, using the
doxyfile in wv2/doc.<br/>
To use the regression test script you need a working Python installation.
<li>We try to make the filter as portable as possible (without major pain).
It has been tested on various architectures like x86, PPC, Alpha, and Sparc
and on several operating systems. The library compiles without any problems
with recent gcc versions, with Intel's C++ compiler, and with MS Visual C++ 7.
Unfortunately I wasn't able to test it with other compilers, but any modern
compiler with a working STL should do.
<li>To allow different use cases (like importing a file into some word processor
or just outputting a bit of plain text for a preview) we try to provide a
layered API (low level/high complexity vs. high level). Right now we concentrate
on the low level API; later on we'll add the high level interfaces.
<li>Although the Word file format is crazy and broken, we still aim to have
readable and sane code. The API should be easy to use and well documented (using
Doxygen tags). We also fix the buggy SPEC whenever we detect bugs or ambiguities.
If we are not sure about our changes and fixes of the SPEC we don't change the
HTML file, but add an entry to src/generator/spec_defects.
<li>The "old" WinWord filter of KOffice reads the information "just in time"
which makes the code a bit crowded and hard to follow. wv2 reads the important
information like stylesheets, list information,... <i>before</i> we start to process
the main document body. This way the code should get a bit simpler (but we'll
use a bit more memory while parsing the document).
<li>We assume the filter is used as an in-process component and the API is designed
for this purpose. In order to make it possible to dlopen/dlclose the filter
we try not to use lots of static objects (as there are problems with destructing
them on unloading). Please inform us if you experience problems in that area;
it's very likely that we can fix them.
<li>For all important components of this library self-checking unit and function tests
are recommended. It's also recommended to check the code using Julian Seward's tool
<a href="http://developer.kde.org/~sewardj/">valgrind</a>. Additionally the library
comes with an automatic regression test utility in the <code>tests</code> directory.
<li>We aim to reuse code and share the parser/utility code between parsers for different
Word versions. This goal is mainly a design task and requires a lot of knowledge
<li>The library is <b>not</b> designed to work in a multi-threaded environment. If you
need non-blocking filtering you will have to use a separate process (note that
I'm no threading expert, but I definitely know that a thread safe library doesn't just
<h2>Directory Structure</h2>
<p>Not a huge package, but in case you get lost:</p>
<td><b>Contents</b></td>
<td>Holds some build system stuff and general build information.</td>
<td>Here we keep some information for developers and a Doxygen file to generate
the API documentation.
<td>Contains 99% of the sources. As we don't want to have a build-time
dependency on Perl we also added the generated code to the CVS tree.
<td>wv2/src/generator</td>
<td>Two Perl scripts, some template files, and the available file format specification
for Word 8 and Word 6. This stuff generates the scanner code. Once you have finished reading
this document you might want to check out the file format spec in this directory.
<td>Mainly self checking unit tests and function tests for the library. Use "make check" to build them.
<h2>Design Overview</h2>
<p>Viewed from far, far away the filter structure looks roughly like this:</p>
<p style="text-align:center"><img src="arch.png" alt="Architecture" width="884" height="390" /></p>
<p>A Word document consists of a number of streams, embedded in one file. This file-system-in-a-file
is called OLE structured storage. We're using libgsf to get hold of the real data. The filter
itself consists of some central "intelligence" to perform the needed steps to parse the document
and some utility classes to support that task. During the parsing process we send the gathered information
to the consumer, the program loading the Word file (on the right). This program has to process the
delivered information and either assemble a native file or stream the contents directly to the application.
<p>The interface to the documents is a C++ wrapper around the libgsf library. libgsf allows us --
among many other things -- to read and write OLE streams from and to the document file.
It would be rather inconvenient to use it directly, so we created a class representing the
whole document (<code>OLEStorage</code>), and two classes for reading and writing a single
stream (<code>OLEStreamReader</code> and <code>OLEStreamWriter</code>).
<p><code>OLEStorage</code> holds the state of the document and allows you to travel through the
"directories." It also provides methods to create <code>OLEStreamReader</code> and
<code>OLEStreamWriter</code> objects on the document.
<p><code>OLEStream</code> is the base class for <code>OLEStreamReader</code> and
<code>OLEStreamWriter</code>, providing the common functionality, like seeking in the stream
and pushing and popping the current cursor position.<br/>
The <code>OLEStreamReader/Writer</code> classes provide a stream-based API, although we don't
use the stream operators (<code>operator&lt;&lt;</code> and <code>operator&gt;&gt;</code>). Using the stream operators
would be very inconvenient, as we would often have to specify the exact type we want to read
or write to/from a variable of a different type.
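<p>The idea of typed read methods plus a saved cursor position can be sketched as follows. This is
only an illustration: the class <code>ByteStreamReaderSketch</code> and all of its method names are
hypothetical and are not the actual <code>OLEStreamReader</code> interface. It reads from an in-memory
buffer instead of libgsf and assumes little endian data.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stack>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stream reader; the names are illustrative
// and are not the actual OLEStreamReader interface.
class ByteStreamReaderSketch {
public:
    explicit ByteStreamReaderSketch(std::vector<std::uint8_t> data)
        : m_data(std::move(data)), m_pos(0) {}

    // Each read method names the on-disk width, so the destination
    // variable may have any wider type without casts at the call site.
    // Reading past the end yields 0 here; a real reader would report it.
    std::uint8_t readU8() {
        return m_pos < m_data.size() ? m_data[m_pos++] : 0;
    }
    std::uint16_t readU16() {  // assembled byte by byte, little endian
        const std::uint16_t lo = readU8();
        const std::uint16_t hi = readU8();
        return static_cast<std::uint16_t>(lo | (hi << 8));
    }
    std::uint32_t readU32() {
        const std::uint32_t lo = readU16();
        const std::uint32_t hi = readU16();
        return lo | (hi << 16);
    }

    std::size_t tell() const { return m_pos; }
    bool seek(std::size_t pos) {
        if (pos > m_data.size()) return false;
        m_pos = pos;
        return true;
    }

    // Remember the cursor, wander off, and come back later.
    void push() { m_saved.push(m_pos); }
    void pop() {
        assert(!m_saved.empty());
        m_pos = m_saved.top();
        m_saved.pop();
    }

private:
    std::vector<std::uint8_t> m_data;
    std::size_t m_pos;
    std::stack<std::size_t> m_saved;
};
```

<p>With such an interface a call like <code>int value = reader.readU16();</code> states the width
being read at the call site, which is the convenience the stream operators would not give us.</p>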
<p>This part of the code contained in the ole* files is generally straightforward, but as libgsf is a lot
stricter than libole2 some of the functionality is gone (e.g. you can't browse the contents of a
directory in a file you write out, you can't open an OLE storage for reading and writing,...).
<p>The external API for the users of the library should consist of at least two, but maybe more,
layers, ranging from a low level and fine grained API where lots of work is needed on the
consumer side (with the benefit of high flexibility and enormous amounts of information) to a
very high level API, basically returning enriched text, at the cost of flexibility.
<p>Another main task of that API is to hide differences between Word versions if that's feasible.
In any case even the low level layer of the API shouldn't expose too much of the ugliness of
Word documents. For the time being we chose to make every document look like it's a Word 8
(aka Word 97) one to the consumer. For Word 6 or newer this seems to work, and I think it's
possible to do the same for older Word versions. In the unlikely case that Microsoft releases
a more recent file format specification (e.g. the specification for Word 2002) we should
think about "updating" the API, to provide as much information as possible to the consumer.
<p>Technically the API is a mixture of a good old "Hollywood Principle" API
(<a href="http://www.research.ibm.com/designpatterns/pubs/ph-feb96.txt">Don't call us; we'll call you</a>)
and a fancy <a href="http://www.gotw.ca/gotw/083.htm">functor-based approach</a>. The Hollywood part of the
API can be found in the handler.h file, where it's split across several smaller interfaces. We are incrementally
adding/moving/removing functionality there, so please don't expect that API to be stable yet.
<p>The main reason to choose this approach is that the very common callbacks like <code>TextHandler::runOfText</code>
are as lightweight as possible. More complex callbacks like <code>TextHandler::headersFound</code> allow a good
deal of flexibility in parsing, as the consumer decides <i>when</i> to parse e.g. the header (also known as stored
command). This helps to avoid nasty hacks if the concepts of the destination file format differ from the
MS Word ones. The consumer just stores the functor objects and executes them whenever it feels like. For
an example please refer to the KOffice MS Word filter in <code>koffice/filters/kword/msword</code>.
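<p>A rough sketch of how the two callback styles can work together is shown below. Everything in it is
an assumption made for illustration: the class names, the signatures, and the use of
<code>std::function</code> as a stand-in for the library's own functor objects. The real interfaces
live in handler.h and may look quite different.</p>

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical handler interface; std::function stands in for the
// library's own functor objects.
class TextHandlerSketch {
public:
    virtual ~TextHandlerSketch() {}
    // Very common callback: a plain virtual call, nothing else.
    virtual void runOfText(const std::string& text) = 0;
    // Rare callback: hands over a "stored command" that parses the header.
    virtual void headersFound(const std::function<void()>& parseHeader) = 0;
};

// A consumer whose target format wants headers after the body text.
class DeferringConsumerSketch : public TextHandlerSketch {
public:
    void runOfText(const std::string& text) override { output += text; }
    void headersFound(const std::function<void()>& parseHeader) override {
        assert(static_cast<bool>(parseHeader));  // an empty command is a bug
        m_pending.push_back(parseHeader);        // store now, execute later
    }
    void runPending() {
        // Move the list aside first in case a command registers new ones.
        std::vector<std::function<void()> > pending;
        pending.swap(m_pending);
        for (const auto& command : pending) command();
    }
    std::string output;

private:
    std::vector<std::function<void()> > m_pending;
};

// Toy "parser": announces the header first, then delivers the body.
inline void parseDocumentSketch(TextHandlerSketch& handler) {
    handler.headersFound([&handler]() { handler.runOfText("[header]"); });
    handler.runOfText("body ");
}
```

<p>In this sketch the consumer receives the body text immediately, while the header text only shows up
once the consumer calls <code>runPending()</code>, i.e. at a time of its own choosing.</p>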
<p>This is the core part of the whole filter. This part of the code ensures that the utility classes are
used in the correct order and manages the communication between various parts of the library.
It's also quite challenging to design this part of the code. Various versions contain similar
or even identical chunks, but other parts differ a lot. The aim is to find a design which allows
us to reuse much of the parser code for several versions.
<p>Right now it seems that we found a nice mixture of plain interfaces with virtual methods and
fancy functor-like objects for more complex structures like footnote information. The advantage
of this mixture is that common operations are reasonably fast (just a virtual method call) and
yet we provide enough flexibility for the consumer to trigger the parsing of the more complex
structures itself. This means that you can easily cope with different concepts in the file formats
by delaying the parsing of, say, headers and footers till after you read all the main body text.
<p>This flexibility of course isn't free of costs, but the functor concept is pretty lightweight,
totally typesafe, and it allows us to hide parts of the parser API. I'd like to hear your opinions
<p>The main task in the parser section is to find a design which allows us to share the common code between different
file format versions. Another important task is to keep the coupling of the code reasonably low. I see a lot
of places in the specification where information from various blocks of our design is needed, and I really hate
code where every object holds 5 pointers to other objects just because it needs to query some information from
each of these objects once in its lifetime. Code like that is a pain to maintain.
<p>For the code sharing topic the current solution is a small hierarchy of <code>Parser*</code> classes like
<p style="text-align:center"><img src="parsers.png" alt="Hierarchy" width="276" height="262" /></p>
<p><code>Parser</code> is an abstract base class providing a few methods to start the parsing and so on. This
is the interface the outside world sees and uses. <code>Parser9x</code> derives from that base class and
implements the common parsing code for Word 6, Word 7, and Word 8. Whenever these versions need
different handling there are two possibilities: smaller differences are solved via a conditional expression or an
if-else construct, bigger differences are solved by a pure virtual method in <code>Parser9x</code> and
the appropriate implementation in <code>Parser97</code> and <code>Parser95</code>.<br/>
Therefore <code>Parser9x</code> does the main work. It's hard to argue that this is a normal Is-A inheritance,
but with a little bit of imagination it's pretty close.
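<p>In miniature, the two ways of handling version differences might look like the sketch below. The
class names carry a <code>Sketch</code> suffix because they are hypothetical; the stage names and the
returned strings are made up purely to make the control flow visible.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of the hierarchy; the method names are made up.
class ParserSketch {
public:
    virtual ~ParserSketch() {}
    virtual std::string parse() = 0;  // the interface the outside world uses
};

class Parser9xSketch : public ParserSketch {
public:
    std::string parse() override {
        // The sequence of stages is fixed here, in the shared class.
        std::string log = "styles;";
        // Small difference: a conditional expression is enough.
        log += m_word8 ? "text(word8);" : "text(word6);";
        // Big difference: delegate to the version-specific subclass.
        log += readVersionSpecific();
        return log;
    }

protected:
    explicit Parser9xSketch(bool word8) : m_word8(word8) {}
    virtual std::string readVersionSpecific() = 0;

private:
    bool m_word8;
};

class Parser95Sketch : public Parser9xSketch {
public:
    Parser95Sketch() : Parser9xSketch(false) {}
protected:
    std::string readVersionSpecific() override { return "fields(word6)"; }
};

class Parser97Sketch : public Parser9xSketch {
public:
    Parser97Sketch() : Parser9xSketch(true) {}
protected:
    std::string readVersionSpecific() override { return "fields(word8)"; }
};

// Callers only ever see the abstract base class.
inline std::string parseWith(ParserSketch& parser) { return parser.parse(); }

inline void parserUsageSketch() {
    Parser95Sketch p95;
    Parser97Sketch p97;
    assert(parseWith(p95) == "styles;text(word6);fields(word6)");
    assert(parseWith(p97) == "styles;text(word8);fields(word8)");
}
```

<p>The shared class owns the order of the stages, and the subclasses only fill in the parts that
really differ.</p>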
<p>The whole parsing process is divided into different stages and all this code is chopped into nice little pieces
and put into various helper/template methods. We take care to separate methods in a way that as many of them as
possible can be "bubbled up" the inheritance hierarchy right to <code>Parser9x</code> or even
<p>To keep the coupling between the blocks of the design low the parser has to implement the Mediator pattern
or something similar. It is the only block in our design containing "intelligence" in the sense
that it's the only block knowing about the sequence of parsing and the interaction of the encapsulated
components like the OLE subsystem and the stylesheet-handling utility classes.
<h4>String Classes</h4>
<p>We agreed to use Harri Porten's <code>UString</code> class from kjs, a clean implementation of
an implicitly shared UCS-2 string class (host order Unicode values). In the same file (ustring.h)
there's also a <code>CString</code> class, but we'll use <code>std::string</code> for ASCII strings.
<p>The iconv library is used to convert text stored as CP 1252 or similar to UCS-2. This is done by
the <code>Textconverter</code> class, which wraps libiconv. Some systems ship a broken/incomplete
version of libiconv (e.g. Darwin, older Solaris versions,...), so we have a configure option
<code>--with-iconv-dir</code> to specify the path of alternative iconv installations.
<p>The main classes <code>UString</code> and <code>std::string</code> are well tested and known to work well.
Take a lot of care when using <code>UString::ascii</code>, though. The buffer for the ASCII
string is shared among all instances of <code>UString</code> (static buffer)! As we need that method for
debugging only this is no problem. <code>UString</code> is implicitly shared, so copying strings is rather
cheap as long as you don't modify them (copy on write semantics).
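<p>The static buffer pitfall is easy to reproduce in isolation. The class below is a hypothetical
imitation written for this purpose; it is not <code>UString</code> and only assumes that the
<code>ascii()</code> result points into one buffer shared by all instances, as described above.</p>

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical class that mimics the hazard; it is not the real UString.
class NameSketch {
public:
    explicit NameSketch(const std::string& text) : m_text(text) {}

    // Returns a pointer into ONE buffer shared by all instances.
    const char* ascii() const {
        static char buffer[64];
        std::strncpy(buffer, m_text.c_str(), sizeof(buffer) - 1);
        buffer[sizeof(buffer) - 1] = '\0';
        return buffer;
    }

private:
    std::string m_text;
};

// Returns true if the second call clobbered the result of the first.
inline bool secondCallOverwritesFirst() {
    const NameSketch first("first"), second("second");
    const char* a = first.ascii();
    const std::string safeCopy(a);   // copy immediately if the text must survive
    const char* b = second.ascii();  // overwrites the shared buffer
    assert(safeCopy == "first");
    return a == b && std::strcmp(a, "second") == 0;
}
```

<p>So a debug statement that passes two <code>ascii()</code> results to the same call would show the
same text twice, unless one of them is copied first.</p>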
<p>Older Word versions don't store the text as Unicode strings but encoded using some codepage like CP 1252.
libiconv helps us to convert all these encodings to UCS-2 (loosely speaking, 16-bit Unicode). We don't use libiconv
directly from within the library, but we use a small wrapper class (<code>Textconverter</code>) for convenience.
<h4>Utility Classes</h4>
<p>To reduce the complexity of the code we try to write small entities designed to do one specific,
encapsulated task (e.g. all the code in styles.cpp is used to read the stylesheet information contained in
every Word file, lists.cpp takes care of -- surprise -- lists,...). These classes are, IMHO, the key to
clean code. Classes for the programming infrastructure like the <code>SharedPtr</code> class also belong
to this category.<br/>
We use a certain naming scheme to distinguish code which works for all versions (at least
Word 6 and newer) from code which works just for one specific category. All the *97.(cpp|h) files are designed
to work with Word 8 or newer, files without such a number should work with all versions (note
that there are some exceptions to that rule, e.g. <code>Properties97</code>, as I was too lazy
to mess around with the files in CVS, losing the history).
<p>This part of the code also consists of a number of templates to handle the different ways
arrays and more complex structures are stored in a Word file (e.g. the meta structures PLF, PLCF,
and FKP). If that sounds like Greek to you it's probably a good idea to read the Definitions
section at the top of the file format specification in wv2/src/generator.
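<p>To give a feeling for what such a template has to model, here is a small sketch of a PLCF as I
understand the format: n + 1 four-byte positions followed by n items of equal size. The names
<code>PlcfSketch</code> and <code>plcfItemCount</code> are hypothetical and are not the templates
used in the library; please treat the Definitions section of the specification as the authority.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory form of a PLCF: n + 1 positions delimit n ranges,
// and each range carries one item of type T.
template <typename T>
struct PlcfSketch {
    std::vector<std::uint32_t> positions;  // n + 1 entries, 4 bytes each in the file
    std::vector<T> items;                  // n entries

    bool isConsistent() const { return positions.size() == items.size() + 1; }

    // Index of the item whose range [positions[i], positions[i + 1]) holds cp,
    // or -1 if cp lies outside the PLCF.
    long find(std::uint32_t cp) const {
        if (!isConsistent()) return -1;
        for (std::size_t i = 0; i < items.size(); ++i) {
            if (cp >= positions[i] && cp < positions[i + 1])
                return static_cast<long>(i);
        }
        return -1;
    }
};

// Number of items in a PLCF occupying byteSize bytes in the file, given the
// in-file size of one item: byteSize = 4 * (n + 1) + itemSize * n.
inline std::size_t plcfItemCount(std::size_t byteSize, std::size_t itemSize) {
    return byteSize < 4 ? 0 : (byteSize - 4) / (4 + itemSize);
}

inline void plcfUsageSketch() {
    PlcfSketch<char> plcf;
    plcf.positions = {0, 10, 25, 40};
    plcf.items = {'a', 'b', 'c'};
    assert(plcf.isConsistent());
    assert(plcf.find(12) == 1);   // 12 lies in [10, 25)
    assert(plcf.find(40) == -1);  // the last position is an exclusive end
}
```

<p>Because the item type is a template parameter, the same lookup code serves every structure that
happens to be stored in this layout.</p>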
<h4>Generated Scanner Code</h4>
<p>It's a tedious job to implement the most basic part of the filter -- reading and writing the
structures in Word documents. It is boring, repetitious, and error prone, so we decided to <i>generate</i>
this ultra-low level code. We're using two Perl scripts and the available HTML specification
for Word 8 and Word 6. One script called <code>generate.pl</code> is used to scan the HTML file
and output the reading/writing code and some test files. The other script, <code>convert.pl</code>,
generates code to convert Word 6 to Word 8 structures. We need to do this because we want to
present the files as Word 8 files to the outside world. The idea behind that is to hide all the
subtle differences between the formats from the user of this library. For Word 6 this seems to
be possible; I have no idea whether that will work out for older formats.
<p>The generated code mentioned above consists of several thousand lines of code. The design of this
code is non-existent; it's just a number of structures supporting reading, writing, copying, assignment,
and so on. Some of the structures are only partly generated (like the <code>apply()</code> method of the main
property structures like <code>PAP</code>, <code>CHP</code>, <code>SEP</code>, and others). Some structures are
commented out, as it would be too hard to generate them. These few structures have to be written manually if
<p>Generally we just parse the specification to get the information out, but sometimes we need a few hints from the
programmer to know what to do. These hints are mostly given by adding special comments to the HTML specification.
For further information on these hints, and on the available tricks, please have a look at the top of the Perl
scripts. The comments are quite detailed and it should be easy to figure out what I intend to do with the hints.
<p>Another way to influence the generated code is to manipulate certain parts of the script itself. You need to do
that to change the order of the structures in the file, disable a structure completely and so on. You can also
select structures to derive from the <code>Shared</code> class to be able to use the structure with the
<code>SharedPtr</code> class.
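<p>For readers unfamiliar with this pattern, the sketch below shows one common way such a pair of
classes can work, namely intrusive reference counting. It is an assumption that the library follows
this scheme; all names carry a <code>Sketch</code> suffix to mark them as hypothetical, and the real
<code>Shared</code> and <code>SharedPtr</code> classes may differ in detail.</p>

```cpp
#include <cassert>

// Hypothetical intrusive reference counting, not the library's actual classes.
class SharedSketch {
public:
    SharedSketch() : m_count(0) {}
    SharedSketch(const SharedSketch&) : m_count(0) {}               // a copy starts unreferenced
    SharedSketch& operator=(const SharedSketch&) { return *this; }  // the count is never copied
    void ref() { ++m_count; }
    bool deref() {  // returns true when the last reference is gone
        assert(m_count > 0);
        return --m_count == 0;
    }
    int count() const { return m_count; }

protected:
    ~SharedSketch() {}

private:
    int m_count;
};

template <typename T>
class SharedPtrSketch {
public:
    explicit SharedPtrSketch(T* ptr = nullptr) : m_ptr(ptr) {
        if (m_ptr) m_ptr->ref();
    }
    SharedPtrSketch(const SharedPtrSketch& other) : m_ptr(other.m_ptr) {
        if (m_ptr) m_ptr->ref();
    }
    ~SharedPtrSketch() { release(); }
    SharedPtrSketch& operator=(const SharedPtrSketch& other) {
        T* incoming = other.m_ptr;
        if (incoming) incoming->ref();  // ref first: safe for self-assignment
        release();
        m_ptr = incoming;
        return *this;
    }
    T* operator->() const { return m_ptr; }
    T* get() const { return m_ptr; }

private:
    void release() {
        if (m_ptr && m_ptr->deref()) delete m_ptr;
        m_ptr = nullptr;
    }
    T* m_ptr;
};

// A generated structure would opt in by deriving from the shared base.
struct ParagraphPropsSketch : public SharedSketch {
    ParagraphPropsSketch() : indent(0) {}
    int indent;
};
```

<p>The counter lives inside the object itself, which is why a structure has to derive from the
shared base class before it can be handed to the smart pointer.</p>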
<p>The whole file might need some minor tweaking, a license, <code>#includes</code>, and maybe even some declarations
or code. This is what the template files in wv2/src/generator are for -- the code gets copied verbatim into
the generated file. Never manipulate a generated file; all your changes will be lost when the code is regenerated!
<p>If you think you found a bug in the specification you can try to correct the HTML file and regenerate the scanner
code using the command <code>make generated</code>. In case you aren't satisfied with the resulting C++ code, or
if you found a bug in the scripts please contact me. If you aren't scared by a bit of Perl code feel free to fix
or extend the code yourself.
<p>Please note that using the C++ <code>sizeof()</code> operator on these structures is dangerous. You should never
rely on their memory layout. The reason for that is that the structures in the Word file are
"packed", which means there are no padding and alignment bytes between variables. In our generated code
we can't achieve that in a portable manner, so we decided not to use packing at all. Due to that, reading the whole
structure in at once doesn't even work on little endian platforms, let alone big endian machines. The solution is
to use the generated <code>read()</code> methods. In case you need to know the in-file size of a Word structure, you can
add a <code>sizeOf</code> variable in the HTML spec (please check the code generation script for more information).
It should be obvious that casting memory chunks from a Word file to structures or casting among different structures
is also a bad idea. If you really want to create a certain structure from some memory block, please add a
<code>readPtr</code> special-comment in the HTML spec.
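<p>The difference between the in-memory and the in-file size can be illustrated with a made-up
three-field record. <code>RecordSketch</code> is hypothetical and does not correspond to any
structure in the specification; its <code>read()</code> method and <code>sizeOf</code> constant only
imitate the idea of the generated code, and little endian file data is assumed.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record; it does not correspond to a real Word structure.
struct RecordSketch {
    std::uint8_t flags;    // 1 byte in the file
    std::uint32_t offset;  // 4 bytes in the file
    std::uint16_t count;   // 2 bytes in the file

    // In-file size: 1 + 4 + 2. sizeof(RecordSketch) is usually larger
    // because the compiler may insert padding.
    static constexpr std::size_t sizeOf = 7;

    // Reads field by field, so neither padding nor host byte order matters.
    bool read(const std::uint8_t* p, std::size_t available) {
        if (p == nullptr || available < sizeOf) return false;
        flags = p[0];
        offset = static_cast<std::uint32_t>(p[1])
               | (static_cast<std::uint32_t>(p[2]) << 8)
               | (static_cast<std::uint32_t>(p[3]) << 16)
               | (static_cast<std::uint32_t>(p[4]) << 24);
        count = static_cast<std::uint16_t>(p[5] | (p[6] << 8));
        return true;
    }
};

static_assert(sizeof(RecordSketch) >= RecordSketch::sizeOf,
              "the in-memory layout is never smaller than the packed one");

inline void recordUsageSketch() {
    const std::uint8_t raw[7] = {0x01, 0x78, 0x56, 0x34, 0x12, 0x02, 0x00};
    RecordSketch record;
    const bool ok = record.read(raw, sizeof(raw));
    assert(ok);
    (void)ok;
    assert(record.flags == 1);
    assert(record.offset == 0x12345678u);
    assert(record.count == 2);
}
```

<p>Copying <code>sizeof(RecordSketch)</code> bytes straight from the file into the structure would
read too many bytes on most compilers and put them into the wrong fields.</p>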
<p>A vital part of the whole library is its set of self-checking unit and function tests, which help to avoid introducing
hard-to-find bugs while implementing new features. The goal is to test the major components, but
it's close to impossible to test everything. Please run the unit tests before you commit bigger
changes to see if something breaks. If you find out that some test is broken on your platform
please send me the whole output, some platform information, and the document you used for testing.
<p>It's a bit hard to test the proper parsing of a file; the best thing I came up with is a kind of
record and playback approach. The Python script <code>regression</code> can be used to compare the
filter output with some previously recorded output. This tool should be run with the <code>-r</code>
option before you make any major changes. The created files are quite a detailed recording of the parsing
process. After the changes are implemented you re-run the script without the <code>-r</code> option.
If the result differs you might want to check whether the difference is intended.
<p>Code-wise there's not much to say about the unit tests. If you add new code please also add a test for it,
or at least tell me to do so. The header test.h contains a trivial test method and a method to convert
integers to strings (as <code>std::string</code> doesn't have such functionality).
<p>If you decide to create a unit test please ensure that it's self checking. That means if it runs till the end
everything is alright. If it stops somewhere in between something unexpected happened. Oh, and let me repeat
the warning that <code>UString::ascii()</code> might produce unexpected results due to the static buffer.
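<p>What "self checking" means in practice can be sketched with two tiny helpers. They are written in
the spirit of what test.h is described to contain, but the names <code>check</code> and
<code>int2string</code> and their exact behaviour are assumptions, not a copy of that header.</p>

```cpp
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helpers in the spirit of test.h; the real names may differ.

// Stops the program at the first failed expectation, so a test that
// runs to the end has passed by construction.
inline void check(bool passed, const std::string& what) {
    if (!passed) {
        std::cerr << "Failed: " << what << std::endl;
        std::exit(1);
    }
    std::cout << "Passed: " << what << std::endl;
}

// Integer-to-string conversion for building test messages.
inline std::string int2string(int value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

<p>A test would then consist of a sequence of calls such as
<code>check(result == expected, "case " + int2string(i));</code>, and anyone running it only has to
look at whether it reached the end.</p>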
<h2>Pending Design Issues</h2>
<p>Currently the filter is in a pretty usable state: it is able to read the text including properties and styles,
and it handles fonts, lists, headers/footers, footnotes and endnotes, sections, fields (to some extent; it's close
to impossible to do anything useful without knowing the target application), and tables.
This functionality is tested for Word 97, but I'm lacking test documents for Word 6 and Word 95. In theory most
of the mentioned features should work there too, but I doubt that lists work without any problems.
<p>This section of the design document lists my plans for features I'd like to implement next and some ideas about
<h4>Images and Graphic Objects</h4>
<p>Embedded images and graphic objects are a hard topic. According to Shaheed there are approximately 9 different
ways to have images embedded in a Word file, and the documentation is very brief. In newer Office versions
(anything from Office 97 on) Microsoft decided to share the graphics embedding code among Word, Excel, and
PowerPoint. This project is called <i>Escher</i> and some documentation can be found
<a href="./escher/escher.html">here</a>. Older Office versions are known to embed bitmaps directly in the
files, e.g. stored as .dib or .tiff images, or as .wmf drawings.
<p>Apart from raster images it's also possible to embed drawing objects (lines, rectangles,...) in a Word file.
These can be stored in an Escher container, or directly in the Word file (in older files). Due to OLE it's
also possible to embed e.g. AutoCAD drawings in a Word file, but I didn't check how that's done yet. For
Far East versions of Word it seems to be possible to have a drawing grid for Far East characters, but I
have no idea how that works as I have never seen a Far East version of Word, nor do I speak any Far East language.
<p>One thing that seems to be common among all the embedded images and drawing objects (regardless of the Word
version) is that they are anchored using a special character (<code>SPEC_PICTURE = 1</code>) and of course
the <code>fSpec</code> flag is set. For this character it should be possible to find and construct the PICF.
<p>For Word 8 the important structures seem to be PICF and METAFILEPICT, the rest should be embedded in
Escher containers. For Word 6 we have the PICF, METAFILEPICT, DO and DP* (for the drawing primitives).
<p>Finally some questions that still make my head ache, from a design point of view:
<li>What to do with embedded documents (like Excel documents)?</li>
<li>How to handle embedded image/clipart/wmf/... files?</li>
<li>How can we handle bugs in the SPEC as effectively as possible? It shouldn't be necessary that
two programmers lose their hair on the same bug...
<p>Please send comments, corrections, condolences, patches, and suggestions to
<a href="mailto:trobin@kde.org">Werner Trobin</a>. Thanks in advance. If you really read this
document all the way to here I owe you a beverage of your choice next time we meet :-)</p>