Support exists for downloading, parsing, and loading the English
version of Wikipedia (enwiki).
The build file can automatically try to download the most current
enwiki dataset (pages-articles.xml.bz2) from the "latest" directory,
http://download.wikimedia.org/enwiki/latest/. However, this file
doesn't always exist, depending on where Wikipedia is in the dump
process and whether prior dumps have succeeded. If this file doesn't
exist, you can sometimes find an older or in-progress version by
looking in the dated directories under
http://download.wikimedia.org/enwiki/. For example, as of this
writing, there is a page file in
http://download.wikimedia.org/enwiki/20070402/. You can download this
file manually and put it in temp. Note that the file you download will
probably have the date in the name, e.g.,
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2. When
you put it in temp, rename it to enwiki-latest-pages-articles.xml.bz2.
After that, ant enwiki should process the dataset and run a load
test. The Ant targets get-enwiki, expand-enwiki, and extract-enwiki
can also be used individually to download, decompress, and extract
(to individual files in work/enwiki) the dataset, respectively.
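The manual fallback described above can be sketched as a short shell
session. This is a sketch, not part of the build: the 20070402 date is
just the example from the text (substitute whatever dated directory
actually exists), and the network-dependent download and ant steps are
shown commented out, with a placeholder file standing in for the
downloaded archive.

```shell
# Manual fallback: fetch a dated dump and rename it so the build
# treats it as the "latest" file.
# Download step (needs network access and a live dated dump):
# wget http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2

mkdir -p temp
# Placeholder standing in for the downloaded archive (replace with
# the real file from the wget step above):
touch enwiki-20070402-pages-articles.xml.bz2
# Rename to the name the build file expects:
mv enwiki-20070402-pages-articles.xml.bz2 temp/enwiki-latest-pages-articles.xml.bz2
# Then process the dataset and run the load test:
# ant enwiki
```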