=====================================================================
Java Programs for Machine Learning
Copyright (C) 1998-2007 University of Waikato
web: http://www.cs.waikato.ac.nz/~ml/weka
=====================================================================

1. Using one of the graphical user interfaces in Weka
2. The Weka data format (ARFF)
3. Using Weka from the command line
5. The Experiment package (and tutorial)
7. KnowledgeFlow tutorial
8. Bayesian Network Classifiers
11. Submission of code and bug reports

----------------------------------------------------------------------

1. Using one of the graphical user interfaces in Weka:
------------------------------------------------------

This assumes that the Weka archive that you have downloaded has been
extracted into a directory containing this README and that you haven't
used an automatic installer (e.g. the one provided for Windows).

Weka 3.5 requires Java 1.5 or higher. Depending on your platform you
may be able to just double-click on the weka.jar icon to run the
graphical user interfaces for Weka. Otherwise, from a command-line
(assuming you are in the directory containing weka.jar), type

  java -jar weka.jar

or if you are using Windows use

  javaw -jar weka.jar

Using "-jar" overrides your CLASSPATH variable! If you need to
use classes specified in the CLASSPATH, use the following command

  java -classpath $CLASSPATH:weka.jar weka.gui.Main

or if you are using Windows use

  javaw -classpath "%CLASSPATH%;weka.jar" weka.gui.Main

This will start a graphical user interface (weka.gui.Main) from
which you can select various interfaces, like the SimpleCLI interface
or the more sophisticated Explorer, Experimenter, and Knowledge Flow
interfaces. SimpleCLI just acts like a simple command shell. The
Explorer is currently the main interface for data analysis using
Weka. The Experimenter can be used to compare the performance of
different learning algorithms across various datasets. The Knowledge
Flow provides a component-based alternative to the Explorer interface.

Example datasets that can be used with Weka are in the sub-directory
called "data", which should be located in the same directory as this
README file.

The Weka user interfaces provide extensive built-in help facilities
(tool tips, etc.). Documentation for the Explorer can be found in
ExplorerGuide.pdf (also in the same directory as this README file).

You can also start the GUIChooser from within weka.jar:

  java -classpath weka.jar:$CLASSPATH weka.gui.GUIChooser

or if you are using Windows use

  javaw -classpath "weka.jar;%CLASSPATH%" weka.gui.GUIChooser

----------------------------------------------------------------------

2. The Weka data format (ARFF):
-------------------------------

Datasets for WEKA should be formatted according to the ARFF
format. (However, there are several converters included in WEKA that
can convert other file formats to ARFF. The Weka Explorer will use
these automatically if it doesn't recognize a given file as an ARFF
file.) Examples of ARFF files can be found in the "data" subdirectory.
What follows is a short description of the file format. A more
complete description is available from the Weka web page.

A dataset has to start with a declaration of its name:

  @relation name

followed by a list of all the attributes in the dataset (including
the class attribute). These declarations have the form

  @attribute attribute_name specification

If an attribute is nominal, specification contains a list of the
possible attribute values in curly brackets:

  @attribute nominal_attribute {first_value, second_value, third_value}

If an attribute is numeric, specification is replaced by the keyword
numeric: (Integer values are treated as real numbers in WEKA.)

  @attribute numeric_attribute numeric

In addition to these two types of attributes, there also exists a
string attribute type. This attribute provides the possibility to
store a comment or ID field for each of the instances in a dataset:

  @attribute string_attribute string

After the attribute declarations, the actual data is introduced by a

  @data

tag, which is followed by a list of all the instances. The instances
are listed in comma-separated format, with a question mark
representing a missing value.

Comments are lines starting with % and are ignored by Weka.
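
Putting the pieces above together, a minimal ARFF file might look like
this (a hypothetical sketch, not one of the files shipped in the "data"
directory):

```
% Comment lines start with % and are ignored by Weka.
@relation weather_example

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute notes string
@attribute play {yes, no}

@data
sunny, 85, 'warm morning', no
overcast, 83, ?, yes
rainy, 70, 'light drizzle', yes
```

Note the question mark in the second instance, which marks a missing
value, and the quotes around string values that contain spaces.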

----------------------------------------------------------------------

3. Using Weka from the command line:
------------------------------------

If you want to use Weka from your standard command-line interface
(e.g. bash under Linux):

a) Set WEKAHOME to be the directory which contains this README.
b) Add $WEKAHOME/weka.jar to your CLASSPATH environment variable.
c) Bookmark $WEKAHOME/doc/packages.html in your web browser.
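
For bash, steps a) and b) can be sketched as follows (the install path
shown is hypothetical; substitute the directory that actually contains
this README):

```shell
# Hypothetical install location -- adjust to wherever you extracted Weka.
export WEKAHOME="$HOME/weka-3-5"
# Put weka.jar on the classpath so "java weka...." invocations work.
export CLASSPATH="$WEKAHOME/weka.jar:$CLASSPATH"
```

Adding these lines to your shell startup file (e.g. ~/.bashrc) makes
the settings permanent.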

Alternatively you can try using the SimpleCLI user interface available
from the GUI chooser discussed above.

In the following, the names of files assume use of a unix command-line
with environment variables. For other command-lines (including
SimpleCLI) you should substitute the name of the directory where
weka.jar lives for $WEKAHOME. If your platform uses a character other
than / as the path separator, also make the appropriate
substitutions. For example, try:

  java weka.classifiers.trees.J48 -t $WEKAHOME/data/iris.arff

This prints out a decision tree classifier for the iris dataset
and ten-fold cross-validation estimates of its performance. If you
don't pass any options to the classifier, WEKA will list all the
available options. Try:

  java weka.classifiers.trees.J48

The options are divided into "general" options that apply to most
classification schemes in WEKA, and scheme-specific options that only
apply to the current scheme---in this case J48. WEKA has a common
interface to all classification methods. Any class that implements a
classifier can be used in the same way as J48 is used above. WEKA
knows that a class implements a classifier if it extends the
Classifier class in weka.classifiers. Almost all classes in
weka.classifiers fall into this category. Try, for example:

  java weka.classifiers.bayes.NaiveBayes -t $WEKAHOME/data/labor.arff

Here is a list of some of the classifiers currently implemented in
WEKA:

a) Classifiers for categorical prediction:

   weka.classifiers.lazy.IBk: k-nearest neighbour learner
   weka.classifiers.trees.J48: C4.5 decision trees
   weka.classifiers.rules.PART: rule learner
   weka.classifiers.bayes.NaiveBayes: naive Bayes with/without kernels
   weka.classifiers.rules.OneR: Holte's OneR
   weka.classifiers.functions.SMO: support vector machines
   weka.classifiers.functions.Logistic: logistic regression
   weka.classifiers.meta.AdaBoostM1: AdaBoost
   weka.classifiers.meta.LogitBoost: logit boost
   weka.classifiers.trees.DecisionStump: decision stumps (for boosting)

b) Classifiers for numeric prediction:

   weka.classifiers.functions.LinearRegression: linear regression
   weka.classifiers.trees.M5P: model trees
   weka.classifiers.rules.M5Rules: model rules
   weka.classifiers.lazy.IBk: k-nearest neighbour learner
   weka.classifiers.lazy.LWL: locally weighted learning

Besides classification schemes, there is some other useful stuff in
WEKA. Association rules, for example, can be extracted using the
Apriori algorithm. Try:

  java weka.associations.Apriori -t $WEKAHOME/data/weather.nominal.arff

There are also a number of tools that allow you to manipulate a
dataset. These tools are called filters in WEKA and can be found
in the weka.filters package. Examples include:

  weka.filters.unsupervised.attribute.Discretize: discretizes numeric data
  weka.filters.unsupervised.attribute.Remove: deletes/selects attributes

For example, try:

  java weka.filters.supervised.attribute.Discretize -i $WEKAHOME/data/iris.arff -c last
----------------------------------------------------------------------

4. Database access:
-------------------
259
In terms of database connectivity, you should be able to use any
260
database with a Java JDBC driver. When using classes that access a
261
database (e.g. the Explorer), you will probably want to create a
262
properties file that specifies which JDBC drivers to use, where to
263
find the database, and specify a mapping for the data types. This file
264
should reside in your home directory or the current directory and be
265
called "DatabaseUtils.props". An example is provided in
266
weka/experiment (you need to expand weka.jar to be able to look a this
267
file). Note that the settings in this file are used unless they are
268
overidden by settings in the DatabaseUtils.props file in your home
269
directory or the current directory (in that order).
271
There are also example DatabaseUtils.props files for several common
272
databases available (also in weka/experiment):
274
* HSQLDB: DatabaseUtils.props.hsql
275
* MS SQL Server 2000: DatabaseUtils.props.mssqlserver
276
* MS SQL Server 2005 Express Edition: DatabaseUtils.props.mssqlserver2005
277
* MySQL: DatabaseUtils.props.mysql
278
* ODBC: DatabaseUtils.props.odbc
279
* Oracle: DatabaseUtils.props.oracle
280
* PostgreSQL: DatabaseUtils.props.postgresql
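
As a sketch, a DatabaseUtils.props for MySQL might start with entries
like the following (the driver class, host, and database name are
assumptions for illustration; check them against the bundled
DatabaseUtils.props.mysql template):

```
# Hypothetical example -- adapt the driver and URL to your own setup.
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://localhost:3306/weka_experiments
```

The JDBC driver class itself must also be on your CLASSPATH.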

----------------------------------------------------------------------

5. The Experiment package (and tutorial):
-----------------------------------------

There is support for running experiments that involve evaluating
classifiers on repeated randomizations of datasets, over multiple
datasets (you can do much more than this, besides). The classes for
this reside in the weka.experiment package. The basic architecture is
that a ResultProducer (which generates results on some randomization
of a dataset) sends results to a ResultListener (which is responsible
for stating whether it already has the result, and otherwise storing
it).

Example ResultListeners include:

  weka.experiment.CSVResultListener: outputs results as
    comma-separated-value files.
  weka.experiment.InstancesResultListener: converts results into a set
    of instances.
  weka.experiment.DatabaseResultListener: sends results to a database.

Example ResultProducers include:

  weka.experiment.RandomSplitResultProducer: train/test on a % split
  weka.experiment.CrossValidationResultProducer: n-fold cross-validation
  weka.experiment.AveragingResultProducer: averages results from another
    ResultProducer.
  weka.experiment.DatabaseResultProducer: acts as a cache for results,
    storing them in a database.

The RandomSplitResultProducer and CrossValidationResultProducer make
use of a SplitEvaluator to obtain actual results for a particular
split; provided are ClassifierSplitEvaluator (for nominal
classification) and RegressionSplitEvaluator (for numeric
prediction). Each of these uses a Classifier to produce the actual
results.

So, you might have a DatabaseResultListener that is sent results from
an AveragingResultProducer, which produces averages over the n results
produced for each run of an n-fold CrossValidationResultProducer,
which in turn is doing nominal classification through a
ClassifierSplitEvaluator, which uses OneR for prediction. Whew. But
you can combine these things together to do pretty much whatever you
want. You might want to write a LearningRateResultProducer that splits
a dataset into increasing numbers of training instances.

To run a simple experiment from the command line, try:

  java weka.experiment.Experiment -r -T datasets/UCI/iris.arff \
    -D weka.experiment.InstancesResultListener \
    -P weka.experiment.RandomSplitResultProducer -- \
    -W weka.experiment.ClassifierSplitEvaluator -- \
    -W weka.classifiers.rules.OneR

(Try "java weka.experiment.Experiment -h" to find out what these
options mean.)

If you have your results as a set of instances, you can perform paired
t-tests using weka.experiment.PairedTTester (use the -h option to find
out what options it needs).
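
To give a feel for the statistic involved, here is a standalone sketch
of the plain paired t statistic over per-run results from two schemes.
It uses no Weka classes, the accuracy figures are made up for
illustration, and PairedTTester itself applies further refinements
(such as significance thresholds), so treat this only as background:

```java
public class PairedT {

    // Paired t statistic: mean of the per-run differences divided by
    // the standard error of those differences (n - 1 degrees of freedom).
    public static double pairedT(double[] a, double[] b) {
        int n = a.length;
        double[] d = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) {
            d[i] = a[i] - b[i];     // difference on run i
            mean += d[i];
        }
        mean /= n;
        double var = 0;
        for (int i = 0; i < n; i++) {
            var += (d[i] - mean) * (d[i] - mean);
        }
        var /= (n - 1);             // sample variance of the differences
        return mean / Math.sqrt(var / n);
    }

    public static void main(String[] args) {
        // Made-up accuracies for two schemes over five runs.
        double[] accA = {0.95, 0.93, 0.96, 0.94, 0.95};
        double[] accB = {0.91, 0.92, 0.93, 0.90, 0.92};
        System.out.printf("t = %.3f%n", pairedT(accA, accB));
    }
}
```

A large absolute t value suggests the difference between the two
schemes is unlikely to be due to the random variation between runs.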

However, all this is much easier if you use the Experimenter GUI.
Check out the tutorial at: $WEKAHOME/ExperimenterTutorial.pdf
----------------------------------------------------------------------

6. The Explorer guide:
----------------------

A guide on how to use the WEKA Explorer is in
$WEKAHOME/ExplorerGuide.pdf. For an explanation of how to use the
other user interfaces in WEKA you might want to take a look at the
book "Data Mining" by Witten and Frank (2005) (see our web page).

----------------------------------------------------------------------

7. KnowledgeFlow Tutorial:
--------------------------

A tutorial on how to use the KnowledgeFlow is in
$WEKAHOME/KnowledgeFlowTutorial.pdf.

----------------------------------------------------------------------

8. Bayesian Network Classifiers:
--------------------------------

Information about the Bayesian Network classifiers in Weka, covering
background theory as well as usage, can be found in
$WEKAHOME/BayesianNetClassifiers.pdf.

----------------------------------------------------------------------

The source code for WEKA is in $WEKAHOME/weka-src.jar. To expand it,
use the jar utility that's in every Java distribution (or any file
archiver that can handle ZIP files).

----------------------------------------------------------------------

Refer to the web page for a list of contributors:

  http://www.cs.waikato.ac.nz/~ml/weka/

----------------------------------------------------------------------

11. Submission of code and bug reports:
---------------------------------------

If you have implemented a learning scheme, filter, application,
visualization tool, etc., using the WEKA classes, and you think it
should be included in WEKA, send us the code, and we can potentially
put it in the next WEKA distribution.

The conditions for new classifiers (schemes in general) are that,
firstly, they have to be published in the proceedings of a renowned
conference (e.g., ICML) or as an article in a respected journal (e.g.,
Machine Learning) and, secondly, that they outperform other standard
schemes (e.g., J48/C4.5).

If you find any bugs, send a bug report to the wekalist mailing list.

-----------------------------------------------------------------------

WEKA is distributed under the GNU General Public License. Please read
the license terms that accompany the distribution.

-----------------------------------------------------------------------