\documentclass[a4paper]{article}

% Version: $Revision: 1.3 $

\newenvironment{tight_itemize}{
\begin{itemize}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}}{\end{itemize}
}

\title{\epsfig{file=images/coat_of_arms.eps,width=10cm}\vspace{3cm}\\WEKA KnowledgeFlow Tutorial\\for Version 3-5-7}

\author{Mark Hall\\Peter Reutemann}

\setcounter{secnumdepth}{3}
\setcounter{tocdepth}{3}

\hyphenation{Generic-Object-Editor}
\hyphenation{DatabaseUtils}

\copyright 2007 University of Waikato

\section{Introduction}

The KnowledgeFlow provides an alternative to the Explorer as a
graphical front end to WEKA's core algorithms. The KnowledgeFlow is a
work in progress, so some of the functionality from the Explorer is not
yet available. On the other hand, there are things that can be done in
the KnowledgeFlow but not in the Explorer.

\epsfig{file=images/knowledgeflow.eps,height=7cm}

The KnowledgeFlow presents a \textit{data-flow} inspired interface to
WEKA. The user can select WEKA components from a toolbar, place them
on a layout canvas and connect them together to form a
\textit{knowledge flow} for processing and analyzing data. At present, all of
WEKA's classifiers, filters, clusterers, loaders and savers are
available in the KnowledgeFlow along with some extra tools.

The KnowledgeFlow can handle data either incrementally or in batches
(the Explorer handles batch data only). Of course, learning from data
incrementally requires a classifier that can be updated on an
instance-by-instance basis. Currently in WEKA there are ten classifiers that
can handle data incrementally:

\begin{itemize}
\item NaiveBayesMultinomialUpdateable
\item NaiveBayesUpdateable
\end{itemize}

\noindent Two of them are meta classifiers:

\begin{itemize}
\item \textit{RacedIncrementalLogitBoost} - can use any regression base
learner to learn from discrete class data incrementally.
\item \textit{LWL} - locally weighted learning.
\end{itemize}
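
\noindent To make the idea of updating a classifier one instance at a time
concrete, here is a minimal sketch in plain Java. It is not WEKA code: the
classes \texttt{MajorityLearner} and \texttt{IncrementalDemo} are hypothetical
names invented for this illustration, and the learner is deliberately trivial
(it predicts the most frequent class label seen so far). The evaluation loop
assumes a test-then-train scheme, which is one common way to score an
incremental learner and is not claimed to be exactly what the KnowledgeFlow does.

\begin{verbatim}
```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of WEKA: a learner whose model
// is just a table of class counts, updated one instance at a time.
class MajorityLearner {
    private final Map<String, Integer> counts =
        new HashMap<String, Integer>();

    // Update the model with the class label of a single instance.
    void update(String classLabel) {
        Integer old = counts.get(classLabel);
        counts.put(classLabel, old == null ? 1 : old + 1);
    }

    // Predict the most frequent label seen so far.
    // Returns null before any update; ties are broken arbitrarily.
    String predict() {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}

class IncrementalDemo {
    // Test-then-train (an assumption of this sketch): predict each
    // label before learning from it, and return the fraction of
    // correct predictions over the whole stream.
    static double prequentialAccuracy(String[] stream) {
        MajorityLearner learner = new MajorityLearner();
        int correct = 0;
        for (String label : stream) {
            if (label.equals(learner.predict())) {
                correct++;
            }
            learner.update(label);
        }
        return stream.length == 0
            ? 0.0 : (double) correct / stream.length;
    }
}
```
\end{verbatim}

\noindent Because the model is only a table of counts, each update costs the
same no matter how many instances came before, and the stream never has to be
held in memory as a whole.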

\noindent This manual is also available online on the \textit{WekaDoc Wiki} \cite{wekadoc}.

The KnowledgeFlow offers the following features:

\begin{itemize}
\item intuitive data flow style layout
\item process data in batches or incrementally
\item process multiple batches or streams in parallel (each separate flow
executes in its own thread)
\item chain filters together
\item view models produced by classifiers for each fold in a cross-validation
\item visualize performance of incremental classifiers during
processing (scrolling plots of classification accuracy, RMS error, etc.)
\end{itemize}
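
\noindent As a reminder of what such scrolling plots track, the two quantities
named above can be written as follows for the first $t$ instances of a stream.
These are the standard textbook definitions, given here as an assumption about
what is plotted rather than as a statement of WEKA's exact computation. Here
$y_i$ is the actual value and $\hat{y}_i$ the prediction for instance $i$, and
$[\cdot]$ is 1 if the condition inside holds and 0 otherwise.

\[
\mathrm{accuracy}_t = \frac{1}{t}\sum_{i=1}^{t} [\hat{y}_i = y_i],
\qquad
\mathrm{RMSE}_t = \sqrt{\frac{1}{t}\sum_{i=1}^{t} (\hat{y}_i - y_i)^2}
\]

\noindent Accuracy applies to a discrete class, while the RMS error in this
form applies to numeric predictions.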

Components available in the KnowledgeFlow:

\subsection{DataSources} All of WEKA's loaders are available.

\epsfig{file=images/components_datasources.eps,height=2cm}

\subsection{DataSinks} All of WEKA's savers are available.

\epsfig{file=images/components_datasinks.eps,height=2cm}

\subsection{Filters} All of WEKA's filters are available.

\epsfig{file=images/components_filters.eps,height=2cm}

\subsection{Classifiers} All of WEKA's classifiers are available.

\epsfig{file=images/components_classifiers.eps,height=2cm}

\subsection{Clusterers} All of WEKA's clusterers are available.

\epsfig{file=images/components_clusterers.eps,height=2cm}

\subsection{Evaluation}

\epsfig{file=images/components_evaluation.eps,height=2cm}

\begin{itemize}
\item \textit{TrainingSetMaker} - make a data set into a training set.
\item \textit{TestSetMaker} - make a data set into a test set.
\item \textit{CrossValidationFoldMaker} - split any data set, training
set or test set into folds.
\item \textit{TrainTestSplitMaker} - split any data set, training set
or test set into a training set and a test set.
\item \textit{ClassAssigner} - assign a column to be the class for any
data set, training set or test set.
\item \textit{ClassValuePicker} - choose a class value to be considered
as the ``positive'' class. This is useful when generating data for ROC style
curves (see \textit{ModelPerformanceChart} below and example \ref{exampleroc}).
\item \textit{ClassifierPerformanceEvaluator} - evaluate the performance of
batch trained/tested classifiers.
\item \textit{IncrementalClassifierEvaluator} - evaluate the performance of
incrementally trained classifiers.
\item \textit{ClustererPerformanceEvaluator} - evaluate the performance of
batch trained/tested clusterers.
\item \textit{PredictionAppender} - append classifier predictions to a test
set. For discrete class problems, can either append predicted class labels or
probability distributions.
\end{itemize}
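
\noindent The splitting performed by a component such as the
\textit{CrossValidationFoldMaker} can be pictured with the following plain
Java sketch. It is an illustration under simplifying assumptions, not WEKA's
implementation: \texttt{FoldSketch} and its methods are hypothetical names,
instances are represented only by their indices, and the sketch neither
shuffles nor stratifies the data.

\begin{verbatim}
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not WEKA code: split the instance
// indices 0..n-1 into k test folds of nearly equal size.
class FoldSketch {

    // Returns k lists; list f holds the test indices of fold f.
    static List<List<Integer>> testFolds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<List<Integer>> folds = new ArrayList<List<Integer>>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i);  // deal indices out round-robin
        }
        return folds;
    }

    // Training indices for one fold: every index that is not in
    // that fold's test list.
    static List<Integer> trainingIndices(int n, int k, int fold) {
        List<Integer> train = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            if (i % k != fold) {
                train.add(i);
            }
        }
        return train;
    }
}
```
\end{verbatim}

\noindent Every instance lands in exactly one test fold, so each instance is
used for testing once and for training in all the other folds.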

\subsection{Visualization}

\epsfig{file=images/components_visualization.eps,height=2cm}

\begin{itemize}
\item \textit{DataVisualizer} - component that can pop up a panel for
visualizing data in a single large 2D scatter plot.
\item \textit{ScatterPlotMatrix} - component that can pop up a panel
containing a matrix of small scatter plots (clicking on a small plot
pops up a large scatter plot).
\item \textit{AttributeSummarizer} - component that can pop up a panel
containing a matrix of histogram plots - one for each of the attributes.
\item \textit{ModelPerformanceChart} - component that can pop up a
panel for visualizing threshold (i.e. ROC style) curves.
\item \textit{TextViewer} - component for showing textual data. Can show
data sets, classification performance statistics etc.
\item \textit{GraphViewer} - component that can pop up a panel for
visualizing tree based models.
\item \textit{StripChart} - component that can pop up a panel that displays
a scrolling plot of data (used for viewing the online performance of
incremental classifiers).
\end{itemize}
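
\noindent For reference, the threshold (ROC style) curves mentioned for the
\textit{ModelPerformanceChart} rest on two standard rates. The notation below
is introduced here for illustration and is not taken from WEKA: for a chosen
positive class and a threshold $\theta$ on the predicted probability of that
class, let $TP(\theta)$, $FP(\theta)$, $TN(\theta)$ and $FN(\theta)$ be the
numbers of true positives, false positives, true negatives and false negatives.

\[
\mathrm{TPR}(\theta) = \frac{TP(\theta)}{TP(\theta) + FN(\theta)},
\qquad
\mathrm{FPR}(\theta) = \frac{FP(\theta)}{FP(\theta) + TN(\theta)}
\]

\noindent A ROC curve plots $\mathrm{TPR}(\theta)$ against
$\mathrm{FPR}(\theta)$ as $\theta$ varies, which is why a ``positive'' class
value has to be chosen before such a curve can be drawn.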

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Example: cross-validated J48 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Cross-validated J48}
This example sets up a flow that loads an ARFF file (batch mode) and
performs a cross-validation using J48 (WEKA's C4.5 implementation).

\epsfig{file=images/example_j48.eps,height=4.5cm}

\begin{itemize}
\item Click on the \textit{DataSources} tab and choose \textit{ArffLoader} from the
toolbar (the mouse pointer will change to a \textit{crosshair}).

\item Next place the ArffLoader component on the layout area by clicking
somewhere on the layout (a copy of the ArffLoader icon will appear on
the layout area).

\item Next specify an ARFF file to load by first right-clicking the mouse
over the ArffLoader icon on the layout. A pop-up menu will
appear. Select \textit{Configure} under \textit{Edit} in the list from this menu and
browse to the location of your ARFF file.

\item Next click the \textit{Evaluation} tab at the top of the window and choose the
\textit{ClassAssigner} component (which lets you choose which column is the class)
from the toolbar. Place this on the layout.

\item Now connect the ArffLoader to the ClassAssigner: first right-click
over the ArffLoader and select \textit{dataSet} under \textit{Connections} in
the menu. A \textit{rubber band} line will appear. Move the mouse over the
ClassAssigner component and left-click - a red line labeled \textit{dataSet}
will connect the two components.

\item Next right-click over the ClassAssigner and choose \textit{Configure} from
the menu. This will pop up a window from which you can specify which
column is the class in your data (last is the default).

\item Next grab a \textit{CrossValidationFoldMaker} component from the Evaluation
toolbar and place it on the layout. Connect the ClassAssigner to the
CrossValidationFoldMaker by right-clicking over \textit{ClassAssigner} and
selecting \textit{dataSet} from under \textit{Connections} in the menu.

\item Next click on the \textit{Classifiers} tab at the top of the window and
scroll along the toolbar until you reach the \textit{J48} component in the
\textit{trees} section. Place a J48 component on the layout.

\item Connect the CrossValidationFoldMaker to J48 TWICE by first choosing
\textit{trainingSet} and then \textit{testSet} from the pop-up menu for the
CrossValidationFoldMaker.

\item Next go back to the \textit{Evaluation} tab and place a
\textit{ClassifierPerformanceEvaluator} component on the layout. Connect J48
to this component by selecting the \textit{batchClassifier} entry from the
pop-up menu for J48.

\item Next go to the \textit{Visualization} toolbar and place a \textit{TextViewer}
component on the layout. Connect the ClassifierPerformanceEvaluator to
the TextViewer by selecting the \textit{text} entry from the pop-up menu for
ClassifierPerformanceEvaluator.

\item Now start the flow executing by selecting \textit{Start loading} from the
pop-up menu for ArffLoader. Depending on how big the data set is and
how long cross-validation takes, you will see some animation from some
of the icons in the layout (J48's tree will \textit{grow} in the icon and the
ticks will animate on the ClassifierPerformanceEvaluator). You will
also see some progress information in the \textit{Status} bar and \textit{Log} at
the bottom of the window.
\end{itemize}

When finished, you can view the results by choosing \textit{Show results} from
the pop-up menu for the \textit{TextViewer} component.

Other cool things to add to this flow: connect a \textit{TextViewer} and/or a
\textit{GraphViewer} to J48 in order to view the textual or graphical
representations of the trees produced for each fold of the
cross-validation (this is something that is not possible in the Explorer).
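
\noindent The flow assembled in this example can also be written down as a
list of labeled connections. The following plain Java sketch does only that:
it does not use or imitate the KnowledgeFlow API, and \texttt{FlowSketch} and
\texttt{Connection} are hypothetical names invented for this illustration.
The component and connection names are the ones used in the steps of this
example.

\begin{verbatim}
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the KnowledgeFlow API.
// It records the example flow as a list of labeled connections.
class FlowSketch {

    // One arrow on the layout: source --type--> target.
    static final class Connection {
        final String source;
        final String type;
        final String target;

        Connection(String source, String type, String target) {
            this.source = source;
            this.type = type;
            this.target = target;
        }

        @Override
        public String toString() {
            return source + " --" + type + "--> " + target;
        }
    }

    // The connections made in the cross-validated J48 example.
    static List<Connection> crossValidatedJ48() {
        String cv = "CrossValidationFoldMaker";
        String eval = "ClassifierPerformanceEvaluator";
        List<Connection> flow = new ArrayList<Connection>();
        flow.add(new Connection("ArffLoader", "dataSet",
                                "ClassAssigner"));
        flow.add(new Connection("ClassAssigner", "dataSet", cv));
        flow.add(new Connection(cv, "trainingSet", "J48"));
        flow.add(new Connection(cv, "testSet", "J48"));
        flow.add(new Connection("J48", "batchClassifier", eval));
        flow.add(new Connection(eval, "text", "TextViewer"));
        return flow;
    }

    public static void main(String[] args) {
        for (Connection c : crossValidatedJ48()) {
            System.out.println(c);
        }
    }
}
```
\end{verbatim}

\noindent Reading the list from top to bottom follows the data from the
loader to the viewer. The two connections from the CrossValidationFoldMaker
to J48 correspond to the step in which the components are connected twice.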

%%%%%%%%%%%%%%%%%%%%%%%%%
% Example: multiple ROC %
%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Plotting multiple ROC curves}
\label{exampleroc}

The KnowledgeFlow can draw multiple ROC curves in the same plot window, something that the
Explorer cannot do. In this example we use \textit{J48} and \textit{RandomForest}
as classifiers. This example can be found on the \textit{WekaWiki} as well \cite{multipleroc}.

\epsfig{file=images/example_multiple_roc.eps,height=4cm}

\begin{itemize}
\item Click on the \textit{DataSources} tab and choose \textit{ArffLoader} from the
toolbar (the mouse pointer will change to a \textit{crosshair}).

\item Next place the ArffLoader component on the layout area by clicking
somewhere on the layout (a copy of the ArffLoader icon will appear on
the layout area).

\item Next specify an ARFF file to load by first right-clicking the mouse
over the ArffLoader icon on the layout. A pop-up menu will
appear. Select \textit{Configure} under \textit{Edit} in the list from this menu and
browse to the location of your ARFF file.

\item Next click the \textit{Evaluation} tab at the top of the window and choose the
\textit{ClassAssigner} component (which lets you choose which column is the class)
from the toolbar. Place this on the layout.

\item Now connect the ArffLoader to the ClassAssigner: first right-click
over the ArffLoader and select \textit{dataSet} under \textit{Connections} in
the menu. A \textit{rubber band} line will appear. Move the mouse over the
ClassAssigner component and left-click - a red line labeled \textit{dataSet}
will connect the two components.

\item Next right-click over the ClassAssigner and choose \textit{Configure} from
the menu. This will pop up a window from which you can specify which
column is the class in your data (last is the default).

\item Next choose the \textit{ClassValuePicker} component (which lets you choose which class
label is evaluated in the ROC) from the toolbar. Place this on the layout,
then right-click over \textit{ClassAssigner}, select \textit{dataSet} from under
\textit{Connections} in the menu and connect it to the \textit{ClassValuePicker}.

\item Next grab a \textit{CrossValidationFoldMaker} component from the Evaluation
toolbar and place it on the layout. Connect the ClassValuePicker to the
CrossValidationFoldMaker by right-clicking over \textit{ClassValuePicker} and
selecting \textit{dataSet} from under \textit{Connections} in the menu.

\item Next click on the \textit{Classifiers} tab at the top of the window and
scroll along the toolbar until you reach the \textit{J48} component in the
\textit{trees} section. Place a J48 component on the layout.

\item Connect the CrossValidationFoldMaker to J48 TWICE by first choosing
\textit{trainingSet} and then \textit{testSet} from the pop-up menu for the
CrossValidationFoldMaker.

\item Repeat these two steps with the RandomForest classifier.

\item Next go back to the \textit{Evaluation} tab and place a
\textit{ClassifierPerformanceEvaluator} component on the layout. Connect J48
to this component by selecting the \textit{batchClassifier} entry from the
pop-up menu for J48. Add another \textit{ClassifierPerformanceEvaluator} for
RandomForest and connect them via \textit{batchClassifier} as well.

\item Next go to the \textit{Visualization} toolbar and place a
\textit{ModelPerformanceChart} component on the layout. Connect both
ClassifierPerformanceEvaluators to the ModelPerformanceChart by selecting
the \textit{thresholdData} entry from the pop-up menu for each ClassifierPerformanceEvaluator.

\item Now start the flow executing by selecting \textit{Start loading} from the
pop-up menu for ArffLoader. Depending on how big the data set is and
how long cross-validation takes, you will see some animation from some
of the icons in the layout. You will also see some progress information in the
\textit{Status} bar and \textit{Log} at the bottom of the window.

\item Select \textit{Show plot} from the pop-up menu of the
\textit{ModelPerformanceChart} under the \textit{Actions} section.
\end{itemize}

Here are the two ROC curves generated from the UCI dataset \textit{credit-g},
evaluated on the class label \textit{good}:

\epsfig{file=images/example_multiple_roc_output.eps,height=8.5cm}
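
\noindent For readers curious about how the points of a ROC curve arise, the
following plain Java sketch computes them from a list of scores for the
positive class. It is a textbook construction under stated assumptions (at
least one positive and one negative instance, no tied scores), not WEKA's
implementation, and \texttt{RocSketch} and its methods are hypothetical names.

\begin{verbatim}
```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration, not WEKA code: ROC points from scores
// for the positive class and the true labels.
class RocSketch {

    // Returns points {fpr, tpr} from (0,0) to (1,1), obtained by
    // lowering the decision threshold one instance at a time.
    // Assumes at least one positive, one negative, no tied scores.
    static double[][] rocPoints(final double[] scores,
                                boolean[] positive) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        int pos = 0;
        for (int i = 0; i < n; i++) {
            order[i] = i;
            if (positive[i]) {
                pos++;
            }
        }
        int neg = n - pos;
        // Sort instance indices by descending score.
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(scores[b], scores[a]);
            }
        });
        double[][] points = new double[n + 1][2];  // points[0] = (0,0)
        int tp = 0;
        int fp = 0;
        for (int r = 0; r < n; r++) {
            if (positive[order[r]]) {
                tp++;
            } else {
                fp++;
            }
            points[r + 1][0] = (double) fp / neg;
            points[r + 1][1] = (double) tp / pos;
        }
        return points;
    }

    // Area under the curve by the trapezoidal rule.
    static double auc(double[][] points) {
        double area = 0.0;
        for (int i = 1; i < points.length; i++) {
            double width = points[i][0] - points[i - 1][0];
            area += width * (points[i][1] + points[i - 1][1]) / 2.0;
        }
        return area;
    }
}
```
\end{verbatim}

\noindent Each classifier yields its own list of points, which is why two
classifiers evaluated on the same positive class value can be drawn in the
same plot and compared directly.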

\begin{thebibliography}{999}
\bibitem{witten} Witten, I.H. and Frank, E. (2005) \textit{Data Mining: Practical machine
learning tools and techniques. 2nd edition} Morgan Kaufmann, San
Francisco.
\bibitem{wekadoc} \textit{WekaDoc} -- \texttt{http://weka.sourceforge.net/wekadoc/}
\bibitem{wekawiki} \textit{WekaWiki} -- \texttt{http://weka.sourceforge.net/wekawiki/}
\bibitem{multipleroc} \textit{Plotting multiple ROC curves} on \textit{WekaWiki} -- \\
{\small\texttt{http://weka.sourceforge.net/wiki/index.php/Plotting\_multiple\_ROC\_curves}}
\end{thebibliography}