
The new menu-driven GUI in WEKA (class \texttt{weka.gui.Main}) succeeds the old GUI Chooser (class \texttt{weka.gui.GUIChooser}). Its MDI (``multiple document interface'') appearance makes it easier to keep track of all the open windows. If you prefer an SDI (``single document interface'') layout, you can request it with the option \texttt{-gui sdi} on the command line.

\begin{center}

\epsfig{file=images/main.eps,height=3cm}

\end{center}

\begin{wrapfigure}{r}{0.1cm}

\vspace{0.2cm}

\epsfig{file=images/main_program.eps,height=1.2cm}

\vspace{0.2cm}

\epsfig{file=images/main_applications.eps,height=1.8cm}

\vspace{0.2cm}

\epsfig{file=images/main_tools.eps,height=1.2cm}

\vspace{0.2cm}

\epsfig{file=images/main_visualization.eps,height=2.1cm}

\vspace{0.2cm}

\epsfig{file=images/main_windows.eps,height=2.4cm}

\vspace{0.2cm}

\epsfig{file=images/main_help.eps,height=2.1cm}

\end{wrapfigure}

The menu consists of six sections:

\begin{enumerate}

\item \textbf{Program}

\begin{itemize}

\item \textbf{LogWindow} Opens a log window that captures all that is printed to \textit{stdout} or \textit{stderr}. Useful for environments like MS Windows, where WEKA is normally not started from a terminal.

\item \textbf{Exit} Closes WEKA.

\end{itemize}

\item \textbf{Applications} Lists the main applications within WEKA.

\begin{itemize}

\item \textbf{Explorer} An environment for exploring data with
WEKA (the rest of this documentation deals with this application in more detail).

\item \textbf{Experimenter} An environment for performing experiments and conducting statistical tests
between learning schemes.

\item \textbf{KnowledgeFlow} This environment supports essentially
the same functions as the Explorer but with a drag-and-drop
interface. One advantage is that it supports incremental learning.

\item \textbf{SimpleCLI} Provides a simple command-line interface
that allows direct execution of WEKA commands for operating systems
that do not provide their own command line interface.

\end{itemize}

\item \textbf{Tools} Other useful applications.

\begin{itemize}

\item \textbf{ArffViewer} An MDI application for viewing ARFF files in spreadsheet format.

\item \textbf{SqlViewer} An SQL worksheet for querying databases via JDBC.

\end{itemize}

\item \textbf{Visualization} Ways of visualizing data with WEKA.

\begin{itemize}

\item \textbf{Plot} For plotting a 2D plot of a dataset.

\item \textbf{ROC} Displays a previously saved ROC curve.

\item \textbf{TreeVisualizer} For displaying directed graphs, e.g., a decision tree.

\item \textbf{GraphVisualizer} Visualizes XML BIF or DOT format graphs, e.g., for Bayesian networks.

\item \textbf{BoundaryVisualizer} Allows the visualization of classifier decision boundaries in two dimensions.

\end{itemize}

\item \textbf{Windows} All open windows are listed here.

\begin{itemize}

\item \textbf{Minimize} Minimizes all current windows.

\item \textbf{Restore} Restores all minimized windows again.

\end{itemize}

\item \textbf{Help} Online resources for WEKA can be found here.

\begin{itemize}

\item \textbf{Weka homepage} Opens a browser window with WEKA's homepage.

\item \textbf{Online documentation} Directs to the WekaDoc Wiki \cite{wekadoc}.

\item \textbf{HOWTOs, code snippets, etc.} The general WekaWiki \cite{wekawiki}, containing lots of examples and HOWTOs around the development and use of WEKA.

\item \textbf{Weka on Sourceforge} WEKA's project homepage on Sourceforge.net.

\item \textbf{SystemInfo} Lists some internals about the Java/WEKA environment, e.g., the \texttt{CLASSPATH}.

\item \textbf{About} The infamous ``About'' box.

\end{itemize}

\end{enumerate}

To make it easy for the user to add new functionality to the menu without having to modify
the code of WEKA itself, the GUI now offers a plugin mechanism for such add-ons.
Thanks to dynamic class discovery, a plugin only needs to implement the
\texttt{weka.gui.MainMenuExtension} interface, and WEKA must be told which package
it resides in; the plugin is then displayed in the menu under
``Extensions'' (this extra menu appears automatically as soon as extensions are discovered).
More details can be found in the Wiki article ``Extensions for Weka's main GUI''
\cite{mainextensions}.

If you launch WEKA from a terminal window, some text begins scrolling in the
terminal. Ignore this text unless something goes wrong, in which case it can
help in tracking down the cause (the \textit{LogWindow} from the \textit{Program} menu
displays that information as well).

This User Manual, which is also available online on the \textit{WekaDoc Wiki}
\cite{wekadoc}, focuses on using the Explorer but does not explain
the individual data preprocessing tools and learning algorithms in
WEKA. For more information on the various filters and learning methods
in WEKA, see the book {\em Data Mining} \cite{witten}.

\newpage

\section{The WEKA Explorer}

\subsection{Section Tabs}

At the very top of the window, just below the title bar, is a row of
tabs. When the Explorer is first started only the first tab is active;
the others are greyed out. This is because it is necessary to open
(and potentially pre-process) a data set before starting to explore
the data.

The tabs are as follows:

\begin{enumerate}
\item \textbf{Preprocess}.
Choose and modify the data being acted on.
\item \textbf{Classify}.
Train and test learning schemes that classify or perform regression.
\item \textbf{Cluster}.
Learn clusters for the data.
\item \textbf{Associate}.
Learn association rules for the data.
\item \textbf{Select attributes}.
Select the most relevant attributes in the data.
\item \textbf{Visualize}.
View an interactive 2D plot of the data.
\end{enumerate}
\noindent
Once the tabs are active, clicking on them flicks between different
screens, on which the respective actions can be performed. The bottom
area of the window (including the status box, the log button, and the
Weka bird) stays visible regardless of which section you are in.

The Explorer can be easily extended with custom tabs. The Wiki
article ``Adding tabs in the Explorer'' \cite{explorertabs} explains
this in detail.


\subsection{Status Box}

The status box appears at the very bottom of the window. It displays
messages that keep you informed about what's going on. For example, if
the Explorer is busy loading a file, the status box says so.

\textbf{TIP}: right-clicking the mouse anywhere inside the status box brings
up a little menu. The menu gives two options:

\begin{enumerate}
\item \textbf{Memory information}.
Display in the log box the amount of memory available to WEKA.
\item \textbf{Run garbage collector}.
Ask the Java garbage collector to search for memory that is no longer needed
and free it up, allowing more memory for new tasks. Note that the garbage
collector runs as a background task anyway.
\end{enumerate}
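The two menu entries above correspond to information that any Java program can obtain from the standard \texttt{java.lang.Runtime} class. The following is a minimal sketch of that idea, not WEKA's own code: the class name \texttt{MemoryInfo} and the method \texttt{snapshot} are hypothetical names chosen for illustration. Note that \texttt{System.gc()} is only a request, which the JVM is free to ignore.

```java
final class MemoryInfo {
    /** Returns {used, total, max} heap sizes in bytes as reported by the JVM. */
    static long[] snapshot() {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();        // heap currently reserved by the JVM
        long used = total - rt.freeMemory();  // part of that heap currently in use
        return new long[] {used, total, rt.maxMemory()};
    }

    public static void main(String[] args) {
        long[] before = snapshot();
        System.gc(); // a request only; the JVM decides whether and when to collect
        long[] after = snapshot();
        System.out.printf("used before: %d bytes, used after: %d bytes, max: %d bytes%n",
                before[0], after[0], after[2]);
    }
}
```

The maximum heap size reported here is the limit set when the JVM was started (for example with the standard \texttt{-Xmx} option), which is why large datasets may require starting WEKA with more memory.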

\subsection{Log Button}

Clicking on this button brings up a separate window containing a scrollable text
field. Each line of text is stamped with the time it was entered into the
log. As you perform actions in WEKA, the log keeps a record of what has
happened. For people using the command line or the SimpleCLI, the log now also
contains the full setup strings for classification, clustering, attribute selection,
etc., so that it is possible to copy/paste them elsewhere.
Options for dataset(s) and, if applicable, the class attribute still have to
be provided by the user (e.g., \texttt{-t} for classifiers or \texttt{-i}
and \texttt{-o} for filters).

\subsection{WEKA Status Icon}

To the right of the status box is the WEKA status icon. When no processes are
running, the bird sits down and takes a nap. The number beside the $\times$
symbol gives the number of concurrent processes running. When the system is
idle it is zero, but it increases as the number of processes increases.
When any process is started, the bird gets up and starts moving around. If
it's standing but stops moving for a long time, it's sick: something has gone
wrong! In that case you should restart the WEKA Explorer.

\subsection{Graphical output}

Most graphical displays in WEKA, e.g., the GraphVisualizer or the
TreeVisualizer, support saving the output to a file. A dialog for saving
the output can be brought up with \textit{Alt+Shift+left-click}.
Supported formats are currently Windows Bitmap, JPEG, PNG and EPS
(encapsulated Postscript). The dialog also allows you to specify the
dimensions of the generated image.

\newpage

\section{Preprocessing}

\begin{center}
\epsfig{file=images/explorer_preprocess.eps,height=7cm}
\end{center}

\subsection{Loading Data}

The first four buttons at the top of the preprocess section enable
you to load data into WEKA:

\begin{enumerate}
\item \textbf{Open file...}
Brings up a dialog box allowing you to browse for the data file on the local
file system.
\item \textbf{Open URL...}
Asks for the Uniform Resource Locator (URL) address where the data is stored.
\item \textbf{Open DB...} Reads data from a database. (Note that to
make this work you might have to edit the file
\texttt{weka/experiment/DatabaseUtils.props}.)
\item \textbf{Generate...} Enables you to generate artificial data
from a variety of DataGenerators.
\end{enumerate}
\noindent
Using the \textbf{Open file...} button you can read files in a variety
of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized
Instances format. ARFF files typically have a {\em .arff\/}
extension, CSV files a {\em .csv\/} extension, C4.5 files a {\em
.data\/} and {\em .names\/} extension, and serialized Instances
objects a {\em .bsi\/} extension.

\textbf{NB:} This list of formats can be extended by adding custom file
converters to the \texttt{weka.core.converters} package.

\subsection{The Current Relation}

Once some data has been loaded, the Preprocess panel shows a variety
of information. The \textbf{Current relation} box (the ``current
relation'' is the currently loaded data, which can be interpreted as a
single relational table in database terminology) has three entries:

\begin{enumerate}
\item \textbf{Relation}.
The name of the relation, as given in the file it was loaded from. Filters
(described below) modify the name of a relation.
\item \textbf{Instances}.
The number of instances (data points/records) in the data.
\item \textbf{Attributes}.
The number of attributes (features) in the data.
\end{enumerate}

\subsection{Working With Attributes}

Below the \textbf{Current relation} box is a box titled \textbf{Attributes}.
There are four buttons, and beneath them is a list of the attributes in the
current relation. The list has three columns:

\begin{enumerate}
\item \textbf{No.}
A number that identifies the attribute by its position in the
data file.
\item \textbf{Selection tick boxes}.
These allow you to select which attributes are present in the relation.
\item \textbf{Name}.
The name of the attribute, as it was declared in the data file.
\end{enumerate}

When you click on different rows in the list of attributes, the fields
change in the box to the right titled \textbf{Selected
attribute}. This box displays the characteristics of the currently
highlighted attribute in the list:

\begin{enumerate}
\item \textbf{Name}.
The name of the attribute, the same as that given in the attribute list.
\item \textbf{Type}.
The type of attribute, most commonly Nominal or Numeric.
\item \textbf{Missing}.
The number (and percentage) of instances in the data for which this attribute
is missing (unspecified).
\item \textbf{Distinct}.
The number of different values that the data contains for this attribute.
\item \textbf{Unique}.
The number (and percentage) of instances in the data having a value for this
attribute that no other instances have.
\end{enumerate}
\noindent
Below these statistics is a list showing more information about the
values stored in this attribute, which differs depending on its type.
If the attribute is nominal, the list consists of each possible value
for the attribute along with the number of instances that have that
value. If the attribute is numeric, the list gives four statistics
describing the distribution of values in the data: the minimum,
maximum, mean and standard deviation. Below these statistics
there is a coloured histogram, colour-coded according to the attribute
chosen as the {\it Class} using the box above the histogram. (This box
will bring up a drop-down list of available selections when clicked.)
Note that only nominal {\it Class} attributes will result in a
colour-coding. Finally, after pressing the \textbf{Visualize All}
button, histograms for all the attributes in the data are shown in a
separate window.
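The difference between the \textbf{Missing}, \textbf{Distinct} and \textbf{Unique} counts is easy to confuse, so the following sketch computes all three for a single attribute column. It only illustrates the definitions given above and is not taken from WEKA: the class \texttt{AttributeSummary} is a hypothetical name, and the use of \texttt{null} to mark a missing value is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AttributeSummary {
    final int missing;   // instances with no value for the attribute
    final int distinct;  // number of different values that occur
    final int unique;    // instances whose value occurs exactly once

    private AttributeSummary(int missing, int distinct, int unique) {
        this.missing = missing;
        this.distinct = distinct;
        this.unique = unique;
    }

    /** Summarizes one attribute column; a null entry stands for a missing value. */
    static AttributeSummary of(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        int missing = 0;
        for (String value : column) {
            if (value == null) {
                missing++;
            } else {
                counts.merge(value, 1, Integer::sum);
            }
        }
        int unique = 0;
        for (int count : counts.values()) {
            if (count == 1) {
                unique++;
            }
        }
        return new AttributeSummary(missing, counts.size(), unique);
    }

    public static void main(String[] args) {
        AttributeSummary s = of(Arrays.asList("sunny", "rainy", "sunny", null, "overcast"));
        System.out.println("missing=" + s.missing
                + " distinct=" + s.distinct + " unique=" + s.unique);
    }
}
```

In the example column, ``sunny'' occurs twice, so it counts towards the distinct values but not towards the unique ones.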

Returning to the attribute list, to begin with all the tick boxes are unticked.
They can be toggled on/off by clicking on them individually. The four buttons
above can also be used to change the selection:

\begin{enumerate}
\item \textbf{All}.
All boxes are ticked.
\item \textbf{None}.
All boxes are cleared (unticked).
\item \textbf{Invert}.
Boxes that are ticked become unticked and {\em vice versa\/}.
\item \textbf{Pattern}.
Enables the user to select attributes based on a Perl 5 Regular Expression. E.g.,
\texttt{.*\_id} selects all attributes whose name ends with \texttt{\_id}.
\end{enumerate}
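As an illustration of the \textbf{Pattern} option, the sketch below selects attribute names with a regular expression. Two assumptions are made here: the expression has to match the whole attribute name (which is consistent with the \texttt{.*\_id} example above), and Java's \texttt{java.util.regex} is used as a stand-in for Perl 5 syntax, which it closely follows for simple expressions like this one. The class \texttt{AttributePattern} and the attribute names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class AttributePattern {
    /** Returns the attribute names that the expression matches as a whole. */
    static List<String> select(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("customer_id", "age", "id_number", "order_id");
        // Only the names ending in "_id" are selected.
        System.out.println(select(names, ".*_id"));
    }
}
```

Because the whole name has to match, \texttt{id\_number} is not selected even though it contains \texttt{id}.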

Once the desired attributes have been selected, they can be removed by
clicking the \textbf{Remove} button below the list of attributes.
Note that this can be undone by clicking the \textbf{Undo} button,
which is located next to the \textbf{Edit} button in the top-right
corner of the Preprocess panel.

\subsection{Working With Filters}

\begin{center}
\epsfig{file=images/explorer_preprocess_filter.eps,height=7cm}
\end{center}

The preprocess section allows filters to be defined that transform the
data in various ways. The \textbf{Filter} box is used to set up the
filters that are required. At the left of the \textbf{Filter} box is
a \textbf{Choose} button. By clicking this button it is possible to
select one of the filters in WEKA. Once a filter has been selected,
its name and options are shown in the field next to the
\textbf{Choose} button. Clicking on this box with the \textit{left} mouse button
brings up a GenericObjectEditor dialog box. A click with the \textit{right}
mouse button (or \textit{Alt+Shift+left click}) brings up a menu where you can
choose either to display the properties in a GenericObjectEditor dialog
box or to copy the current setup string to the clipboard.

\subsubsection*{The GenericObjectEditor Dialog Box}

The GenericObjectEditor dialog box lets you configure a filter. The
same kind of dialog box is used to configure other objects, such as
classifiers and clusterers (see below). The fields in the window
reflect the available options.

Right-clicking (or \textit{Alt+Shift+Left-Click}) on such a field
will bring up a popup menu, listing the following options:
\begin{enumerate}
\item \textbf{Show properties...} has the same effect as left-clicking
on the field, i.e., a dialog appears allowing you to alter the settings.
\item \textbf{Copy configuration to clipboard} copies the currently
displayed configuration string to the system's clipboard, so that it
can be used anywhere else in WEKA or in the console. This is rather handy
if you have to set up complicated, nested schemes.
\item \textbf{Enter configuration...} is the ``receiving'' end for
configurations that got copied to the clipboard earlier on. In this dialog you
can enter a classname followed by options (if the class supports these).
This also allows you to transfer a filter setting from the Preprocess
panel to a \texttt{FilteredClassifier} used in the Classify panel.
\end{enumerate}

Left-clicking on any of the fields gives you an
opportunity to alter the filter's settings. For example, the setting
may take a text string, in which case you type the string into the
text field provided. Or it may give a drop-down box listing several
states to choose from. Or it may do something else, depending on the
information required. Information on the options is provided in a tool
tip if you let the mouse pointer hover over the corresponding
field. More information on the filter and its options can be obtained
by clicking on the \textbf{More} button in the \textbf{About} panel at
the top of the GenericObjectEditor window.

Some objects display a brief description of what they do in an \textbf{About}
box, along with a \textbf{More} button. Clicking on the \textbf{More} button
brings up a window describing what the different options do. Others have an
additional button, \textit{Capabilities}, which lists the types of
attributes and classes the object can handle.

At the bottom of the GenericObjectEditor dialog are four buttons. The first
two, \textbf{Open...} and \textbf{Save...} allow object configurations to be
stored for future use. The \textbf{Cancel} button backs out without remembering
any changes that have been made. Once you are happy with the object and
settings you have chosen, click \textbf{OK} to return to the main Explorer
window.

\subsubsection*{Applying Filters}

Once you have selected and configured a filter, you can apply it to
the data by pressing the \textbf{Apply} button at the right end of the
\textbf{Filter} panel in the Preprocess panel. The Preprocess panel
will then show the transformed data. The change can be undone by
pressing the \textbf{Undo} button. You can also use the
\textbf{Edit...} button to modify your data manually in a dataset
editor. Finally, the \textbf{Save...} button at the top right of the
Preprocess panel saves the current version of the relation in file
formats that can represent the relation, allowing it to be kept for future
use. \\

\noindent \textbf{Note:} Some of the filters behave differently
depending on whether a class attribute has been set or not (using the
box above the histogram, which will bring up a drop-down list of
possible selections when clicked). In particular, the ``supervised
filters'' require a class attribute to be set, and some of the
``unsupervised attribute filters'' will skip the class attribute if
one is set. Note that it is also possible to set {\em Class} to {\em
None}, in which case no class is set.

\newpage

\section{Classification}

\begin{center}
\epsfig{file=images/explorer_classify.eps,height=7cm}
\end{center}

\subsection{Selecting a Classifier}

\label{sec:classifier}
At the top of the classify section is the \textbf{Classifier}
box. This box has a text field that gives the name of the currently
selected classifier, and its options. Clicking on the text box with
the left mouse button brings up a GenericObjectEditor dialog box,
just the same as for filters, that you can use to configure the options
of the current classifier. With a \textit{right click} (or
\textit{Alt+Shift+left click}) you can once again copy
the setup string to the clipboard or display the properties in a
GenericObjectEditor dialog box.
The \textbf{Choose} button allows you to choose one of the classifiers
that are available in WEKA.

\subsection{Test Options}

The result of applying the chosen classifier will be tested according to the
options that are set by clicking in the \textbf{Test options} box. There are
four test modes:

\begin{enumerate}
\item \textbf{Use training set}.
The classifier is evaluated on how well it predicts the class of the instances
it was trained on.
\item \textbf{Supplied test set}.
The classifier is evaluated on how well it predicts the class of a set of
instances loaded from a file. Clicking the \textbf{Set...} button brings up a
dialog allowing you to choose the file to test on.
\item \textbf{Cross-validation}.
The classifier is evaluated by cross-validation, using the number of folds that
are entered in the \textbf{Folds} text field.
\item \textbf{Percentage split}.
The classifier is evaluated on how well it predicts a certain percentage of the
data which is held out for testing. The amount of data held out depends on the
value entered in the \textbf{\%} field.
\end{enumerate}
\noindent
\textbf{Note:} No matter which evaluation method is used, the model
that is output is always the one built from \textbf{\em all} the training data.
\noindent
Further testing options can be set by clicking on the \textbf{More options...}
button:

\begin{enumerate}
\item \textbf{Output model}.
The classification model on the full training set is output so that it can be
viewed, visualized, etc. This option is selected by default.
\item \textbf{Output per-class stats}. The precision/recall and
true/false statistics for each class are output. This option is also
selected by default.
\item \textbf{Output entropy evaluation measures}. Entropy evaluation
measures are included in the output. This option is not selected by
default.
\item \textbf{Output confusion matrix}.
The confusion matrix of the classifier's predictions is included in the output.
This option is selected by default.
\item \textbf{Store predictions for visualization}. The classifier's
predictions are remembered so that they can be visualized. This option
is selected by default.
\item \textbf{Output predictions}. The predictions on the evaluation
data are output. Note that in the case of a cross-validation the
instance numbers do not correspond to the location in the data!
\item \textbf{Output additional attributes}. If additional attributes need to
be output alongside the predictions, e.g., an ID attribute for tracking
misclassifications, then the index of this attribute can be specified here. The
usual Weka ranges are supported; ``first'' and ``last'' are therefore valid
indices as well (example: ``first-3,6,8,12-last'').
\item \textbf{Cost-sensitive evaluation}.
The errors are evaluated with respect to a cost matrix. The \textbf{Set...}
button allows you to specify the cost matrix used.
\item \textbf{Random seed for xval / \% Split}.
This specifies the random seed used when randomizing the data before it is
divided up for evaluation purposes.
\item \textbf{Preserve order for \% Split}.
This suppresses the randomization of the data before splitting into train
and test set.
\item \textbf{Output source code}.
If the classifier can output the built model as Java source code, you can
specify the class name here. The code will be printed in the ``Classifier
output'' area.
\end{enumerate}
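The interplay of \textbf{Percentage split}, the random seed and \textbf{Preserve order for \% Split} can be sketched as follows. This is an illustration under stated assumptions rather than WEKA's implementation: the percentage is taken to be the share of instances used for training (the remainder being held out for testing), the training size is rounded to the nearest integer, and the class \texttt{PercentageSplit} with its parameters is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class PercentageSplit {
    /**
     * Splits the instance indices 0..n-1 into a training part and a test part.
     * Returns a list with two entries: training indices, then test indices.
     */
    static List<List<Integer>> split(int n, double trainPercent, long seed,
                                     boolean preserveOrder) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        if (!preserveOrder) {
            // The same seed always yields the same shuffle, hence the same split.
            Collections.shuffle(order, new Random(seed));
        }
        int trainSize = (int) Math.round(n * trainPercent / 100.0);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(order.subList(0, trainSize)));
        parts.add(new ArrayList<>(order.subList(trainSize, n)));
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(10, 66, 1L, false);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

With the order preserved, the first part of the data is used for training and the rest for testing, which matters for data with a natural order such as time series.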

\subsection{The Class Attribute}

The classifiers in WEKA are designed to be trained to predict a single `class'
attribute, which is the target for prediction. Some classifiers can only learn
nominal classes; others can only learn numeric classes (regression problems);
still others can learn both.

By default, the class is taken to be the last attribute in the data. If you
want to train a classifier to predict a different attribute, click on the box
below the \textbf{Test options} box to bring up a drop-down list of attributes
to choose from.

\subsection{Training a Classifier}

Once the classifier, test options and class have all been set, the learning
process is started by clicking on the \textbf{Start} button. While the
classifier is busy being trained, the little bird moves around. You can stop
the training process at any time by clicking on the \textbf{Stop} button.

When training is complete, several things happen. The \textbf{Classifier
output} area to the right of the display is filled with text describing the
results of training and testing. A new entry appears in the \textbf{Result
list} box. We look at the result list below; but first we investigate the text
that has been output.

\subsection{The Classifier Output Text}

The text in the \textbf{Classifier output} area has scroll bars allowing you to
browse the results. Clicking with the left mouse button into the text area,
while holding \texttt{Alt} and \texttt{Shift}, brings up a dialog that enables
you to save the displayed output in a variety of formats (currently BMP, EPS, JPEG and PNG).
Of course, you can also resize the Explorer window to get a larger display area.
The output is split into several sections:

\begin{enumerate}
\item \textbf{Run information}.
A list of information giving the learning scheme options, relation name,
instances, attributes and test mode that were involved in the process.
\item \textbf{Classifier model (full training set)}.
A textual representation of the classification model that was produced on the
full training data.
\item The results of the chosen test mode are broken down thus:
\item \textbf{Summary}.
A list of statistics summarizing how accurately the classifier was able to
predict the true class of the instances under the chosen test mode.
\item \textbf{Detailed Accuracy By Class}.
A more detailed per-class breakdown of the classifier's prediction accuracy.
\item \textbf{Confusion Matrix}.
Shows how many instances have been assigned to each class. Elements show the
number of test examples whose actual class is the row and whose predicted class
is the column.
\item \textbf{Source code} (optional).
This section lists the Java source code if one chose ``Output source code'' in
the ``More options'' dialog.
\end{enumerate}
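The row/column convention of the confusion matrix described above can be made concrete with a small sketch. It is not WEKA's own code; the class \texttt{ConfusionMatrix} and the encoding of classes as integer indices are assumptions made for this example.

```java
final class ConfusionMatrix {
    /** counts[a][p] = number of test instances with actual class a predicted as class p. */
    static int[][] build(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[][] counts = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            counts[actual[i]][predicted[i]]++; // row = actual class, column = predicted class
        }
        return counts;
    }

    /** Fraction of instances on the diagonal, i.e. the correctly classified ones. */
    static double accuracy(int[][] counts) {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts[a].length; p++) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1, 1};
        int[] predicted = {0, 1, 1, 1, 0};
        int[][] m = build(actual, predicted, 2);
        System.out.println(java.util.Arrays.deepToString(m) + " accuracy=" + accuracy(m));
    }
}
```

Off-diagonal entries are the misclassifications: an entry in row $a$ and column $p$ with $a \neq p$ counts instances of class $a$ that were predicted as class $p$.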

\subsection{The Result List}

After training several classifiers, the result list will contain several
entries. Left-clicking the entries flicks back and forth between the various
results that have been generated. Pressing \texttt{Delete} removes a selected
entry from the results. Right-clicking an entry invokes a menu containing these items:

\begin{enumerate}
\item \textbf{View in main window}.
Shows the output in the main window (just like left-clicking the entry).
\item \textbf{View in separate window}.
Opens a new independent window for viewing the results.
\item \textbf{Save result buffer}.
Brings up a dialog allowing you to save a text file containing the textual
output.
\item \textbf{Load model}.
Loads a pre-trained model object from a binary file.
\item \textbf{Save model}.
Saves a model object to a binary file. Objects are saved in Java `serialized
object' form.
\item \textbf{Re-evaluate model on current test set}.
Takes the model that has been built and tests its performance on the data set
that has been specified with the \textbf{Set...} button under the
\textbf{Supplied test set} option.
\item \textbf{Visualize classifier errors}.
Brings up a visualization window that plots the results of classification.
Correctly classified instances are represented by crosses, whereas incorrectly
classified ones show up as squares.
\item \textbf{Visualize tree} or \textbf{Visualize graph}. Brings up
a graphical representation of the structure of the classifier model,
if possible (i.e., for decision trees or Bayesian networks). The
graph visualization option only appears if a Bayesian network
classifier has been built. In the tree visualizer, you can bring up a
menu by right-clicking a blank area, pan around by dragging the mouse,
and see the training instances at each node by clicking on
it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms
the view in. The graph visualizer should be self-explanatory.
\item \textbf{Visualize margin curve}.
Generates a plot illustrating the prediction margin. The margin is defined as
the difference between the probability predicted for the actual class and the
highest probability predicted for the other classes. For example, boosting
algorithms may achieve better performance on test data by increasing the
margins on the training data.
\item \textbf{Visualize threshold curve}.
Generates a plot illustrating the trade-offs in prediction that are obtained by
varying the threshold value between classes. For example, with the default
threshold value of 0.5, the predicted probability of `positive' must be greater
than 0.5 for the instance to be predicted as `positive'. The plot can be used
to visualize the precision/recall trade-off, for ROC curve analysis (true
positive rate {\em vs} false positive rate), and for other types of curves.
\item \textbf{Visualize cost curve}.
Generates a plot that gives an explicit representation of the expected cost, as
described by \cite{drummond}.
\item \textbf{Plugins}.
This menu item only appears if there are visualization plugins
available (by default: none). More about these plugins can be found in the
\textit{WekaWiki} article ``Explorer visualization plugins'' \cite{explorervisplugins}.
\end{enumerate}
\noindent
Options are greyed out if they do not apply to the specific set of results.
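The quantities behind the margin curve and the threshold curve follow directly from the definitions in the list above. The sketch below computes the margin for one instance and a single point of an ROC curve for one threshold; it illustrates the definitions only and is not how WEKA generates its plots. The class \texttt{PredictionCurves} and its method names are hypothetical.

```java
final class PredictionCurves {
    /** Probability of the actual class minus the highest probability among the other classes. */
    static double margin(double[] classProbabilities, int actualClass) {
        double bestOther = 0.0; // probabilities are never negative
        for (int c = 0; c < classProbabilities.length; c++) {
            if (c != actualClass) {
                bestOther = Math.max(bestOther, classProbabilities[c]);
            }
        }
        return classProbabilities[actualClass] - bestOther;
    }

    /**
     * Returns {true positive rate, false positive rate} when an instance is predicted
     * `positive' only if its predicted probability is greater than the threshold.
     */
    static double[] rocPoint(double[] positiveProbability, boolean[] isPositive,
                             double threshold) {
        int tp = 0, fp = 0, positives = 0, negatives = 0;
        for (int i = 0; i < positiveProbability.length; i++) {
            boolean predictedPositive = positiveProbability[i] > threshold;
            if (isPositive[i]) {
                positives++;
                if (predictedPositive) tp++;
            } else {
                negatives++;
                if (predictedPositive) fp++;
            }
        }
        double tpr = positives == 0 ? 0.0 : (double) tp / positives;
        double fpr = negatives == 0 ? 0.0 : (double) fp / negatives;
        return new double[] {tpr, fpr};
    }

    public static void main(String[] args) {
        System.out.println(margin(new double[] {0.7, 0.2, 0.1}, 0));
        double[] point = rocPoint(new double[] {0.9, 0.6, 0.4, 0.2},
                new boolean[] {true, false, true, false}, 0.5);
        System.out.println("TPR=" + point[0] + " FPR=" + point[1]);
    }
}
```

A negative margin means the instance is misclassified, and evaluating \texttt{rocPoint} for a range of thresholds yields the points of a threshold curve.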

\newpage

\section{Clustering}

\begin{center}
\epsfig{file=images/explorer_cluster.eps,height=7cm}
\end{center}

\subsection{Selecting a Clusterer}

By now you will be familiar with the process of selecting and configuring
objects. Clicking on the clustering scheme listed in the \textbf{Clusterer}
box at the top of the window brings up a GenericObjectEditor dialog with which
to choose a new clustering scheme.

\subsection{Cluster Modes}

660

661

The \textbf{Cluster mode} box is used to choose what to cluster and how to

662

evaluate the results. The first three options are the same as for

663

classification: \textbf{Use training set}, \textbf{Supplied test set} and

664

\textbf{Percentage split} (Section~\ref{sec:classifier})---except that now the

665

data is assigned to clusters instead of trying to predict a specific class.

666

The fourth mode, \textbf{Classes to clusters evaluation}, compares how well the

667

chosen clusters match up with a pre-assigned class in the data. The drop-down

668

box below this option selects the class, just as in the \textbf{Classify}

669

panel.
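
\noindent
The idea behind comparing clusters with a pre-assigned class can be sketched in a few lines of Python. This is a simplified illustration, not WEKA's implementation: it assumes that each cluster is labelled with its most frequent class, which may differ from the mapping WEKA actually uses, and the function name \texttt{classes\_to\_clusters} is hypothetical.

```python
from collections import Counter


def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and return
    (mapping, fraction of instances whose class disagrees with it).

    Assumption: majority-class labelling; WEKA's own procedure may differ.
    """
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
    errors = sum(1 for c, y in zip(cluster_ids, class_labels)
                 if mapping[c] != y)
    return mapping, errors / len(class_labels)


# Toy assignment, made up for illustration: one "b" instance lands in cluster 0.
mapping, error_rate = classes_to_clusters(
    [0, 0, 0, 1, 1, 1],
    ["a", "a", "b", "b", "b", "b"],
)
```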

An additional option in the \textbf{Cluster mode} box, the \textbf{Store
clusters for visualization} tick box, determines whether or not it will be
possible to visualize the clusters once training is complete. When dealing
with datasets that are so large that memory becomes a problem, it may be
helpful to disable this option.

\subsection{Ignoring Attributes}

Often, some attributes in the data should be ignored when clustering. The
\textbf{Ignore attributes} button brings up a small window that allows you to
select which attributes are ignored. Clicking on an attribute in the window
highlights it, holding down the SHIFT key selects a range of consecutive
attributes, and holding down CTRL toggles individual attributes on and off. To
cancel the selection, back out with the \textbf{Cancel} button. To activate it,
click the \textbf{Select} button. The next time clustering is invoked, the
selected attributes are ignored.

\subsection{Working with Filters}

The \texttt{FilteredClusterer} meta-clusterer offers the user the possibility
to apply filters directly before the clusterer is learned. This approach
eliminates the manual application of a filter in the \textbf{Preprocess} panel,
since the data gets processed on the fly. This is useful if one needs to try
out different filter setups.

\subsection{Learning Clusters}

The \textbf{Cluster} section, like the \textbf{Classify} section, has
\textbf{Start}/\textbf{Stop} buttons, a result text area and a result
list. These all behave just like their classification counterparts.
Right-clicking an entry in the result list brings up a similar menu,
except that it shows only two visualization options: \textbf{Visualize
cluster assignments} and \textbf{Visualize tree}. The latter is greyed
out when it is not applicable.

\newpage

\section{Associating}

\begin{center}
\epsfig{file=images/explorer_associate.eps,height=7cm}
\end{center}

\subsection{Setting Up}

This panel contains schemes for learning association rules, and the
learners are chosen and configured in the same way as the clusterers,
filters, and classifiers in the other panels.

\subsection{Learning Associations}

Once appropriate parameters for the association rule learner have been
set, click the \textbf{Start} button. When complete, right-clicking
on an entry in the result list allows the results to be viewed or
saved.

\newpage

\section{Selecting Attributes}

\begin{center}
\epsfig{file=images/explorer_selectattributes.eps,height=7cm}
\end{center}

\subsection{Searching and Evaluating}

Attribute selection involves searching through all possible combinations of
attributes in the data to find which subset of attributes works best for
prediction. To do this, two objects must be set up: an attribute evaluator and
a search method. The evaluator determines what method is used to assign a
worth to each subset of attributes. The search method determines what style of
search is performed.
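
\noindent
The division of labour between evaluator and search method can be illustrated with a short Python sketch. It is not WEKA code: the greedy forward search, the function names and the toy evaluator (a made-up relevance score per attribute minus a fixed cost per selected attribute) are all assumptions chosen to keep the example small.

```python
def forward_selection(n_attributes, evaluate):
    """Search method: greedy forward search. Starting from the empty
    subset, repeatedly add the single attribute that most improves the
    worth reported by `evaluate`; stop when no addition improves it."""
    selected = []
    best = evaluate(selected)
    improved = True
    while improved:
        improved = False
        best_attr = None
        for attr in range(n_attributes):
            if attr in selected:
                continue
            worth = evaluate(selected + [attr])
            if worth > best:
                best, best_attr, improved = worth, attr, True
        if improved:
            selected.append(best_attr)
    return selected, best


# Evaluator: assigns a worth to a subset of attribute indices.
# The relevance values and the per-attribute cost are invented.
RELEVANCE = [5, 0, 3, 1]
COST_PER_ATTRIBUTE = 2


def toy_evaluator(subset):
    return sum(RELEVANCE[a] for a in subset) - COST_PER_ATTRIBUTE * len(subset)


subset, worth = forward_selection(len(RELEVANCE), toy_evaluator)
```

Swapping in a different evaluator or a different search strategy changes the result without touching the other component, which is why the two are configured separately.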

\subsection{Options}

The \textbf{Attribute Selection Mode} box has two options:

\begin{enumerate}
\item \textbf{Use full training set}.
The worth of the attribute subset is determined using the full set of training
data.
\item \textbf{Cross-validation}.
The worth of the attribute subset is determined by a process of
cross-validation. The \textbf{Fold} and \textbf{Seed} fields set the number of
folds to use and the random seed used when shuffling the data.
\end{enumerate}

\noindent
As with \textbf{Classify} (Section~\ref{sec:classifier}), there is a drop-down
box that can be used to specify which attribute to treat as the class.
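
\noindent
How a fold count and a random seed interact in cross-validation can be sketched as follows. This is a generic Python illustration with a hypothetical function name; WEKA's own shuffling and fold construction may differ in detail.

```python
import random


def cross_validation_folds(n_instances, folds, seed):
    """Shuffle the instance indices with the given seed, then deal them
    into `folds` disjoint groups. Each group serves once as the held-out
    part while the remaining groups are used for training."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::folds] for i in range(folds)]


# The same seed always yields the same partition.
folds_a = cross_validation_folds(10, 3, seed=1)
folds_b = cross_validation_folds(10, 3, seed=1)
```

Fixing the seed makes a run repeatable, while changing it gives a different partition of the same data.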

\subsection{Performing Selection}

Clicking \textbf{Start} starts running the attribute selection
process. When it is finished, the results are output into the result
area, and an entry is added to the result list. Right-clicking on the
result list gives several options. The first three (\textbf{View in
main window}, \textbf{View in separate window} and \textbf{Save result
buffer}) are the same as for the classify panel. It is also possible
to \textbf{Visualize reduced data}, or if you have used an attribute
transformer such as PrincipalComponents, \textbf{Visualize transformed
data}. The reduced/transformed data can be saved to a file with the
\textbf{Save reduced data...} or \textbf{Save transformed data...}
option.

In case one wants to reduce/transform a training and a test set at the same
time and not use the AttributeSelectedClassifier from the classifier
panel, it is best to use the AttributeSelection filter (a supervised
attribute filter) in batch mode (\texttt{-b})
from the command line or in the SimpleCLI. The batch mode allows one to
specify an additional input and output file pair (options \texttt{-r}
and \texttt{-s}) that is processed with the filter setup that
was determined based on the training data (specified by options
\texttt{-i} and \texttt{-o}).

\noindent Here is an example for a Unix/Linux bash:
\begin{verbatim}
java weka.filters.supervised.attribute.AttributeSelection \
  -E "weka.attributeSelection.CfsSubsetEval " \
  -S "weka.attributeSelection.BestFirst -D 1 -N 5" \
  -b \
  -i <input1.arff> \
  -o <output1.arff> \
  -r <input2.arff> \
  -s <output2.arff>
\end{verbatim}

\noindent \textbf{Notes:}
\begin{itemize}
\item The backslashes at the end of each line tell bash that
the command continues on the next line. In the SimpleCLI, the
command has to be entered on one line without the backslashes.

\item It is assumed that WEKA is available in the \texttt{CLASSPATH};
otherwise one has to use the \texttt{-classpath} option.

\item The full filter setup is output in the log, as well as
the setup for running regular attribute selection.
\end{itemize}

\newpage

\section{Visualizing}

\begin{center}
\epsfig{file=images/explorer_visualization.eps,height=7cm}
\end{center}

WEKA's visualization section allows you to visualize 2D plots of the
current relation.

\subsection{The scatter plot matrix}

When you select the {\em Visualize} panel, it shows a scatter plot
matrix for all the attributes, colour coded according to the currently
selected class. It is possible to change the size of each individual
2D plot and the point size, and to randomly jitter the data (to
uncover obscured points). It is also possible to change the attribute
used to colour the plots, to select only a subset of attributes for
inclusion in the scatter plot matrix, and to subsample the data. Note
that changes will only come into effect once the \textbf{Update}
button has been pressed.

\subsection{Selecting an individual 2D scatter plot}

When you click on a cell in the scatter plot matrix, this will bring
up a separate window with a visualization of the scatter plot you
selected. (We described above how to visualize particular results in
a separate window, for example classifier errors; the same
visualization controls are used here.)

Data points are plotted in the main area of the window. At the top
are two drop-down list buttons for selecting the axes to plot. The
one on the left shows which attribute is used for the x-axis; the one
on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the
colour scheme. This allows you to colour the points based on the
attribute selected. Below the plot area, a legend describes what
values the colours correspond to. If the values are discrete, you can
modify the colour used for each one by clicking on them and making an
appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each
strip represents an attribute, and the dots within it show the
distribution of values of the attribute. These values are randomly
scattered vertically to help you see concentrations of points. You
can choose what axes are used in the main graph by clicking on these
strips. Left-clicking an attribute strip changes the x-axis to that
attribute, whereas right-clicking changes the y-axis. The `X' and `Y'
written beside the strips show what the current axes are (`B' is used
for `both X and Y').

Above the attribute strips is a slider labelled \textbf{Jitter}, which
controls a random displacement applied to all points in the plot. Dragging it
to the right increases the amount of jitter, which is useful for
spotting concentrations of points. Without jitter, a million instances
at the same point would look no different from a single lonely
instance.
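
\noindent
The effect of jitter can be reproduced with a small Python sketch. It assumes a uniform random displacement in both directions, which is an assumption made for illustration; the distribution WEKA actually uses is not stated here, and the function name \texttt{jitter} is hypothetical.

```python
import random


def jitter(points, amount, seed=0):
    """Displace every (x, y) point by a random offset of at most
    `amount` in each direction (uniform displacement is an assumption)."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount))
            for x, y in points]


# Five instances that coincide exactly in the plot.
stacked = [(2.0, 3.0)] * 5
without_jitter = jitter(stacked, 0.0)
with_jitter = jitter(stacked, 0.1)
```

With an amount of zero the five instances still occupy a single position, whereas a small positive amount spreads them out so that the concentration becomes visible.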

\subsection{Selecting Instances}

There may be situations where it is helpful to select a subset of the
data using the visualization tool. (A special case of this is the
UserClassifier in the {\em Classify} panel, which lets you build your
own classifier by interactively selecting instances.)

Below the y-axis selector button is a drop-down list button for choosing a
selection method. A group of data points can be selected in four ways:

\begin{enumerate}
\item \textbf{Select Instance}.
Clicking on an individual data point brings up a window listing its attributes.
If more than one point appears at the same location, more than one set of
attributes is shown.
\item \textbf{Rectangle}.
You can create a rectangle, by dragging, that selects the points inside it.
\item \textbf{Polygon}.
You can build a free-form polygon that selects the points inside it. Left-click
to add vertices to the polygon, right-click to complete it. The polygon will
always be closed off by connecting the first point to the last.
\item \textbf{Polyline}.
You can build a polyline that distinguishes the points on one side from those
on the other. Left-click to add vertices to the polyline, right-click to
finish. The resulting shape is open (as opposed to a polygon, which is
always closed).
\end{enumerate}

Once an area of the plot has been selected using \textbf{Rectangle},
\textbf{Polygon} or \textbf{Polyline}, it turns grey. At this point, clicking
the \textbf{Submit} button removes all instances from the plot except those
within the grey selection area. Clicking on the \textbf{Clear} button erases
the selected area without affecting the graph.
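
\noindent
The rectangle and polygon selections, followed by \textbf{Submit}, amount to keeping only the points that fall inside a shape. The Python sketch below illustrates this with a standard ray-casting test for the polygon case. The function names are hypothetical and the code is not taken from WEKA.

```python
def in_rectangle(point, corner_a, corner_b):
    """True if the point lies inside the rectangle spanned by two
    opposite corners (the two ends of the drag)."""
    (x, y), (x1, y1), (x2, y2) = point, corner_a, corner_b
    return (min(x1, x2) <= x <= max(x1, x2)
            and min(y1, y2) <= y <= max(y1, y2))


def in_polygon(point, vertices):
    """Ray-casting test. The polygon is closed by joining the last
    vertex back to the first, as described for the Polygon mode."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def submit(points, keep):
    """Keep only the points for which the selection test holds."""
    return [p for p in points if keep(p)]


# Toy points, made up for illustration.
points = [(1, 1), (3, 3), (5, 1)]
by_rectangle = submit(points, lambda p: in_rectangle(p, (0, 0), (2, 2)))
by_polygon = submit(points, lambda p: in_polygon(p, [(0, 0), (4, 0), (0, 4)]))
```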

Once any points have been removed from the graph, the \textbf{Submit} button
changes to a \textbf{Reset} button. This button undoes all previous removals
and returns you to the original graph with all points included. Finally,
clicking the \textbf{Save} button allows you to save the currently visible
instances to a new ARFF file.

\newpage

\begin{thebibliography}{999}
\bibitem{drummond} Drummond, C. and Holte, R. (2000) Explicitly representing expected cost: An alternative to ROC representation. \textit{Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.}
Publishers, San Mateo, CA.
\bibitem{witten} Witten, I.H. and Frank, E. (2005) \textit{Data Mining: Practical machine
learning tools and techniques. 2nd edition} Morgan Kaufmann, San
Francisco.
\bibitem{wekawiki} \textit{WekaWiki} -- \texttt{http://weka.sourceforge.net/wiki/}
\bibitem{wekadoc} \textit{WekaDoc} -- \texttt{http://weka.sourceforge.net/wekadoc/}
\bibitem{ensemble} \textit{Ensemble Selection} on \textit{WekaDoc} -- \\ \small{\texttt{http://weka.sourceforge.net/wekadoc/index.php/en:Ensemble\_Selection}}
\bibitem{mainextensions} \textit{Extensions for Weka's main GUI} on \textit{WekaWiki} -- \\
\small{\texttt{http://weka.sourceforge.net/wiki/index.php/Extensions\_for\_Weka\%27s\_main\_GUI}}
\bibitem{explorertabs} \textit{Adding tabs in the Explorer} on \textit{WekaWiki} -- \\
\small{\texttt{http://weka.sourceforge.net/wiki/index.php/Adding\_tabs\_in\_the\_Explorer}}
\bibitem{explorervisplugins} \textit{Explorer visualization plugins} on \textit{WekaWiki} -- \\
\small{\texttt{http://weka.sourceforge.net/wiki/index.php/Explorer\_visualization\_plugins}}
\end{thebibliography}

\end{document}