Experimenter - Standard Experiments - Simple (3.5.4)

From WekaDoc

New experiment

After clicking New, default parameters for an experiment are defined.

Results destination

By default, an ARFF file is the destination for the results output. You can choose between the following destinations:

  • ARFF file
  • CSV file
  • JDBC database

ARFF file and JDBC database are discussed in detail in the following sections. CSV is similar to ARFF, but it can also be loaded into an external spreadsheet application.
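
CSV results can also be processed with a script instead of a spreadsheet. The following is a minimal sketch that averages one result column per dataset and scheme using Python's standard csv module. The column names (Key_Dataset, Key_Scheme, Percent_correct) and the sample rows are assumptions made up for illustration; check the header line of your own results file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a results file; the values are invented and
# a real file contains many more columns and rows.
SAMPLE_CSV = """Key_Dataset,Key_Run,Key_Fold,Key_Scheme,Percent_correct
iris,1,1,weka.classifiers.rules.ZeroR,33.3
iris,1,2,weka.classifiers.rules.ZeroR,33.3
iris,1,1,weka.classifiers.trees.J48,93.3
iris,1,2,weka.classifiers.trees.J48,100.0
"""

def mean_by_scheme(csv_file, column="Percent_correct"):
    """Average `column` for every (dataset, scheme) pair in the file."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (row["Key_Dataset"], row["Key_Scheme"])
        values[key].append(float(row[column]))
    return {key: sum(v) / len(v) for key, v in values.items()}

means = mean_by_scheme(io.StringIO(SAMPLE_CSV))
for (dataset, scheme), mean in sorted(means.items()):
    print(f"{dataset:10s} {scheme:35s} {mean:6.2f}")
```

To use it on a real file, pass an open file handle (opened with newline="") instead of the io.StringIO object.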

ARFF file

If the file name is left empty, a temporary file will be created in the TEMP directory of the system. To specify an explicit results file, click on Browse and choose a filename, e.g., Experiment1.arff.

Click on Save and the name will appear in the edit field next to ARFF file.

The advantage of ARFF or CSV files is that they can be created without any additional classes besides the ones from Weka. The drawback is that an interrupted experiment cannot be resumed, e.g., after an error or after adding datasets or algorithms. Especially with time-consuming experiments, this behavior can be annoying.

JDBC database

With JDBC it is easy to store the results in a database. The necessary jar archives have to be in the CLASSPATH to make the JDBC functionality of a particular database available.

After changing ARFF file to JDBC database click on User... to specify JDBC URL and user credentials for accessing the database.

After supplying the necessary data and clicking on OK, the URL in the main window will be updated.

Note: at this point, the database connection is not tested; this is done when the experiment is started.

The advantage of a JDBC database is the possibility to resume an interrupted or extended experiment. Instead of re-running all the other algorithm/dataset combinations again, only the missing ones are computed.
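
The idea of computing only the missing combinations can be pictured as a set difference. The sketch below is purely conceptual and does not reproduce Weka's actual bookkeeping in the database; the function name, the dataset name "glass", and the list of already stored results are hypothetical.

```python
from itertools import product

def missing_combinations(datasets, algorithms, stored):
    """Return the (dataset, algorithm) pairs that have no stored result yet."""
    done = set(stored)
    return [pair for pair in product(datasets, algorithms) if pair not in done]

datasets = ["iris", "glass"]      # "glass" is a hypothetical second dataset
algorithms = ["ZeroR", "J48"]

# Pretend an earlier run was interrupted after three combinations.
already_stored = [("iris", "ZeroR"), ("iris", "J48"), ("glass", "ZeroR")]

todo = missing_combinations(datasets, algorithms, already_stored)
print(todo)
```

Extending the experiment with a new algorithm or dataset works the same way in this picture: the new pairs are simply not in the stored set yet.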

Experiment type

The user can choose between the following three types:

  • Cross-validation (default)
    performs stratified cross-validation with the given number of folds
  • Train/Test Percentage Split (data randomized)
    splits a dataset according to the percentage into a train and a test file (one cannot specify explicit training and test files in the Experimenter), after the order of the data has been randomized and stratified
  • Train/Test Percentage Split (order preserved)
    because it is impossible to specify an explicit train/test file pair, one can abuse this type to un-merge previously merged train and test files into the two original files (one only needs to find out the correct percentage)
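
Finding the correct percentage for the order-preserved split is simple arithmetic: it is the share of the merged file taken up by the original training part. The sketch below computes it and checks that the split reproduces the original sizes. The instance counts are hypothetical, and the rounding rule in split_sizes is an assumption about how the split size is derived, not something taken from the Weka sources.

```python
def unmerge_percentage(n_train, n_test):
    """Percentage of the merged file occupied by the original training part."""
    return 100.0 * n_train / (n_train + n_test)

def split_sizes(n_total, percentage):
    """Train/test sizes for a given percentage (rounding is an assumption)."""
    n_train = round(n_total * percentage / 100.0)
    return n_train, n_total - n_train

# Hypothetical example: 100 training and 50 test instances were merged.
pct = unmerge_percentage(n_train=100, n_test=50)
print(f"train percentage: {pct:.4f}")
print(split_sizes(150, pct))
```

If the percentage field only accepts a limited number of decimals, it is worth re-checking the resulting sizes with the rounded value.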

Additionally, one can choose between Classification and Regression, depending on the datasets and classifiers one uses. For decision trees like J48 (Weka's implementation of Quinlan's C4.5) and the iris dataset, Classification is necessary; for a numeric classifier like M5P, on the other hand, Regression is required. Classification is selected by default.

Note: if the percentage splits are used, one has to make sure that the corrected paired T-Tester still produces sensible results with the given ratio (Y. Bengio, C. Nadeau: Inference for the Generalization Error, 1999. (http://econpapers.repec.org/paper/circirwor/99s-25.htm)).
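
To see why the split ratio matters, it helps to look at the corrected resampled t statistic as it is commonly written following Nadeau and Bengio: the variance of the per-run differences is scaled by 1/k plus the test/train size ratio. The sketch below is an illustration of that formula under this assumption, not a copy of Weka's tester; the differences are invented numbers.

```python
import math
import statistics

def corrected_t(differences, test_train_ratio):
    """Corrected resampled t statistic for paired per-run differences.

    test_train_ratio is (number of test instances) / (number of training
    instances), e.g. 34/66 for a 66% split.
    """
    k = len(differences)
    mean = statistics.mean(differences)
    var = statistics.variance(differences)  # sample variance (k - 1 in the denominator)
    return mean / math.sqrt((1.0 / k + test_train_ratio) * var)

# Hypothetical accuracy differences between two schemes over five runs.
diffs = [2.0, 1.0, 3.0, 2.0, 2.0]

for train_pct in (50, 66, 90):
    ratio = (100 - train_pct) / train_pct
    print(train_pct, corrected_t(diffs, ratio))
```

With the same differences, a larger test share inflates the variance term and lowers the statistic, which is why the chosen percentage affects how easily a difference is reported as significant.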

Datasets

One can add dataset files either with an absolute path or with a relative one. The latter often makes it easier to run experiments on different machines, hence one should check Use relative paths before clicking on Add new....

In this example, open the data directory and choose the iris.arff dataset.

After clicking Open, the file will be displayed in the datasets list. If one selects a directory and hits Open, then all ARFF files will be added recursively. Files can be deleted from the list by selecting them and then clicking on Delete selected.
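
To preview which files a directory selection would pick up, the recursive search can be imitated with a few lines of Python. This is only a sketch of the idea: the function name is hypothetical, the case-insensitive match on the .arff extension is an assumption, and the directory tree is created on the fly so the example runs anywhere.

```python
import os
import tempfile

def find_arff_files(directory, relative_to=None):
    """Collect .arff files below `directory`, optionally as relative paths."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith(".arff"):  # case-insensitive match is an assumption
                path = os.path.join(root, name)
                if relative_to is not None:
                    path = os.path.relpath(path, relative_to)
                found.append(path)
    return sorted(found)

# Build a throwaway directory tree with two ARFF files and one other file.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "data", "nested"))
    for rel in ("data/iris.arff", "data/nested/glass.arff", "data/notes.txt"):
        open(os.path.join(base, *rel.split("/")), "w").close()
    for path in find_arff_files(os.path.join(base, "data"), relative_to=base):
        print(path)
```

Passing relative_to mirrors the effect of relative paths: the listed names stay valid when the whole directory is copied to another machine.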

ARFF is not the only format one can load; any file that can be converted with Weka's "core converters" can be used. The following formats are currently supported:

  • ARFF (+ compressed)
  • C4.5
  • CSV
  • libsvm
  • binary serialized instances
  • XRFF (+ compressed)

By default, the class attribute is assumed to be the last attribute. But if a data format contains information about the class attribute, like XRFF or C4.5, this attribute will be used instead.

Iteration control

  • Number of repetitions
    In order to get statistically meaningful results, the default number of repetitions is 10. With 10-fold cross-validation, this means that each classifier is trained on training data and evaluated against test data 100 times.
  • Data sets first/Algorithms first
    As soon as one has more than one dataset and algorithm, it can be useful to switch from datasets being iterated over first to algorithms. This is the case if one stores the results in a database and wants to complete the results for all the datasets for one algorithm as early as possible.
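
The two settings above can be pictured with a small sketch: one function counts the train/test evaluations, the other lists dataset/algorithm pairs for the two possible loop nestings. The function names and the dataset name "glass" are hypothetical, and the sketch deliberately does not claim which GUI option corresponds to which nesting.

```python
def evaluation_count(repetitions, folds, n_datasets=1, n_algorithms=1):
    """Number of train/test evaluations in a cross-validation experiment."""
    return repetitions * folds * n_datasets * n_algorithms

def visiting_order(datasets, algorithms, outer="datasets"):
    """(dataset, algorithm) pairs in visiting order for the chosen outer loop."""
    if outer == "datasets":
        return [(d, a) for d in datasets for a in algorithms]
    return [(d, a) for a in algorithms for d in datasets]

# 10 repetitions of 10-fold cross-validation for a single classifier.
print(evaluation_count(10, 10))

# The same setup with one dataset and two algorithms.
print(evaluation_count(10, 10, n_datasets=1, n_algorithms=2))

# With algorithms in the outer loop, all datasets are finished for one
# algorithm before the next algorithm starts.
print(visiting_order(["iris", "glass"], ["ZeroR", "J48"], outer="algorithms"))
```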

Algorithms

New algorithms can be added via the Add new... button. The first time this dialog is opened, ZeroR is presented; otherwise, the classifier that was selected last is shown.

With the Choose button one can open the GenericObjectEditor and choose another classifier.

The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities will get cleared and the highlighting removed again.

Additional algorithms can be added with the Add new... button, e.g., the J48 decision tree.

After setting the classifier parameters, one clicks on OK to add it to the list of algorithms.

With the Load options... and Save options... buttons one can load and save the setup of a selected classifier from and to XML. This is especially useful for highly configured classifiers (e.g., nested meta-classifiers), where the manual setup takes quite some time, and which are used often.

One can also paste classifier settings here by right-clicking (or Alt-Shift-left-clicking) and selecting the appropriate menu item from the popup menu, to either add a new classifier or replace the selected one with a new setup. This is useful for transferring a classifier setup from the Weka Explorer to the Experimenter without having to set up the classifier from scratch.

Saving the setup

For future re-use, one can save the current setup of the experiment to a file by clicking on Save... at the top of the window.

By default, the format of the experiment files is the binary format that Java serialization offers. The drawback of this format is the possible incompatibility between different versions of Weka. A more robust alternative to the binary format is the XML format.

Previously saved experiments can be loaded again via the Open... button.

Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 runs of 10-fold stratified cross-validation on the Iris dataset using the ZeroR and J48 schemes.

Click Start to run the experiment.

If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.arff.