en:Explorer - Preprocessing (3.5.6)
Opening files
The first four buttons at the top of the preprocess section enable you to load data into WEKA:
- Open file.... Brings up a dialog box allowing you to browse for the data file on the local filesystem.
- Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
- Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
- Generate.... Enables you to generate artificial data from a variety of DataGenerators.
Using the Open file... button you can read files in a variety of formats: Weka's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
NB: This list of formats can be extended by adding custom file converters to the weka.core.converters package.
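The extension-to-format pairing described above can be pictured as a simple lookup. The following is a minimal sketch in Python, not WEKA's own converter mechanism; the function name `guess_format` and the format labels are hypothetical, and only the extensions come from the text.

```python
from pathlib import Path

# Extensions as listed in the text above; the labels are illustrative only.
FORMAT_BY_EXTENSION = {
    ".arff": "ARFF",
    ".csv": "CSV",
    ".data": "C4.5",
    ".names": "C4.5",
    ".bsi": "serialized Instances",
}


def guess_format(filename):
    """Return a format label for a file name, or None if it is not recognised."""
    return FORMAT_BY_EXTENSION.get(Path(filename).suffix.lower())


print(guess_format("iris.arff"))
print(guess_format("notes.txt"))
```

An unrecognised extension returns `None` here, which loosely corresponds to a file that would need a custom converter.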
The Current Relation
Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the "current relation" is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:
- Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
- Instances. The number of instances (data points/records) in the data.
- Attributes. The number of attributes (features) in the data.
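To make the three entries concrete, here is a minimal Python sketch that derives them from a small piece of ARFF text. It is an illustration under simplifying assumptions (dense data rows, one instance per line, no quoted line breaks), not WEKA's loader; `summarize_arff` and the sample data are hypothetical.

```python
def summarize_arff(text):
    """Return (relation name, instance count, attribute count) for ARFF text."""
    relation = None
    attributes = 0
    instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            instances += 1                     # every data row is one instance
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            attributes += 1
        elif lower.startswith("@data"):
            in_data = True
    return relation, instances, attributes


SAMPLE = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

print(summarize_arff(SAMPLE))  # ('weather', 3, 3)
```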
Working With Attributes
Below the Current relation box is a box titled Attributes. There are four buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:
- No.. A number that identifies the attribute, following the order in which the attributes are specified in the data file.
- Selection tick boxes. These allow you to select which attributes are present in the relation.
- Name. The name of the attribute, as it was declared in the data file.
When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:
- Name. The name of the attribute, the same as that given in the attribute list.
- Type. The type of attribute, most commonly Nominal or Numeric.
- Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).
- Distinct. The number of different values that the data contains for this attribute.
- Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.
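The Missing, Distinct and Unique figures can be reproduced for a single column with a few lines of Python. This is a sketch of the definitions given above, not WEKA's code; it assumes that missing values are represented by `None` and that percentages are taken over all instances, and the name `attribute_summary` is hypothetical.

```python
from collections import Counter


def attribute_summary(values):
    """Summarise one attribute column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    n_missing = total - len(present)
    # A value is "unique" when exactly one instance carries it.
    n_unique = sum(1 for c in counts.values() if c == 1)
    return {
        "missing": n_missing,
        "missing_pct": 100.0 * n_missing / total if total else 0.0,
        "distinct": len(counts),
        "unique": n_unique,
        "unique_pct": 100.0 * n_unique / total if total else 0.0,
    }


column = ["sunny", "rainy", None, "sunny", "overcast"]
summary = attribute_summary(column)
print(summary)
```

For this column there is one missing value (20%), three distinct values, and two unique values ("rainy" and "overcast").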
Below these statistics is a list showing more information about the values stored in this attribute, which differs depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean and standard deviation. Below these statistics there is a colored histogram, color-coded according to the attribute chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available selections when clicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.
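For a numeric attribute, the four statistics can be computed as in the following Python sketch. Whether the displayed standard deviation is the sample or the population form is not stated here; the sketch assumes the sample form (dividing by n - 1), and `numeric_summary` is a hypothetical name.

```python
import statistics


def numeric_summary(values):
    """Return min, max, mean and standard deviation, ignoring None (missing)."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        # Sample standard deviation (n - 1); this choice is an assumption.
        # statistics.stdev needs at least two values.
        "stddev": statistics.stdev(present),
    }


temperatures = [85, 80, 83, 70, 68]
stats = numeric_summary(temperatures)
print(stats)
```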
Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The four buttons above can also be used to change the selection:
- All. All boxes are ticked.
- None. All boxes are cleared (unticked).
- Invert. Boxes that are ticked become unticked and vice versa.
- Pattern. Enables the user to select attributes based on a Perl 5 Regular Expression. E.g., .*_id selects all attributes whose name ends with _id.
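The effect of the Pattern button can be approximated in Python as follows. This sketch assumes the expression must match the whole attribute name, and it uses Python's `re` module rather than a Perl 5 engine (the two agree for simple patterns such as this one); `select_by_pattern` and the attribute names are hypothetical.

```python
import re


def select_by_pattern(attribute_names, pattern):
    """Return the attribute names that the pattern matches in full."""
    regex = re.compile(pattern)
    return [name for name in attribute_names if regex.fullmatch(name)]


names = ["customer_id", "age", "order_id", "id_card"]
print(select_by_pattern(names, r".*_id"))  # ['customer_id', 'order_id']
```

Note that id_card is not selected, because its name contains but does not end with _id.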
Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Edit button in the top-right corner of the Preprocess panel.
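The selection buttons, Remove and Undo behave like operations on a set of ticked names plus a history of earlier states. The Python sketch below models that behavior only; it is not how the Explorer is implemented, it assumes attribute names are unique, and the class `AttributeSelection` is hypothetical.

```python
class AttributeSelection:
    """Toy model of the attribute list with tick boxes, Remove and Undo."""

    def __init__(self, names):
        self.names = list(names)
        self.ticked = set()      # to begin with, no box is ticked
        self._history = []       # earlier attribute lists, for Undo

    def select_all(self):
        self.ticked = set(self.names)

    def select_none(self):
        self.ticked = set()

    def invert(self):
        self.ticked = set(self.names) - self.ticked

    def remove(self):
        """Drop the ticked attributes, remembering the previous state."""
        self._history.append(list(self.names))
        self.names = [n for n in self.names if n not in self.ticked]
        self.ticked = set()

    def undo(self):
        """Restore the attribute list from before the last Remove."""
        if self._history:
            self.names = self._history.pop()


sel = AttributeSelection(["outlook", "temperature", "play"])
sel.ticked = {"outlook"}
sel.invert()   # now temperature and play are ticked
sel.remove()   # only outlook remains
print(sel.names)
sel.undo()     # all three attributes are back
print(sel.names)
```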
Working With Filters
The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in Weka. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose either to display the properties in a GenericObjectEditor dialog box or to copy the current setup string to the clipboard.
The GenericObjectEditor Dialog Box
The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options.
Right-clicking (or Alt+Shift+Left-Click) on such a field will bring up a popup menu, listing the following options:
- Show properties... has the same effect as left-clicking on the field, i.e., a dialog appears allowing you to alter the settings.
- Copy configuration to clipboard copies the currently displayed configuration string to the system’s clipboard, so that it can be used anywhere else in WEKA or in the console. This is rather handy if you have to set up complicated, nested schemes.
- Enter configuration... is the receiving end for configurations that got copied to the clipboard earlier on. In this dialog you can enter a classname followed by options (if the class supports these). This also allows you to transfer a filter setting from the Preprocess panel to a FilteredClassifier used in the Classify panel.
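A configuration string has the shape "classname followed by options", and nested schemes appear as quoted option values. The Python sketch below splits such a string with the standard `shlex` module. WEKA has its own quoting and escaping rules, which this sketch does not claim to reproduce; the function `split_configuration` is hypothetical and the two strings are illustrative examples.

```python
import shlex


def split_configuration(config):
    """Split a configuration string into (classname, list of option tokens)."""
    tokens = shlex.split(config)
    if not tokens:
        raise ValueError("empty configuration string")
    return tokens[0], tokens[1:]


# A filter setup as it might be copied from the Preprocess panel.
flat = "weka.filters.unsupervised.attribute.Remove -R 1-2"
print(split_configuration(flat))

# A nested setup: the quoted filter string stays together as one option value.
nested = ('weka.classifiers.meta.FilteredClassifier '
          '-F "weka.filters.unsupervised.attribute.Remove -R 1" '
          '-W weka.classifiers.trees.J48')
classname, options = split_configuration(nested)
print(classname)
print(options)
```

The quoted value of the nested example can itself be passed to `split_configuration` again, which mirrors how a filter setting travels inside a larger scheme.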
Clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window.
Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do. Others have an additional button, Capabilities, which lists the types of attributes and classes the object can handle.
At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save... allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.
Applying Filters
Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in any of the file formats that can represent it, allowing it to be kept for future use.
Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the "supervised filters" require a class attribute to be set, and some of the "unsupervised attribute filters" will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.
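The idea of a filter skipping the class attribute can be illustrated with a small Python sketch that rescales numeric columns to the range [0, 1]. It is a toy example and not the implementation of any particular WEKA filter; `normalize_columns` and its `class_index` parameter are hypothetical, with `class_index=None` standing in for Class set to None.

```python
def normalize_columns(rows, class_index=None):
    """Rescale each column to [0, 1], leaving the class column untouched."""
    if not rows:
        return []
    result = [list(row) for row in rows]   # work on a copy of the data
    for col in range(len(rows[0])):
        if col == class_index:
            continue                        # the class attribute is skipped
        lo = min(row[col] for row in rows)
        hi = max(row[col] for row in rows)
        span = hi - lo
        for row in result:
            row[col] = (row[col] - lo) / span if span else 0.0
    return result


rows = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]]
print(normalize_columns(rows, class_index=1))  # second column left as is
print(normalize_columns(rows))                 # no class set: both rescaled
```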
Editing
You can also view the current dataset in a tabular format via the Edit... button. Clicking this button opens an ArffViewer dialog displaying the currently loaded data. You can edit the data, delete and rename attributes, delete instances and undo modifications. The modifications are applied only if you click the OK button and return to the main Explorer window.
Figures: Tabular view (En-ExplorerGuide_Editing-353.png), Table header popup (En-ExplorerGuide_Editing_Menu-353.png), Editing a nominal value (En-ExplorerGuide_Editing_NominalValue-353.png).
