weka.clusterers
Class XMeans

java.lang.Object
  extended by weka.clusterers.AbstractClusterer
      extended by weka.clusterers.RandomizableClusterer
          extended by weka.clusterers.XMeans
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, Clusterer, CapabilitiesHandler, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler

public class XMeans
extends RandomizableClusterer
implements TechnicalInformationHandler

Cluster data using the X-means algorithm.

X-Means is K-Means extended by an Improve-Structure part. In this part of the algorithm, each center is tentatively split within its own region. The decision between keeping a center and replacing it with its children is made by comparing the BIC values of the two structures.
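
The BIC comparison can be illustrated with a minimal, self-contained sketch. It is not the code used by this class: it scores a set of centers under a spherical-Gaussian model with the variance estimate described in the paper cited below, and the class BicSketch and its method names are hypothetical. The structure with the higher score would be preferred.

```java
/** Illustrative BIC score for a set of k-means centers; not Weka's implementation. */
final class BicSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * BIC = logLikelihood - (p / 2) * ln(R) for R points, K centers and M dimensions,
     * with p = (K - 1) + M * K + 1 free parameters and one shared variance estimate.
     * A distortion of exactly zero yields an infinite score; real code would guard this.
     */
    static double bic(double[][] points, double[][] centers) {
        int r = points.length;
        int k = centers.length;
        int m = points[0].length;
        if (r <= k) {
            throw new IllegalArgumentException("need more points than centers");
        }
        int[] counts = new int[k];
        double distortion = 0.0;
        for (double[] p : points) {
            // Assign the point to its nearest center.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = squaredDistance(p, centers[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            counts[best]++;
            distortion += bestDist;
        }
        // Shared variance estimate: distortion / (R - K).
        double variance = distortion / (r - k);
        // Sum of per-point log densities; the quadratic term collapses to (R - K) / 2.
        double logLik = -0.5 * r * m * Math.log(2.0 * Math.PI * variance) - 0.5 * (r - k);
        for (int count : counts) {
            if (count > 0) {
                logLik += count * Math.log((double) count / r);
            }
        }
        double freeParams = (k - 1) + (double) m * k + 1;
        return logLik - 0.5 * freeParams * Math.log(r);
    }

    public static void main(String[] args) {
        double[][] points = {{0.0}, {0.1}, {-0.1}, {10.0}, {10.1}, {9.9}};
        double parent = bic(points, new double[][] {{5.0}});
        double children = bic(points, new double[][] {{0.0}, {10.0}});
        System.out.println("BIC with one center:  " + parent);
        System.out.println("BIC with two centers: " + children);
    }
}
```

For the two well-separated groups in main, the two-center structure scores higher than the single center, which is the situation in which a split would be kept.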

For more information see:

Dan Pelleg, Andrew W. Moore: X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In: Seventeenth International Conference on Machine Learning, 727-734, 2000.

BibTeX:

 @inproceedings{Pelleg2000,
    author = {Dan Pelleg and Andrew W. Moore},
    booktitle = {Seventeenth International Conference on Machine Learning},
    pages = {727-734},
    publisher = {Morgan Kaufmann},
    title = {X-means: Extending K-means with Efficient Estimation of the Number of Clusters},
    year = {2000}
 }
 

Valid options are:

 -I <num>
  maximum number of overall iterations
  (default 1).
 -M <num>
  maximum number of iterations in the kMeans loop in
  the Improve-Parameter part 
  (default 1000).
 -J <num>
  maximum number of iterations in the kMeans loop
  for the split centroids in the Improve-Structure part
  (default 1000).
 -L <num>
  minimum number of clusters
  (default 2).
 -H <num>
  maximum number of clusters
  (default 4).
 -B <value>
  distance value for binary attributes
  (default 1.0).
 -use-kdtree
  Uses the KDTree internally
  (default no).
 -K <KDTree class specification>
  Full class name of KDTree class to use, followed
  by scheme options.
  eg: "weka.core.neighboursearch.kdtrees.KDTree -P"
  (default no KDTree class used).
 -C <value>
  cutoff factor, takes the given percentage of the split
  centroids if none of the children win
  (default 0.0).
 -D <distance function class specification>
  Full class name of Distance function class to use, followed
  by scheme options.
  (default weka.core.EuclideanDistance).
 -N <file name>
  file to read starting centers from (ARFF format).
 -O <file name>
  file to write centers to (ARFF format).
 -U <int>
  The debug level.
  (default 0)
 -Y <file name>
  The debug vectors file.
 -S <num>
  Random number seed.
  (default 10)
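
The flags above are passed to setOptions as an array of strings, with each flag and its value as separate elements. The following plain-Java sketch only assembles such an array for a few of the listed flags; the class XMeansOptionsSketch and its method are hypothetical helpers, not part of Weka.

```java
import java.util.ArrayList;
import java.util.List;

/** Assembles an option array in the layout expected by setOptions; illustrative only. */
final class XMeansOptionsSketch {

    /** Builds options for -L, -H, -S and, optionally, -use-kdtree. */
    static String[] options(int minClusters, int maxClusters, int seed, boolean useKdTree) {
        List<String> opts = new ArrayList<>();
        opts.add("-L");
        opts.add(Integer.toString(minClusters));
        opts.add("-H");
        opts.add(Integer.toString(maxClusters));
        opts.add("-S");
        opts.add(Integer.toString(seed));
        if (useKdTree) {
            // Flag without a value.
            opts.add("-use-kdtree");
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", options(2, 6, 10, false)));
    }
}
```

The resulting array could then be handed to setOptions(String[]) on an XMeans instance.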

Version:
$Revision: 8109 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz), Mark Hall (mhall@cs.waikato.ac.nz), Malcolm Ware (mfw4@cs.waikato.ac.nz)
See Also:
RandomizableClusterer, Serialized Form

Field Summary
static int D_CONVCHCLOSER
          take a closer look at the convergence of the children.
static int D_CURR
          for current debug.
static int D_FOLLOWSPLIT
          follows the splitting of the centers.
static int D_GENERAL
          general debugging.
static int D_ITERCOUNT
          follow iterations.
static int D_KDTREE
          check on kdtree.
static int D_METH_MISUSE
          functions may have been misused.
static int D_PRINTCENTERS
          print the centers.
static int D_RANDOMVECTOR
          check on random vectors.
 boolean m_CurrDebugFlag
          Flag: I'm debugging.
static int R_HIGH
          Index in ranges for HIGH.
static int R_LOW
          Index in ranges for LOW.
static int R_WIDTH
          Index in ranges for WIDTH.
 
Constructor Summary
XMeans()
          the default constructor.
 
Method Summary
 java.lang.String binValueTipText()
          Returns the tip text for this property.
 void buildClusterer(Instances data)
          Generates the X-Means clusterer.
 boolean checkForNominalAttributes(Instances data)
          Checks for nominal attributes in the dataset.
 int clusterInstance(Instance instance)
          Assigns a given instance to a cluster.
 java.lang.String cutOffFactorTipText()
          Returns the tip text for this property.
 java.lang.String debugLevelTipText()
          Returns the tip text for this property.
 java.lang.String debugVectorsFileTipText()
          Returns the tip text for this property.
 java.lang.String distanceFTipText()
          Returns the tip text for this property.
 double getBinValue()
          Gets the distance value between true and false of binary attributes.
 Capabilities getCapabilities()
          Returns default capabilities of the clusterer.
 Instances getClusterCenters()
          Returns the centers of the clusters as an Instances object.
 double getCutOffFactor()
          Gets the cutoff factor.
 int getDebugLevel()
          Gets the debug level.
 java.io.File getDebugVectorsFile()
          Gets the file name for a file that has the random vectors stored.
 DistanceFunction getDistanceF()
          Gets the distance function.
 java.io.File getInputCenterFile()
          Gets the file to read the list of centers from.
 KDTree getKDTree()
          Gets the KDTree class.
 int getMaxIterations()
          Gets the maximum number of iterations.
 int getMaxKMeans()
          Gets the maximum number of iterations in KMeans.
 int getMaxKMeansForChildren()
          Gets the maximum number of KMeans iterations performed on the child centers.
 int getMaxNumClusters()
          Gets the maximum number of clusters to generate.
 int getMinNumClusters()
          Gets the minimum number of clusters to generate.
 Instance getNextDebugVectorsInstance(Instances model)
          Read an instance from debug vectors file.
 java.lang.String[] getOptions()
          Gets the current settings of XMeans.
 java.io.File getOutputCenterFile()
          Gets the file to write the list of centers to.
 java.lang.String getRevision()
          Returns the revision string.
 TechnicalInformation getTechnicalInformation()
          Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
 boolean getUseKDTree()
          Gets whether the KDTree is used or not.
 java.lang.String globalInfo()
          Returns a string describing this clusterer.
 void initDebugVectorsInput()
          Initialises the debug vector input.
 java.lang.String inputCenterFileTipText()
          Returns the tip text for this property.
 java.lang.String KDTreeTipText()
          Returns the tip text for this property.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property.
 java.lang.String maxKMeansForChildrenTipText()
          Returns the tip text for this property.
 java.lang.String maxKMeansTipText()
          Returns the tip text for this property.
 java.lang.String maxNumClustersTipText()
          Returns the tip text for this property.
 java.lang.String minNumClustersTipText()
          Returns the tip text for this property.
 int numberOfClusters()
          Returns the number of clusters.
 java.lang.String outputCenterFileTipText()
          Returns the tip text for this property.
 void setBinValue(double value)
          Sets the distance value between true and false of binary attributes.
 void setCutOffFactor(double i)
          Sets a new cutoff factor.
 void setDebugLevel(int d)
          Sets the debug level.
 void setDebugVectorsFile(java.io.File value)
          Sets the file that has the random vectors stored.
 void setDistanceF(DistanceFunction distanceF)
          Sets the distance function.
 void setInputCenterFile(java.io.File value)
          Sets the file to read the list of centers from.
 void setKDTree(KDTree k)
          Sets the KDTree class.
 void setMaxIterations(int i)
          Sets the maximum number of iterations to perform.
 void setMaxKMeans(int i)
          Sets the maximum number of iterations to perform in KMeans.
 void setMaxKMeansForChildren(int i)
          Sets the maximum number of KMeans iterations performed on the child centers.
 void setMaxNumClusters(int n)
          Sets the maximum number of clusters to generate.
 void setMinNumClusters(int n)
          Sets the minimum number of clusters to generate.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setOutputCenterFile(java.io.File value)
          Sets the file to write the list of centers to.
 void setUseKDTree(boolean value)
          Sets whether to use the KDTree or not.
 java.lang.String toString()
          Returns a string describing this clusterer.
 java.lang.String useKDTreeTipText()
          Returns the tip text for this property.
 
Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.clusterers.AbstractClusterer
distributionForInstance, forName, makeCopies, makeCopy, runClusterer
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

R_LOW

public static int R_LOW
Index in ranges for LOW.


R_HIGH

public static int R_HIGH
Index in ranges for HIGH.


R_WIDTH

public static int R_WIDTH
Index in ranges for WIDTH.


D_PRINTCENTERS

public static int D_PRINTCENTERS
print the centers.


D_FOLLOWSPLIT

public static int D_FOLLOWSPLIT
follows the splitting of the centers.


D_CONVCHCLOSER

public static int D_CONVCHCLOSER
take a closer look at the convergence of the children.


D_RANDOMVECTOR

public static int D_RANDOMVECTOR
check on random vectors.


D_KDTREE

public static int D_KDTREE
check on kdtree.


D_ITERCOUNT

public static int D_ITERCOUNT
follow iterations.


D_METH_MISUSE

public static int D_METH_MISUSE
functions may have been misused.


D_CURR

public static int D_CURR
for current debug.


D_GENERAL

public static int D_GENERAL
general debugging.


m_CurrDebugFlag

public boolean m_CurrDebugFlag
Flag: I'm debugging.

Constructor Detail

XMeans

public XMeans()
the default constructor.

Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer.

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter GUI

getTechnicalInformation

public TechnicalInformation getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
Returns:
the technical information about this class

getCapabilities

public Capabilities getCapabilities()
Returns default capabilities of the clusterer.

Specified by:
getCapabilities in interface Clusterer
Specified by:
getCapabilities in interface CapabilitiesHandler
Overrides:
getCapabilities in class AbstractClusterer
Returns:
the capabilities of this clusterer

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Generates the X-Means clusterer.

Specified by:
buildClusterer in interface Clusterer
Specified by:
buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully
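
A typical call sequence around buildClusterer might look like the sketch below. It assumes Weka is on the classpath and loads the data with weka.core.converters.ConverterUtils.DataSource; the class XMeansUsageSketch, its method, and the ARFF path passed on the command line are hypothetical. Only setters and getters documented on this page, plus the inherited setSeed, are used.

```java
import weka.clusterers.XMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Minimal usage sketch for XMeans; requires the Weka library. */
final class XMeansUsageSketch {

    /** Loads an ARFF file and builds an XMeans model with the given cluster bounds. */
    static XMeans cluster(String arffPath, int minClusters, int maxClusters, int seed)
            throws Exception {
        // No class index is set, so every attribute is used for clustering.
        Instances data = DataSource.read(arffPath);

        XMeans xmeans = new XMeans();
        xmeans.setMinNumClusters(minClusters);
        xmeans.setMaxNumClusters(maxClusters);
        xmeans.setSeed(seed);
        xmeans.buildClusterer(data);
        return xmeans;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to an ARFF file with numeric attributes.
        XMeans model = cluster(args[0], 2, 6, 10);
        System.out.println("clusters found: " + model.numberOfClusters());
        System.out.println(model.getClusterCenters());
    }
}
```

After building, clusterInstance(Instance) can be called on the returned model to obtain the cluster index for an individual instance.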

checkForNominalAttributes

public boolean checkForNominalAttributes(Instances data)
Checks for nominal attributes in the dataset. Class attribute is ignored.

Parameters:
data - the data to check
Returns:
false if no nominal attributes are present

clusterInstance

public int clusterInstance(Instance instance)
                    throws java.lang.Exception
Assigns a given instance to a cluster.

Specified by:
clusterInstance in interface Clusterer
Overrides:
clusterInstance in class AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Returns:
the index of the cluster the instance is assigned to
Throws:
java.lang.Exception - if the instance could not be assigned to a cluster

numberOfClusters

public int numberOfClusters()
Returns the number of clusters.

Specified by:
numberOfClusters in interface Clusterer
Specified by:
numberOfClusters in class AbstractClusterer
Returns:
the number of clusters generated for a training dataset.

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class RandomizableClusterer
Returns:
an enumeration of all the available options

minNumClustersTipText

public java.lang.String minNumClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMinNumClusters

public void setMinNumClusters(int n)
Sets the minimum number of clusters to generate.

Parameters:
n - the minimum number of clusters to generate

getMinNumClusters

public int getMinNumClusters()
Gets the minimum number of clusters to generate.

Returns:
the minimum number of clusters to generate

maxNumClustersTipText

public java.lang.String maxNumClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxNumClusters

public void setMaxNumClusters(int n)
Sets the maximum number of clusters to generate.

Parameters:
n - the maximum number of clusters to generate

getMaxNumClusters

public int getMaxNumClusters()
Gets the maximum number of clusters to generate.

Returns:
the maximum number of clusters to generate

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Sets the maximum number of iterations to perform.

Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Gets the maximum number of iterations.

Returns:
the number of iterations

maxKMeansTipText

public java.lang.String maxKMeansTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxKMeans

public void setMaxKMeans(int i)
Sets the maximum number of iterations to perform in KMeans.

Parameters:
i - the number of iterations

getMaxKMeans

public int getMaxKMeans()
Gets the maximum number of iterations in KMeans.

Returns:
the number of iterations

maxKMeansForChildrenTipText

public java.lang.String maxKMeansForChildrenTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxKMeansForChildren

public void setMaxKMeansForChildren(int i)
Sets the maximum number of KMeans iterations performed on the child centers.

Parameters:
i - the number of iterations

getMaxKMeansForChildren

public int getMaxKMeansForChildren()
Gets the maximum number of KMeans iterations performed on the child centers.

Returns:
the number of iterations

cutOffFactorTipText

public java.lang.String cutOffFactorTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setCutOffFactor

public void setCutOffFactor(double i)
Sets a new cutoff factor.

Parameters:
i - the new cutoff factor
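
One possible reading of the cutoff factor can be sketched in plain Java. This is an illustration based only on the description of the -C option, not the code of this class: it assumes the factor is a fraction between 0 and 1, that a child pair "wins" when its BIC exceeds the parent's, and that the fallback picks the centers with the largest BIC difference. The class CutOffSketch and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative selection of centers to split; assumptions are noted in the comments. */
final class CutOffSketch {

    /**
     * Returns, per center, whether it would be replaced by its children.
     * Assumption: children win when childrenBic[i] > parentBic[i].
     */
    static boolean[] chooseSplits(double[] parentBic, double[] childrenBic, double cutOffFactor) {
        int n = parentBic.length;
        boolean[] split = new boolean[n];
        boolean anyWon = false;
        for (int i = 0; i < n; i++) {
            split[i] = childrenBic[i] > parentBic[i];
            anyWon |= split[i];
        }
        if (anyWon || cutOffFactor <= 0.0) {
            return split;
        }
        // No children won: split the given fraction of centers anyway (assumed rounding down),
        // preferring those whose children came closest to winning.
        int quota = (int) (n * cutOffFactor);
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
                Comparator.comparingDouble((Integer i) -> childrenBic[i] - parentBic[i]).reversed());
        for (int j = 0; j < quota && j < n; j++) {
            split[order[j]] = true;
        }
        return split;
    }

    public static void main(String[] args) {
        double[] parent = {0.0, 0.0, 0.0, 0.0};
        double[] children = {-1.0, -3.0, -2.0, -4.0};
        System.out.println(Arrays.toString(chooseSplits(parent, children, 0.5)));
    }
}
```

With the default factor of 0.0 this sketch splits nothing when no children win, which matches the documented default of taking no centroids in that case.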

getCutOffFactor

public double getCutOffFactor()
Gets the cutoff factor.

Returns:
the cutoff factor

binValueTipText

public java.lang.String binValueTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getBinValue

public double getBinValue()
Gets the distance value between true and false of binary attributes, and between "same" and "different" of nominal attributes.

Returns:
the distance value for binary attributes

setBinValue

public void setBinValue(double value)
Sets the distance value between true and false of binary attributes, and between "same" and "different" of nominal attributes.

Parameters:
value - the distance

distanceFTipText

public java.lang.String distanceFTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDistanceF

public void setDistanceF(DistanceFunction distanceF)
Sets the distance function.

Parameters:
distanceF - the distance function with all options set

getDistanceF

public DistanceFunction getDistanceF()
Gets the distance function.

Returns:
the distance function

debugVectorsFileTipText

public java.lang.String debugVectorsFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDebugVectorsFile

public void setDebugVectorsFile(java.io.File value)
Sets the file that has the random vectors stored. Only used for debugging purposes.

Parameters:
value - the file to read the random vectors from

getDebugVectorsFile

public java.io.File getDebugVectorsFile()
Gets the file name for a file that has the random vectors stored. Only used for debugging purposes.

Returns:
the file to read the vectors from

initDebugVectorsInput

public void initDebugVectorsInput()
                           throws java.lang.Exception
Initialises the debug vector input.

Throws:
java.lang.Exception - if there is an error opening the debug input file.

getNextDebugVectorsInstance

public Instance getNextDebugVectorsInstance(Instances model)
                                     throws java.lang.Exception
Read an instance from debug vectors file.

Parameters:
model - the data model for the instance.
Returns:
the next debug vector.
Throws:
java.lang.Exception - if there are no debug vectors in m_DebugVectors.

inputCenterFileTipText

public java.lang.String inputCenterFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setInputCenterFile

public void setInputCenterFile(java.io.File value)
Sets the file to read the list of centers from.

Parameters:
value - the file to read centers from

getInputCenterFile

public java.io.File getInputCenterFile()
Gets the file to read the list of centers from.

Returns:
the file to read the centers from

outputCenterFileTipText

public java.lang.String outputCenterFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setOutputCenterFile

public void setOutputCenterFile(java.io.File value)
Sets the file to write the list of centers to.

Parameters:
value - file to write centers to

getOutputCenterFile

public java.io.File getOutputCenterFile()
Gets the file to write the list of centers to.

Returns:
the file to write the centers to

KDTreeTipText

public java.lang.String KDTreeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setKDTree

public void setKDTree(KDTree k)
Sets the KDTree class.

Parameters:
k - a KDTree object with all options set

getKDTree

public KDTree getKDTree()
Gets the KDTree class.

Returns:
the configured KDTree

useKDTreeTipText

public java.lang.String useKDTreeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setUseKDTree

public void setUseKDTree(boolean value)
Sets whether to use the KDTree or not.

Parameters:
value - if true the KDTree is used

getUseKDTree

public boolean getUseKDTree()
Gets whether the KDTree is used or not.

Returns:
true if KDTrees are used

debugLevelTipText

public java.lang.String debugLevelTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDebugLevel

public void setDebugLevel(int d)
Sets the debug level. A debug level of 0 means no output.

Parameters:
d - the debug level

getDebugLevel

public int getDebugLevel()
Gets the debug level.

Returns:
debug level

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Valid options are:

 -I <num>
  maximum number of overall iterations
  (default 1).
 -M <num>
  maximum number of iterations in the kMeans loop in
  the Improve-Parameter part 
  (default 1000).
 -J <num>
  maximum number of iterations in the kMeans loop
  for the split centroids in the Improve-Structure part
  (default 1000).
 -L <num>
  minimum number of clusters
  (default 2).
 -H <num>
  maximum number of clusters
  (default 4).
 -B <value>
  distance value for binary attributes
  (default 1.0).
 -use-kdtree
  Uses the KDTree internally
  (default no).
 -K <KDTree class specification>
  Full class name of KDTree class to use, followed
  by scheme options.
  eg: "weka.core.neighboursearch.kdtrees.KDTree -P"
  (default no KDTree class used).
 -C <value>
  cutoff factor, takes the given percentage of the split
  centroids if none of the children win
  (default 0.0).
 -D <distance function class specification>
  Full class name of Distance function class to use, followed
  by scheme options.
  (default weka.core.EuclideanDistance).
 -N <file name>
  file to read starting centers from (ARFF format).
 -O <file name>
  file to write centers to (ARFF format).
 -U <int>
  The debug level.
  (default 0)
 -Y <file name>
  The debug vectors file.
 -S <num>
  Random number seed.
  (default 10)

Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of XMeans.

Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class RandomizableClusterer
Returns:
an array of strings suitable for passing to setOptions

toString

public java.lang.String toString()
Returns a string describing this clusterer.

Overrides:
toString in class java.lang.Object
Returns:
a description of the clusterer as a string

getClusterCenters

public Instances getClusterCenters()
Returns the centers of the clusters as an Instances object.

Returns:
the cluster centers.

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface RevisionHandler
Overrides:
getRevision in class AbstractClusterer
Returns:
the revision

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain options