weka.classifiers.functions
Class SGDText

java.lang.Object
  extended by weka.classifiers.AbstractClassifier
      extended by weka.classifiers.RandomizableClassifier
          extended by weka.classifiers.functions.SGDText
All Implemented Interfaces:
java.io.Serializable, java.lang.Cloneable, Classifier, UpdateableClassifier, CapabilitiesHandler, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler

public class SGDText
extends RandomizableClassifier
implements UpdateableClassifier, WeightedInstancesHandler

Implements stochastic gradient descent for learning a linear binary class SVM or binary class logistic regression on text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.
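
The following is a minimal sketch of the kind of SGD update the description refers to, written in plain Java so that it runs without Weka. It is not the SGDText source: the class name SgdSketch, its method names, and the exact form of the regularization step are assumptions made for illustration. Labels are assumed to be encoded as -1 and +1, and documents as sparse word-to-value maps.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a tiny linear SGD learner over sparse word features. */
final class SgdSketch {
    static final int HINGE = 0;   // mirrors the documented -F 0 setting
    static final int LOGLOSS = 1; // mirrors the documented -F 1 setting

    final Map<String, Double> weights = new HashMap<>();
    double bias = 0.0;

    /** Linear output w.x + b for a sparse document vector. */
    double score(Map<String, Double> doc) {
        double sum = bias;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    /** Magnitude of the loss derivative with respect to the margin z = y * (w.x + b). */
    static double dloss(int loss, double z) {
        if (loss == HINGE) {
            return z < 1.0 ? 1.0 : 0.0;
        }
        return 1.0 / (1.0 + Math.exp(z));
    }

    /** One SGD step on a single example; y must be -1 or +1. */
    void update(Map<String, Double> doc, double y, int loss,
                double learningRate, double lambda) {
        double z = y * score(doc);
        // Assumed form of L2 regularization: shrink every weight slightly.
        double shrink = 1.0 - learningRate * lambda;
        weights.replaceAll((word, w) -> w * shrink);
        double factor = learningRate * y * dloss(loss, z);
        if (factor != 0.0) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                weights.merge(e.getKey(), factor * e.getValue(), Double::sum);
            }
            bias += factor;
        }
    }

    /** Batch training: repeat the single-example update for a number of epochs. */
    void train(List<Map<String, Double>> docs, double[] labels, int loss,
               int epochs, double learningRate, double lambda) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < docs.size(); i++) {
                update(docs.get(i), labels[i], loss, learningRate, lambda);
            }
        }
    }
}
```

In this sketch, calling update once per incoming document corresponds to incremental learning, while train corresponds to batch learning with a fixed number of epochs.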

Valid options are:

 -F
  Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)
  (default = 0)
 
 -outputProbs
  Output probabilities for SVMs (fits a logistic
  model to the output of the SVM)
 
 -L
  The learning rate (default = 0.01).
 
 -R <double>
  The lambda regularization constant (default = 0.0001)
 
 -E <integer>
  The number of epochs to perform (batch learning only, default = 500)
 
 -W
  Use word frequencies instead of binary bag of words.
 
 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
 
 -M <double>
  Minimum word frequency. Words with less than this frequency are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).
 
 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)
 
 -norm <num>
  Specify the norm that each instance must have (default 1.0)
 
 -lnorm <num>
  Specify L-norm to use (default 2.0)
 
 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.
 
 -stoplist
  Ignore words that are in the stoplist.
 
 -stopwords <file>
  A file containing stopwords to override the default ones.
  Using this option automatically sets the flag ('-stoplist') to use the
  stoplist if the file exists.
  Format: one stopword per line, lines starting with '#'
  are interpreted as comments and ignored.
 
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 

Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com), Eibe Frank (eibe{[at]}cs{[dot]}waikato{[dot]}ac{[dot]}nz)
See Also:
Serialized Form

Nested Class Summary
static class SGDText.Count
           
 
Field Summary
static int HINGE
          the hinge loss function.
static int LOGLOSS
          the log loss function.
static Tag[] TAGS_SELECTION
          Loss functions to choose from
 
Constructor Summary
SGDText()
           
 
Method Summary
 double bias()
           
 void buildClassifier(Instances data)
          Method for building the classifier.
 double[] distributionForInstance(Instance inst)
          Predicts the class memberships for a given instance.
 java.lang.String epochsTipText()
          Returns the tip text for this property
 Capabilities getCapabilities()
          Returns default capabilities of the classifier.
 java.util.LinkedHashMap<java.lang.String,SGDText.Count> getDictionary()
          Get this model's dictionary (including term weights).
 int getDictionarySize()
          Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).
 int getEpochs()
          Get current number of epochs
 double getLambda()
          Get the current value of lambda
 double getLearningRate()
          Get the learning rate.
 double getLNorm()
          Get the L Norm used.
 SelectedTag getLossFunction()
          Get the current loss function.
 boolean getLowercaseTokens()
          Get whether to convert all tokens to lowercase
 double getMinWordFrequency()
          Get the minimum word frequency.
 double getNorm()
          Get the instance's Norm.
 boolean getNormalizeDocLength()
          Get whether to normalize the length of each document
 java.lang.String[] getOptions()
          Gets the current settings of the classifier.
 boolean getOutputProbsForSVM()
          Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
 int getPeriodicPruning()
          Get how often to prune the dictionary
 java.lang.String getRevision()
          Returns the revision string.
 Stemmer getStemmer()
          Returns the current stemming algorithm, null if none is used.
 java.io.File getStopwords()
          Returns the file used for obtaining the stopwords; if the file represents a directory, the default ones are used.
 Tokenizer getTokenizer()
          Returns the current tokenizer algorithm.
 boolean getUseStopList()
          Get whether to ignore all words that are on the stoplist.
 boolean getUseWordFrequencies()
          Get whether to use word frequencies rather than binary bag of words representation.
 java.lang.String globalInfo()
          Returns a string describing classifier
 java.lang.String lambdaTipText()
          Returns the tip text for this property
 java.lang.String learningRateTipText()
          Returns the tip text for this property
 java.util.Enumeration<Option> listOptions()
          Returns an enumeration describing the available options.
 java.lang.String LNormTipText()
          Returns the tip text for this property
 java.lang.String lossFunctionTipText()
          Returns the tip text for this property
 java.lang.String lowercaseTokensTipText()
          Returns the tip text for this property
static void main(java.lang.String[] args)
          Main method for testing this class.
 java.lang.String minWordFrequencyTipText()
          Returns the tip text for this property
 java.lang.String normalizeDocLengthTipText()
          Returns the tip text for this property
 java.lang.String normTipText()
          Returns the tip text for this property
 java.lang.String outputProbsForSVMTipText()
          Returns the tip text for this property
 java.lang.String periodicPruningTipText()
          Returns the tip text for this property
 void reset()
          Reset the classifier.
 void setBias(double bias)
           
 void setEpochs(int e)
          Set the number of epochs to use
 void setLambda(double lambda)
          Set the value of lambda to use
 void setLearningRate(double lr)
          Set the learning rate.
 void setLNorm(double newLNorm)
          Set the L-norm to use
 void setLossFunction(SelectedTag function)
          Set the loss function to use.
 void setLowercaseTokens(boolean l)
          Set whether to convert all tokens to lowercase
 void setMinWordFrequency(double minFreq)
          Set the minimum word frequency.
 void setNorm(double newNorm)
          Set the norm of the instances
 void setNormalizeDocLength(boolean norm)
          Set whether to normalize the length of each document
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setOutputProbsForSVM(boolean o)
          Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
 void setPeriodicPruning(int p)
          Set how often to prune the dictionary
 void setStemmer(Stemmer value)
          Set the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
 void setStopwords(java.io.File value)
          Set the file containing the stopwords; null or a directory unsets the stopwords.
 void setTokenizer(Tokenizer value)
          Set the tokenizer algorithm to use.
 void setUseStopList(boolean u)
          Set whether to ignore all words that are on the stoplist.
 void setUseWordFrequencies(boolean u)
          Set whether to use word frequencies rather than binary bag of words representation.
 java.lang.String stemmerTipText()
          Returns the tip text for this property.
 java.lang.String stopwordsTipText()
          Returns the tip text for this property.
 java.lang.String tokenizerTipText()
          Returns the tip text for this property.
 java.lang.String toString()
           
 void updateClassifier(Instance instance)
          Updates the classifier with the given instance.
 java.lang.String useStopListTipText()
          Returns the tip text for this property
 java.lang.String useWordFrequenciesTipText()
          Returns the tip text for this property
 
Methods inherited from class weka.classifiers.RandomizableClassifier
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.classifiers.AbstractClassifier
classifyInstance, debugTipText, forName, getDebug, makeCopies, makeCopy, runClassifier, setDebug
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

HINGE

public static final int HINGE
the hinge loss function.

See Also:
Constant Field Values

LOGLOSS

public static final int LOGLOSS
the log loss function.

See Also:
Constant Field Values

TAGS_SELECTION

public static final Tag[] TAGS_SELECTION
Loss functions to choose from

Constructor Detail

SGDText

public SGDText()
Method Detail

getCapabilities

public Capabilities getCapabilities()
Returns default capabilities of the classifier.

Specified by:
getCapabilities in interface Classifier
Specified by:
getCapabilities in interface CapabilitiesHandler
Overrides:
getCapabilities in class AbstractClassifier
Returns:
the capabilities of this classifier
See Also:
Capabilities

setStemmer

public void setStemmer(Stemmer value)
Set the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).

Parameters:
value - the configured stemming algorithm, or null
See Also:
NullStemmer

getStemmer

public Stemmer getStemmer()
Returns the current stemming algorithm, null if none is used.

Returns:
the current stemming algorithm, null if none set

stemmerTipText

public java.lang.String stemmerTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setTokenizer

public void setTokenizer(Tokenizer value)
Set the tokenizer algorithm to use.

Parameters:
value - the configured tokenizing algorithm

getTokenizer

public Tokenizer getTokenizer()
Returns the current tokenizer algorithm.

Returns:
the current tokenizer algorithm

tokenizerTipText

public java.lang.String tokenizerTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

useWordFrequenciesTipText

public java.lang.String useWordFrequenciesTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setUseWordFrequencies

public void setUseWordFrequencies(boolean u)
Set whether to use word frequencies rather than binary bag of words representation.

Parameters:
u - true if word frequencies are to be used.
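
To make the difference between the two representations concrete, here is a small plain-Java sketch. It is an illustration, not SGDText's code: the class BagOfWordsSketch and its vectorize method are hypothetical, and splitting on whitespace is an assumed stand-in for whatever tokenizer is configured.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: binary versus frequency-based bag of words. */
final class BagOfWordsSketch {
    /**
     * Turns text into a sparse word-to-value map.
     * With useWordFrequencies the value is the number of occurrences,
     * otherwise it is 1.0 for every word that appears at least once.
     */
    static Map<String, Double> vectorize(String text, boolean useWordFrequencies,
                                         boolean lowercase) {
        Map<String, Double> doc = new LinkedHashMap<>();
        // Assumption: whitespace tokenization stands in for the real tokenizer.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String word = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            if (useWordFrequencies) {
                doc.merge(word, 1.0, Double::sum);
            } else {
                doc.put(word, 1.0);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // prints {the=2.0, cat=1.0, saw=1.0, dog=1.0}
        System.out.println(vectorize("the cat saw the dog", true, false));
        // prints {the=1.0, cat=1.0, saw=1.0, dog=1.0}
        System.out.println(vectorize("the cat saw the dog", false, false));
    }
}
```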

getUseWordFrequencies

public boolean getUseWordFrequencies()
Get whether to use word frequencies rather than binary bag of words representation.

Returns:
true if word frequencies are to be used.

lowercaseTokensTipText

public java.lang.String lowercaseTokensTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setLowercaseTokens

public void setLowercaseTokens(boolean l)
Set whether to convert all tokens to lowercase

Parameters:
l - true if all tokens are to be converted to lowercase

getLowercaseTokens

public boolean getLowercaseTokens()
Get whether to convert all tokens to lowercase

Returns:
true if all tokens are to be converted to lowercase

useStopListTipText

public java.lang.String useStopListTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setUseStopList

public void setUseStopList(boolean u)
Set whether to ignore all words that are on the stoplist.

Parameters:
u - true to ignore all words on the stoplist.

getUseStopList

public boolean getUseStopList()
Get whether to ignore all words that are on the stoplist.

Returns:
true to ignore all words on the stoplist.

setStopwords

public void setStopwords(java.io.File value)
Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.

Parameters:
value - the file containing the stopwords
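
The stopwords file format given for the -stopwords option (one stopword per line, lines starting with '#' treated as comments) can be read with a few lines of plain Java. The sketch below is an assumption-laden illustration rather than Weka's loader: the class StopwordFileSketch is hypothetical, and trimming whitespace and skipping blank lines are choices made here, not documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: reads a one-word-per-line stopword list with '#' comments. */
final class StopwordFileSketch {
    static Set<String> read(Reader in) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim(); // assumption: surrounding whitespace is not significant
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines (assumption) and comment lines
            }
            words.add(trimmed);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String content = "# common English words\nthe\nand\n\nof\n";
        // prints [the, and, of]
        System.out.println(read(new StringReader(content)));
    }
}
```

For a real file, a java.io.FileReader (or Files.newBufferedReader) could be passed in place of the StringReader.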

getStopwords

public java.io.File getStopwords()
returns the file used for obtaining the stopwords, if the file represents a directory then the default ones are used.

Returns:
the file containing the stopwords

stopwordsTipText

public java.lang.String stopwordsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

periodicPruningTipText

public java.lang.String periodicPruningTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setPeriodicPruning

public void setPeriodicPruning(int p)
Set how often to prune the dictionary

Parameters:
p - how often to prune

getPeriodicPruning

public int getPeriodicPruning()
Get how often to prune the dictionary

Returns:
how often to prune the dictionary

minWordFrequencyTipText

public java.lang.String minWordFrequencyTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMinWordFrequency

public void setMinWordFrequency(double minFreq)
Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.

Parameters:
minFreq - the minimum word frequency to use
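
As a rough illustration of how a minimum word frequency and periodic pruning can interact, the following plain-Java sketch keeps a dictionary of word counts and removes low-frequency entries every fixed number of instances. It is a sketch under assumptions, not SGDText's implementation: DictionaryPruningSketch, WordStat, observe, prune and effectiveSize are hypothetical names, and WordStat is only a stand-in for a dictionary entry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a word dictionary with periodic low-frequency pruning. */
final class DictionaryPruningSketch {
    /** Hypothetical dictionary entry: occurrence count plus a model weight. */
    static final class WordStat {
        double count;
        double weight;
    }

    final Map<String, WordStat> dictionary = new LinkedHashMap<>();
    final double minWordFrequency;
    final int periodicPruning; // 0 means never prune
    int instancesSeen = 0;

    DictionaryPruningSketch(double minWordFrequency, int periodicPruning) {
        this.minWordFrequency = minWordFrequency;
        this.periodicPruning = periodicPruning;
    }

    /** Counts the words of one instance and prunes if the period has elapsed. */
    void observe(List<String> words) {
        for (String word : words) {
            dictionary.computeIfAbsent(word, k -> new WordStat()).count += 1.0;
        }
        instancesSeen++;
        if (periodicPruning > 0 && instancesSeen % periodicPruning == 0) {
            prune();
        }
    }

    /** Removes every word seen fewer than minWordFrequency times. */
    void prune() {
        dictionary.values().removeIf(stat -> stat.count < minWordFrequency);
    }

    /** Number of words at or above the threshold, ignoring not-yet-pruned rare words. */
    int effectiveSize() {
        int size = 0;
        for (WordStat stat : dictionary.values()) {
            if (stat.count >= minWordFrequency) {
                size++;
            }
        }
        return size;
    }
}
```

Between pruning passes the map can still hold rare words, which is why this sketch reports an effective size separately from the raw map size.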

getMinWordFrequency

public double getMinWordFrequency()
Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.

Returns:
the minimum word frequency to use

normalizeDocLengthTipText

public java.lang.String normalizeDocLengthTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNormalizeDocLength

public void setNormalizeDocLength(boolean norm)
Set whether to normalize the length of each document

Parameters:
norm - true if document lengths are to be normalized

getNormalizeDocLength

public boolean getNormalizeDocLength()
Get whether to normalize the length of each document

Returns:
true if document lengths are to be normalized

normTipText

public java.lang.String normTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getNorm

public double getNorm()
Get the instance's Norm.

Returns:
the Norm

setNorm

public void setNorm(double newNorm)
Set the norm of the instances

Parameters:
newNorm - the norm to which the instances must be set

LNormTipText

public java.lang.String LNormTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getLNorm

public double getLNorm()
Get the L Norm used.

Returns:
the L-norm used

setLNorm

public void setLNorm(double newLNorm)
Set the L-norm to use

Parameters:
newLNorm - the L-norm
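
The norm and L-norm settings work together when document length normalization is enabled. One plausible reading, sketched below in plain Java, is that each document vector is rescaled so that its L-p length (with p given by the L-norm setting) equals the target norm. The class DocLengthNormSketch and its methods are hypothetical, and the formula is an assumption about the intended behavior rather than a copy of the SGDText code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: rescale a sparse vector to a target L-p norm. */
final class DocLengthNormSketch {
    /** Computes (sum |v|^p)^(1/p) over all values in the document. */
    static double length(Map<String, Double> doc, double lnorm) {
        double sum = 0.0;
        for (double value : doc.values()) {
            sum += Math.pow(Math.abs(value), lnorm);
        }
        return Math.pow(sum, 1.0 / lnorm);
    }

    /** Scales the document in place so that its L-p length equals norm. */
    static void normalize(Map<String, Double> doc, double norm, double lnorm) {
        double current = length(doc, lnorm);
        if (current == 0.0) {
            return; // an empty or all-zero document cannot be rescaled
        }
        double scale = norm / current;
        doc.replaceAll((word, value) -> value * scale);
    }

    public static void main(String[] args) {
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("cat", 3.0);
        doc.put("dog", 4.0);
        normalize(doc, 1.0, 2.0); // documented defaults: norm 1.0, L-norm 2.0
        System.out.println(doc + " length=" + length(doc, 2.0));
    }
}
```

With the documented defaults (norm 1.0, L-norm 2.0) this reduces to scaling every document to unit Euclidean length.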

lambdaTipText

public java.lang.String lambdaTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setLambda

public void setLambda(double lambda)
Set the value of lambda to use

Parameters:
lambda - the value of lambda to use

getLambda

public double getLambda()
Get the current value of lambda

Returns:
the current value of lambda

setLearningRate

public void setLearningRate(double lr)
Set the learning rate.

Parameters:
lr - the learning rate to use.

getLearningRate

public double getLearningRate()
Get the learning rate.

Returns:
the learning rate

learningRateTipText

public java.lang.String learningRateTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

epochsTipText

public java.lang.String epochsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setEpochs

public void setEpochs(int e)
Set the number of epochs to use

Parameters:
e - the number of epochs to use

getEpochs

public int getEpochs()
Get current number of epochs

Returns:
the current number of epochs

setLossFunction

public void setLossFunction(SelectedTag function)
Set the loss function to use.

Parameters:
function - the loss function to use.

getLossFunction

public SelectedTag getLossFunction()
Get the current loss function.

Returns:
the current loss function.

lossFunctionTipText

public java.lang.String lossFunctionTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setOutputProbsForSVM

public void setOutputProbsForSVM(boolean o)
Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).

Parameters:
o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
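
Fitting a logistic model to SVM outputs is similar in spirit to Platt scaling: a one-dimensional logistic regression maps the raw SVM output to a probability. The plain-Java sketch below trains such a model with SGD. It is only an illustration: the class SvmProbSketch, its fields, and the choice of learning rate and epochs are assumptions, and class labels are assumed to be encoded as 0 and 1.

```java
/** Illustrative only: a 1-D logistic model fitted to SVM outputs with SGD. */
final class SvmProbSketch {
    double slope = 0.0;
    double intercept = 0.0;

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Estimated probability of class 1 for a raw SVM output. */
    double probability(double svmOutput) {
        return sigmoid(slope * svmOutput + intercept);
    }

    /**
     * Fits slope and intercept by stochastic gradient ascent on the
     * log-likelihood; labels must be 0 or 1.
     */
    void fit(double[] svmOutputs, int[] labels, int epochs, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < svmOutputs.length; i++) {
                double error = labels[i] - probability(svmOutputs[i]);
                slope += learningRate * error * svmOutputs[i];
                intercept += learningRate * error;
            }
        }
    }

    public static void main(String[] args) {
        SvmProbSketch model = new SvmProbSketch();
        model.fit(new double[] {-2.0, -1.0, 1.0, 2.0}, new int[] {0, 0, 1, 1}, 200, 0.1);
        System.out.println("P(class 1 | output = 1.5) = " + model.probability(1.5));
    }
}
```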

getOutputProbsForSVM

public boolean getOutputProbsForSVM()
Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).

Returns:
true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.

outputProbsForSVMTipText

public java.lang.String outputProbsForSVMTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

listOptions

public java.util.Enumeration<Option> listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class RandomizableClassifier
Returns:
an enumeration of all the available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Valid options are:

 -F
  Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)
  (default = 0)
 
 -outputProbs
  Output probabilities for SVMs (fits a logistic
  model to the output of the SVM)
 
 -L
  The learning rate (default = 0.01).
 
 -R <double>
  The lambda regularization constant (default = 0.0001)
 
 -E <integer>
  The number of epochs to perform (batch learning only, default = 500)
 
 -W
  Use word frequencies instead of binary bag of words.
 
 -P <# instances>
  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
 
 -M <double>
  Minimum word frequency. Words with less than this frequency are ignored.
  If periodic pruning is turned on then this is also used to determine which
  words to remove from the dictionary (default = 3).
 
 -normalize
  Normalize document length (use in conjunction with -norm and -lnorm)
 
 -norm <num>
  Specify the norm that each instance must have (default 1.0)
 
 -lnorm <num>
  Specify L-norm to use (default 2.0)
 
 -lowercase
  Convert all tokens to lowercase before adding to the dictionary.
 
 -stoplist
  Ignore words that are in the stoplist.
 
 -stopwords <file>
  A file containing stopwords to override the default ones.
  Using this option automatically sets the flag ('-stoplist') to use the
  stoplist if the file exists.
  Format: one stopword per line, lines starting with '#'
  are interpreted as comments and ignored.
 
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 

Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class RandomizableClassifier
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the classifier.

Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class RandomizableClassifier
Returns:
an array of strings suitable for passing to setOptions

globalInfo

public java.lang.String globalInfo()
Returns a string describing classifier

Returns:
a description suitable for displaying in the explorer/experimenter gui

reset

public void reset()
Reset the classifier.


buildClassifier

public void buildClassifier(Instances data)
                     throws java.lang.Exception
Method for building the classifier.

Specified by:
buildClassifier in interface Classifier
Parameters:
data - the set of training instances.
Throws:
java.lang.Exception - if the classifier can't be built successfully.

updateClassifier

public void updateClassifier(Instance instance)
                      throws java.lang.Exception
Updates the classifier with the given instance.

Specified by:
updateClassifier in interface UpdateableClassifier
Parameters:
instance - the new training instance to include in the model
Throws:
java.lang.Exception - if the instance could not be incorporated in the model.

distributionForInstance

public double[] distributionForInstance(Instance inst)
                                 throws java.lang.Exception
Description copied from class: AbstractClassifier
Predicts the class memberships for a given instance. If an instance is unclassified, the returned array elements must be all zero. If the class is numeric, the array must consist of only one element, which contains the predicted value. Note that a classifier MUST implement either this or classifyInstance().

Specified by:
distributionForInstance in interface Classifier
Overrides:
distributionForInstance in class AbstractClassifier
Parameters:
inst - the instance to be classified
Returns:
an array containing the estimated membership probabilities of the test instance in each class or the numeric prediction
Throws:
java.lang.Exception - if distribution could not be computed successfully
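
For a binary linear model, the returned array has two elements, one per class. The plain-Java sketch below shows one way a linear output z = w.x + b could be turned into such an array. It is an assumption for illustration, not the SGDText code: DistributionSketch is a hypothetical class, the log loss branch uses the logistic function, and the hinge branch (without a fitted probability model) assigns all mass to the predicted class, with z <= 0 mapped to the first class.

```java
/** Illustrative only: map a linear output to a two-class distribution. */
final class DistributionSketch {
    static final int HINGE = 0;
    static final int LOGLOSS = 1;

    /** Returns {P(class 0), P(class 1)} for linear output z. */
    static double[] distribution(int loss, double z) {
        double[] result = new double[2];
        if (loss == LOGLOSS) {
            // Logistic regression: the sigmoid of z is the class 1 probability.
            result[1] = 1.0 / (1.0 + Math.exp(-z));
            result[0] = 1.0 - result[1];
        } else if (z <= 0) {
            // Assumed tie-breaking: z <= 0 predicts the first class.
            result[0] = 1.0;
        } else {
            result[1] = 1.0;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] soft = distribution(LOGLOSS, 0.0);
        System.out.println(soft[0] + " " + soft[1]); // prints 0.5 0.5
        double[] hard = distribution(HINGE, 2.0);
        System.out.println(hard[0] + " " + hard[1]); // prints 0.0 1.0
    }
}
```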

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

getDictionary

public java.util.LinkedHashMap<java.lang.String,SGDText.Count> getDictionary()
Get this model's dictionary (including term weights).

Returns:
this model's dictionary.

getDictionarySize

public int getDictionarySize()
Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).

Returns:
the size of the dictionary.

bias

public double bias()

setBias

public void setBias(double bias)

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface RevisionHandler
Overrides:
getRevision in class AbstractClassifier
Returns:
the revision

main

public static void main(java.lang.String[] args)
Main method for testing this class.