How to tune the behavior of the clustering algorithm? - Data Mining

You can tune the behavior of the clustering algorithm by tweaking the various parameters of the algorithm. The defaults handle most situations, but under certain circumstances you may find that you get better results by manipulating one or more of these knobs.

  • Clustering_Method selects the algorithm used to determine cluster membership. The vanilla versions of each algorithm bypass the scalable framework described previously and operate on only a single sample of the data. The possible values for this parameter are:
    1. Scalable EM (default)
    2. Vanilla (non-scalable) EM
    3. Scalable K-means
    4. Vanilla (non-scalable) K-means
  • Cluster_Count is the “K” in K-means (and it would be the “K” in EM, if EM had a K). Cluster_Count tells the algorithm how many clusters to find. Set this parameter to a number that makes sense for your business problem. If you can comprehend eight clusters, set it to eight and see what you find. In practice, the more attributes you have, the more clusters you need to describe your data correctly. If you have too many attributes, you may want to organize your data ahead of time to reduce their number. Using the movie retailer as an example, instead of clustering by the individual movies that your customers watched, you could cluster by the genres of those movies. This technique substantially reduces the attribute cardinality and creates much more meaningful models. Setting Cluster_Count to 0 causes the algorithm to use a heuristic to guess the number of clusters in the data. The default value is 10.
  • Minimum_Support controls when a cluster is considered “empty” and is therefore discarded and reinitialized. Usually you will not need to modify this parameter, except when business rules apply. For example, for privacy reasons you may not want to create clusters smaller than 10 people. Note that this number is used only internally during training, and because of the nature of soft clustering, some clusters may still report membership lower than this amount after training. Setting this number too high can produce poor results. The default value is 1.
  • Modelling_Cardinality controls how many candidate models are generated during clustering. Reducing this value improves performance, at the potential cost of accuracy. The default value is 10.
  • Stopping_Tolerance is used by the algorithm to determine when a model has converged. It is the maximum number of cases that can change membership between iterations for the model to still be considered converged. This value is checked at each iteration of the internal clustering loop and again at the outer scalable step. Increasing this number causes the algorithm to converge more quickly, resulting in fuzzier clusters, while decreasing it results in tighter clusters. If you have a small data set or very distinct clusters, you can set this value to 1. The default value is 10.
  • Sample_Size indicates the number of cases used in each step of the scalable framework. When using the vanilla versions of the algorithm, Sample_Size indicates the total number of cases seen. Reducing this value can cause the algorithm to converge early without seeing all of the data, especially when coupled with a large Stopping_Tolerance. This can be useful for creating a quick clustering on a large dataset. Setting this value to 0 will cause the algorithm to use all available memory on the server. Note that due to the nature of the scalable framework, this can cause the algorithm to produce slightly different results with different memory configurations. The default value is 50,000.

    Figure: One-dimensional EM clustering with a stopping tolerance of 1 (top) and a stopping tolerance of 10 (bottom).

  • Cluster_Seed is the random number seed used to initialize the clusters. This parameter lets you test how sensitive your data is to the initialization point. If your models stay relatively stable when you change this value, you can be more confident that the segmentation reflects real structure in your data rather than an accident of initialization. The default value is 0.
  • Maximum_Input_Attributes controls how many attributes can be considered for clustering before automatic feature selection is invoked. If your data set has more than this number of attributes, feature selection chooses the most popular attributes from the set, and the unselected attributes are ignored during clustering. This limit exists because the number of attributes has a significant impact on performance. The default value is 255.
  • Maximum_States controls how many states a single attribute can have. If an attribute contains more than this number of states, the most popular states are kept and all the others are grouped into a single “other” state. This limit exists because high-cardinality attributes have a significant impact on performance and memory. The default value is 100.
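
The Stopping_Tolerance rule described above can be illustrated with a toy one-dimensional K-means loop. This is only a sketch under stated assumptions, not the product's implementation: the function `kmeans_1d`, its arguments, and the sample data are hypothetical names chosen for illustration. The loop stops as soon as the number of cases that changed cluster in an iteration is at most the tolerance, and the `seed` argument plays a role similar to Cluster_Seed.

```python
import random


def kmeans_1d(values, k, stopping_tolerance=10, seed=0, max_iter=100):
    """Toy 1-D K-means (hypothetical helper, not the server algorithm).

    Stops once at most `stopping_tolerance` cases change cluster
    between two consecutive iterations.
    """
    rng = random.Random(seed)
    centers = rng.sample(values, k)      # seed-dependent initialization
    assignment = [-1] * len(values)      # -1 means "not assigned yet"
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assign every case to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        assignment = new_assignment
        if changed <= stopping_tolerance:
            break                        # considered converged
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assignment, iteration


# Made-up data: three loose groups on a line.
data_rng = random.Random(42)
data = [data_rng.gauss(mu, 1.5) for mu in (0, 5, 10) for _ in range(200)]

_, _, tight_iters = kmeans_1d(data, k=3, stopping_tolerance=1)
_, _, loose_iters = kmeans_1d(data, k=3, stopping_tolerance=50)
print("iterations with tolerance 1: ", tight_iters)
print("iterations with tolerance 50:", loose_iters)
```

With the same seed, the looser tolerance can never need more iterations than the tighter one, which mirrors the trade-off between speed and cluster tightness. Rerunning with different `seed` values is a simple way to check how stable the resulting centers are.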
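
The Maximum_States behavior can be pictured as a simple preprocessing step. The sketch below assumes that "most popular" means "most frequent" and that the overflow bucket is not counted against the limit; both are assumptions, and `cap_states` and the label `"other"` are hypothetical names rather than anything exposed by the server.

```python
from collections import Counter


def cap_states(values, maximum_states=100, other_label="other"):
    """Keep the `maximum_states` most frequent states of one attribute
    and map every remaining state to a single overflow label.

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)
    keep = {state for state, _ in counts.most_common(maximum_states)}
    return [v if v in keep else other_label for v in values]


# Made-up genre column with two frequent and two rare states.
genres = ["drama"] * 5 + ["comedy"] * 3 + ["horror"] + ["noir"]
capped = cap_states(genres, maximum_states=2)
print(Counter(capped))
```

Capping states this way trades detail in the rare values for lower memory use and faster training, which is the motivation the parameter description gives.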

All rights reserved © 2020 Wisdom IT Services India Pvt. Ltd
