
Sunday, March 6, 2022

Data Science and Big Data Analytics

 

Discovering, Analyzing, Visualizing and Presenting Data

Advanced Analytical Theory and Methods: Clustering

 

Key Concepts

1. Centroid

2. Clustering

3. K-means

4. Unsupervised

5. Within Sum of Squares

Building upon the introduction to R presented in Chapter 3, “Review of Basic Data Analytic Methods Using R,” Chapter 4, “Advanced Analytical Theory and Methods: Clustering” through Chapter 9, “Advanced Analytical Theory and Methods: Text Analysis” describe several commonly used analytical methods that may be considered for the Model Planning and Execution phases (Phases 3 and 4) of the Data Analytics Lifecycle. This chapter considers clustering techniques and algorithms.

 

4.1 Overview of Clustering

In general, clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group the objects. For example, based on customers’ personal income, it is straightforward to divide the customers into three groups depending on arbitrarily selected values. The customers could be divided into three groups as follows:

• Earn less than $10,000

• Earn between $10,000 and $99,999

• Earn $100,000 or more

In this case, the income levels were chosen somewhat subjectively based on easy-to-communicate points of delineation. However, such groupings do not indicate a natural affinity of the customers within each group. In other words, there is no inherent reason to believe that the customer making $90,000 will behave any differently than the customer making $110,000. As additional dimensions are introduced by adding more variables about the customers, the task of finding meaningful groupings becomes more complex. For instance, suppose variables such as age, years of education, household size, and annual purchase expenditures were considered along with the personal income variable. What are the naturally occurring groupings of customers? This is the type of question that clustering analysis can help answer.

Clustering is a method often used for exploratory analysis of the data. In clustering, there are no predictions made. Rather, clustering methods find the similarities between objects according to the object attributes and group the similar objects into clusters. Clustering techniques are utilized in marketing, economics, and various branches of science. A popular clustering method is k-means.

 

4.2 K-means

Given a collection of objects each with n measurable attributes, k-means [1] is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects’ proximity to the center of the k groups. The center is determined as the arithmetic average (mean) of each cluster’s n-dimensional vector of attributes. This section describes the algorithm to determine the k means as well as how best to apply this technique to several use cases. Figure 4.1 illustrates three clusters of objects with two attributes. Each object in the dataset is represented by a small dot color-coded to the closest large dot, the mean of the cluster.

 


Figure 4.1 Possible k-means clusters for k=3

4.2.1 Use Cases

Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Classification is covered in more detail in Chapter 7, “Advanced Analytical Theory and Methods: Classification.” Clustering is primarily an exploratory technique to discover hidden structures of the data, possibly as a prelude to more focused analysis or decision processes. Some specific applications of k-means are image processing, medical, and customer segmentation.

Image Processing

Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video. For each frame, the task is to determine which pixels are most similar to each other. The attributes of each pixel can include brightness, color, and location, the x and y coordinates in the frame. With security video images, for example, successive frames are examined to identify any changes to the clusters. These newly identified clusters may indicate unauthorized access to a facility.

Medical

Patient attributes such as age, height, weight, systolic and diastolic blood pressures, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals as well as in the field of human genetics.

Customer Segmentation

Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns. For example, a wireless provider may look at the following customer attributes: monthly bill, number of text messages, data volume consumed, minutes used during various daily periods, and years as a customer. The wireless company could then look at the naturally occurring clusters and consider tactics to increase sales or reduce the customer churn rate, the proportion of customers who end their relationship with a particular company.

4.2.2 Overview of the Method

To illustrate the method to find k clusters from a collection of M objects with n attributes, the two-dimensional case (n = 2) is examined. It is much easier to visualize the k-means method in two dimensions. Later in the chapter, the two-dimensional scenario is generalized to handle any number of attributes.

Because each object in this example has two attributes, it is useful to consider each object as corresponding to the point (x_i, y_i), where x and y denote the two attributes and i = 1, 2, …, M. For a given cluster of m points (m ≤ M), the point that corresponds to the cluster’s mean is called a centroid. In mathematics, a centroid refers to a point that corresponds to the center of mass for an object.

The k-means algorithm to find k clusters can be described in the following four steps.

1. Choose the value of k and the k initial guesses for the centroids.

In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue in Figure 4.2.

2. Compute the distance from each data point to each centroid. Assign each point to the closest centroid. This association defines the first k clusters.

In two dimensions, the distance, d, between any two points, (x_1, y_1) and (x_2, y_2), in the Cartesian plane is typically expressed by using the Euclidean distance measure provided in Equation 4.1.

d(p_1, p_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}   (Equation 4.1)
In Figure 4.3, the points closest to a centroid are shaded the corresponding color.

3. Compute the centroid, the center of mass, of each newly defined cluster from Step 2.

In Figure 4.4, the computed centroids in Step 3 are the lightly shaded points of the corresponding color. In two dimensions, the centroid (x_c, y_c) of the m points in a k-means cluster is calculated as follows in Equation 4.2.

(x_c, y_c) = \left( \frac{1}{m} \sum_{i=1}^{m} x_i, \; \frac{1}{m} \sum_{i=1}^{m} y_i \right)   (Equation 4.2)

Thus, (x_c, y_c) is the ordered pair of the arithmetic means of the coordinates of the m points in the cluster. In this step, a centroid is computed for each of the k clusters.

4. Repeat Steps 2 and 3 until the algorithm converges to an answer.

1. Assign each point to the closest centroid computed in Step 3.

2. Compute the centroid of newly defined clusters.

3. Repeat until the algorithm reaches the final answer.

Convergence is reached when the computed centroids do not change or the centroids and the assigned points oscillate back and forth from one iteration to the next. The latter case can occur when one or more points are equidistant from more than one computed centroid.
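The four steps can be expressed compactly in R. The following function is a minimal sketch for illustration only; the name simple_kmeans is hypothetical, and it omits the safeguards of R’s built-in kmeans() function, which is used later in this chapter.

simple_kmeans <- function(points, k, max_iter = 100) {
  # Step 1: choose the k initial centroids by sampling k of the points
  centroids <- points[sample(nrow(points), k), , drop = FALSE]
  assignment <- rep(0, nrow(points))
  for (iter in 1:max_iter) {
    # Step 2: compute the distance from each point to each centroid and
    # assign each point to the closest centroid
    d <- as.matrix(dist(rbind(centroids, points)))[-(1:k), 1:k, drop = FALSE]
    new_assignment <- apply(d, 1, which.min)
    # Step 4: converged when no assignment changes from the prior iteration
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recompute each centroid as the mean of its assigned points
    # (a real implementation must also handle clusters that become empty)
    for (j in 1:k)
      centroids[j, ] <- colMeans(points[assignment == j, , drop = FALSE])
  }
  list(centers = centroids, cluster = assignment)
}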

To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values (p1, p2,…pn). Then object i is described by (pi1, pi2,…pin) for i = 1,2,…, M. In other words, there is a matrix with M rows corresponding to the M objects and n columns to store the attribute values. To expand the earlier process to find the k clusters from two dimensions to n dimensions, the following equations provide the formulas for calculating the distances and the locations of the centroids for n ≥ 1.

For a given point, p_i, at (p_i1, p_i2, …, p_in) and a centroid, q, located at (q_1, q_2, …, q_n), the distance, d, between p_i and q is expressed as shown in Equation 4.3.

d(p_i, q) = \sqrt{ \sum_{j=1}^{n} (p_{ij} - q_j)^2 }   (Equation 4.3)

Similarly, the centroid, q, of a cluster of m points is the ordered n-tuple of the attribute means, calculated as shown in Equation 4.4.

q_j = \frac{1}{m} \sum_{i=1}^{m} p_{ij}, \quad j = 1, 2, \ldots, n   (Equation 4.4)
Figure 4.2 Initial starting points for the centroids

 


Figure 4.3 Points are assigned to the closest centroid

 



Figure 4.4 Compute the mean of each cluster

4.2.3 Determining the Number of Clusters

With the preceding algorithm, k clusters can be identified in a given dataset, but what value of k should be selected? The value of k can be chosen based on a reasonable guess or some predefined requirement. However, even then, it would be good to know how much better or worse having k clusters versus k – 1 or k + 1 clusters would be in explaining the structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric is examined to determine a reasonably optimal value of k. Using the distance function given in Equation 4.3, WSS is defined as shown in Equation 4.5.

WSS = \sum_{i=1}^{M} d(p_i, q^{(i)})^2   (Equation 4.5)

In other words, WSS is the sum of the squares of the distances between each data point and the closest centroid. The term q^{(i)} indicates the closest centroid that is associated with the ith point. If the points are relatively close to their respective centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the value of WSS from the case with only k clusters, there may be little benefit to adding another cluster.
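As a concrete check on Equation 4.5, the WSS can be recomputed directly from its definition. The following is a brief sketch, assuming a numeric matrix x of data points and a result km returned by R’s kmeans() function (introduced in the next section); the helper name wss_by_hand is hypothetical.

# WSS per Equation 4.5: the sum of squared distances from each point to
# its closest (assigned) centroid
wss_by_hand <- function(x, km) {
  closest <- km$centers[km$cluster, , drop = FALSE]  # q(i) for each point i
  sum(rowSums((x - closest)^2))
}
# wss_by_hand(x, km) should match sum(km$withinss)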

Using R to Perform a K-means Analysis

To illustrate how to use the WSS to determine an appropriate number, k, of clusters, the following example uses R to perform a k-means analysis. The task is to group 620 high school seniors based on their grades in three subject areas: English, mathematics, and science. The grades are averaged over their high school career and assume values from 0 to 100. The following R code establishes the necessary R libraries and imports the CSV file containing the grades.

library(plyr)

library(ggplot2)

library(cluster)

library(lattice)

library(graphics)

library(grid)

library(gridExtra)

#import the student grades

grade_input = as.data.frame(read.csv("c:/data/grades_km_input.csv"))

The following R code formats the grades for processing. The data file contains four columns. The first column holds a student identification (ID) number, and the other three columns are for the grades in the three subject areas. Because the student ID is not used in the clustering analysis, it is excluded from the k-means input matrix, kmdata.

kmdata_orig = as.matrix(grade_input[, c("Student", "English", "Math", "Science")])

kmdata <- kmdata_orig[,2:4]

kmdata[1:10,]

English Math Science

[1,] 99 96 97

[2,] 99 96 97

[3,] 98 97 97

[4,] 95 100 95

[5,] 95 96 96

[6,] 96 97 96

[7,] 100 96 97

[8,] 95 98 98

[9,] 98 96 96

[10,] 99 99 95

To determine an appropriate value for k, the k-means algorithm is used to identify clusters for k = 1, 2, …, 15. For each value of k, the WSS is calculated. If an additional cluster provides a better partitioning of the data points, the WSS should be markedly smaller than without the additional cluster.

The following R code loops through several k-means analyses for the number of centroids, k, varying from 1 to 15. For each k, the option nstart=25 specifies that the k-means algorithm will be repeated 25 times, each starting with k random initial centroids. The corresponding value of WSS for each k-means analysis is stored in the wss vector.

wss <- numeric(15)

for (k in 1:15) wss[k] <- sum(kmeans(kmdata, centers=k, nstart=25)$withinss)

Using the basic R plot function, each WSS is plotted against the respective number of centroids, 1 through 15. This plot is provided in Figure 4.5.

plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within Sum of Squares")

 


Figure 4.5 WSS of the student grade data

As can be seen, the WSS is greatly reduced when k increases from one to two. Another substantial reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly linear for k > 3. Therefore, the k-means analysis will be conducted for k = 3. The process of identifying the appropriate value of k is referred to as finding the “elbow” of the WSS curve.

km = kmeans(kmdata,3, nstart=25)

km

K-means clustering with 3 clusters of sizes 158, 218, 244

Cluster means:

English Math Science

1 97.21519 93.37342 94.86076

2 73.22018 64.62844 65.84862

3 85.84426 79.68033 81.50820

Clustering vector:

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3

[161] 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3

[201] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[241] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[281] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[321] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[361] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 2 3 2 3 3 3 2 2 2 2 3 3 2 2 2 2

[401] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[441] 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2

[481] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[521] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[561] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[601] 3 3 2 2 3 3 3 3 1 1 3 3 3 2 2 3 2 3 3 3

Within cluster sum of squares by cluster:

[1] 6692.589 34806.339 22984.131

(between_SS / total_SS = 76.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"

[6] "betweenss"    "size"         "iter"         "ifault"

The displayed contents of the variable km include the following:

• The location of the cluster means

• A clustering vector that defines the membership of each student to a corresponding cluster 1, 2, or 3

• The WSS of each cluster

• A list of all the available k-means components

The reader can find details on these components and using k-means in R by employing the help facility.

The reader may have wondered whether the k-means results stored in km are equivalent to the WSS results obtained earlier in generating the plot in Figure 4.5. The following check verifies that the results are indeed equivalent.

c( wss[3] , sum(km$withinss) )

[1] 64483.06 64483.06

In determining the value of k, the data scientist should visualize the data and assigned clusters. In the following code, the ggplot2 package is used to visualize the identified student clusters and centroids.

#prepare the student data and clustering results for plotting

df = as.data.frame(kmdata_orig[,2:4])

df$cluster = factor(km$cluster)

centers=as.data.frame(km$centers)

g1 = ggplot(data=df, aes(x=English, y=Math, color=cluster)) +
  geom_point() + theme(legend.position="right") +
  geom_point(data=centers,
             aes(x=English, y=Math, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)

g2 = ggplot(data=df, aes(x=English, y=Science, color=cluster)) +
  geom_point() +
  geom_point(data=centers,
             aes(x=English, y=Science, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)

g3 = ggplot(data=df, aes(x=Math, y=Science, color=cluster)) +
  geom_point() +
  geom_point(data=centers,
             aes(x=Math, y=Science, color=as.factor(c(1,2,3))),
             size=10, alpha=.3, show.legend=FALSE)

grid.arrange(arrangeGrob(g1 + theme(legend.position="none"),
                         g2 + theme(legend.position="none"),
                         g3 + theme(legend.position="none"),
                         top="High School Student Cluster Analysis",
                         ncol=1))

The resulting plots are provided in Figure 4.6. The large circles represent the location of the cluster means provided earlier in the display of the km contents. The small dots represent the students corresponding to the appropriate cluster by assigned color: red, blue, or green. In general, the plots indicate the three clusters of students: the top academic students (red), the academically challenged students (green), and the other students (blue) who fall somewhere between those two groups. The plots also highlight which students may excel in one or two subject areas but struggle in other areas.

 


Figure 4.6 Plots of the identified student clusters

Assigning labels to the identified clusters is useful to communicate the results of an analysis. In a marketing context, it is common to label a group of customers as frequent shoppers or big spenders. Such designations are especially useful when communicating the clustering results to business users or executives. It is better to describe the marketing plan for big spenders rather than Cluster #1.

4.2.4 Diagnostics

The heuristic using WSS can provide at least several possible k values to consider. When the number of attributes is relatively small, a common approach to further refine the choice of k is to plot the data to determine how distinct the identified clusters are from each other. In general, the following questions should be considered.

• Are the clusters well separated from each other?

• Do any of the clusters have only a few points?

• Do any of the centroids appear to be too close to each other?

In the first case, ideally the plot would look like the one shown in Figure 4.7, when n = 2. The clusters are well defined, with considerable space between the four identified clusters. However, in other cases, such as Figure 4.8, the clusters may be close to each other, and the distinction may not be so obvious.

 


Figure 4.7 Example of distinct clusters

 


Figure 4.8 Example of less obvious clusters

In such cases, it is important to apply some judgment on whether anything different will result by using more clusters. For example, Figure 4.9 uses six clusters to describe the same dataset as used in Figure 4.8. If using more clusters does not better distinguish the groups, it is almost certainly better to go with fewer clusters.

 


Figure 4.9 Six clusters applied to the points from Figure 4.8

4.2.5 Reasons to Choose and Cautions

K-means is a simple and straightforward method for defining clusters. Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object’s distance from the closest centroid. Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis.

Although k-means is considered an unsupervised method, there are still several decisions that the practitioner must make:

• What object attributes should be included in the analysis?

• What unit of measure (for example, miles or kilometers) should be used for each attribute?

• Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?

• What other considerations might apply?

Object Attributes

Regarding which object attributes (for example, age and income) to use in the analysis, it is important to understand what attributes will be known at the time a new object will be assigned to a cluster. For example, information on existing customers’ satisfaction or purchase frequency may be available, but such information may not be available for potential customers.

The data scientist may have a choice of a dozen or more attributes to use in the clustering analysis. Whenever possible, it is best to reduce the number of attributes based on what the data supports. Too many attributes can dilute the impact of the most important variables. Also, the use of several similar attributes can place too much importance on one type of attribute. For example, if five attributes related to personal wealth are included in a clustering analysis, the wealth attributes dominate the analysis and possibly mask the importance of other attributes, such as age.

When dealing with the problem of too many attributes, one useful approach is to identify any highly correlated attributes and use only one or two of the correlated attributes in the clustering analysis. As illustrated in Figure 4.10, a scatterplot matrix, as introduced in Chapter 3, is a useful tool to visualize the pair-wise relationships between the attributes.

 


Figure 4.10 Scatterplot matrix for seven attributes

The strongest relationship is observed to be between Attribute3 and Attribute7. If the value of one of these two attributes is known, it appears that the value of the other attribute is known with near certainty. Other linear relationships are also identified in the plot. For example, consider the plot of Attribute2 against Attribute3. If the value of Attribute2 is known, there is still a wide range of possible values for Attribute3. Thus, greater consideration must be given prior to dropping one of these attributes from the clustering analysis.
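Before dropping an attribute, it can help to quantify the pair-wise relationships seen in the scatterplot matrix. The following is a brief sketch, assuming a numeric data frame attr_df holding the seven attributes; both the name and the 0.9 threshold are illustrative choices.

pairs(attr_df)      # scatterplot matrix, as in Figure 4.10
cm <- cor(attr_df)  # pair-wise correlation matrix
# list attribute pairs whose correlation magnitude exceeds 0.9
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(attr1 = rownames(cm)[high[, 1]],
           attr2 = colnames(cm)[high[, 2]],
           correlation = cm[high])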

Another option to reduce the number of attributes is to combine several attributes into one measure. For example, instead of using two attribute variables, one for Debt and one for Assets, a Debt to Asset ratio could be used. This option also addresses the problem when the magnitude of an attribute is not of real interest, but the relative magnitude is a more important measure.
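Such a combined measure is a simple transformation. A minimal sketch, assuming a data frame customers with hypothetical Debt and Assets columns:

# replace two related wealth attributes with one relative measure
customers$DebtToAssets <- customers$Debt / customers$Assets
# use the ratio, rather than the raw Debt and Assets values, in the analysis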

Units of Measure

From a computational perspective, the k-means algorithm is somewhat indifferent to the units of measure for a given attribute (for example, meters or centimeters for a patient’s height). However, the algorithm will identify different clusters depending on the choice of the units of measure. For example, suppose that k-means is used to cluster patients based on age in years and height in centimeters. For k=2, Figure 4.11 illustrates the two clusters that would be determined for a given dataset.

 


Figure 4.11 Clusters with height expressed in centimeters

But if the height was rescaled from centimeters to meters by dividing by 100, the resulting clusters would be slightly different, as illustrated in Figure 4.12.

 


Figure 4.12 Clusters with height expressed in meters

When the height is expressed in meters, the magnitude of the ages dominates the distance calculation between two points. The height attribute can contribute at most the square of the difference between the maximum and the minimum height, or (2.0 − 0)^2 = 4, to the radicand, the expression under the square root symbol in the distance formula given in Equation 4.3. Age, however, can contribute as much as (80 − 0)^2 = 6,400 to the radicand when measuring the distance.

Rescaling

Attributes that are expressed in dollars are common in clustering analyses and can differ in magnitude from the other attributes. For example, if personal income is expressed in dollars and age is expressed in years, the income attribute, often exceeding $10,000, can easily dominate the distance calculation with ages typically less than 100 years.

Although some adjustments could be made by expressing the income in thousands of dollars (for example, 10 for $10,000), a more straightforward method is to divide each attribute by the attribute’s standard deviation. The resulting attributes will each have a standard deviation equal to 1 and will be without units. Returning to the age and height example, the standard deviations are 23.1 years and 36.4 cm, respectively. Dividing each attribute value by the appropriate standard deviation and performing the k-means analysis yields the result shown in Figure 4.13.

 


Figure 4.13 Clusters with rescaled attributes

With the rescaled attributes for age and height, the borders of the resulting clusters now fall somewhere between the two earlier clustering analyses. Such an occurrence is not surprising based on the magnitudes of the attributes of the previous clustering attempts. Some practitioners also subtract the means of the attributes to center the attributes around zero. However, this step is unnecessary because the distance formula is only sensitive to the scale of the attribute, not its location.
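In R, this rescaling can be performed in one step before calling kmeans(). A minimal sketch, assuming a numeric matrix patients containing the age and height attributes from the example:

# divide each attribute (column) by its standard deviation, so that each
# rescaled attribute has a standard deviation of 1 and is unitless
rescaled <- scale(patients, center = FALSE, scale = apply(patients, 2, sd))
km <- kmeans(rescaled, centers = 2, nstart = 25)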

In many statistical analyses, it is common to transform typically skewed data, such as income, with long tails by taking the logarithm of the data. Such a transformation can also be applied in k-means, but the data scientist needs to be aware of what effect this transformation will have. For example, if the base-10 logarithm of income expressed in dollars is used, the practitioner is essentially stating that, from a clustering perspective, $1,000 is as close to $10,000 as $10,000 is to $100,000; each pair is exactly one unit apart on the log scale. In many cases, the skewness of the data may be the reason to perform the clustering analysis in the first place.
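This equidistance on the log scale is easy to verify in R:

# on a log10 scale, $1,000 to $10,000 and $10,000 to $100,000 are both
# exactly one unit apart
abs(log10(1000) - log10(10000))    # 1
abs(log10(10000) - log10(100000))  # 1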

Additional Considerations

The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important to rerun the k-means analysis several times for a particular value of k to ensure the cluster results provide the overall minimum WSS. As seen earlier, this task is accomplished in R by using the nstart option in the kmeans() function call.

This chapter presented the use of the Euclidean distance function to assign the points to the closest centroids. Other possible function choices include the cosine similarity and the Manhattan distance functions. The cosine similarity function is often chosen to compare two documents based on the frequency of each word that appears in each of the documents [2]. For two points, p and q, at (p_1, p_2, …, p_n) and (q_1, q_2, …, q_n), respectively, the Manhattan distance, d_1, between p and q is expressed as shown in Equation 4.6.

d_1(p, q) = \sum_{j=1}^{n} |p_j - q_j|   (Equation 4.6)
The Manhattan distance function is analogous to the distance traveled by a car in a city, where the streets are laid out in a rectangular grid (such as city blocks). In Euclidean distance, the measurement is made in a straight line. Using Equation 4.6, the distance from (1, 1) to (4, 5) would be |1 – 4| + |1 – 5| = 7. From an optimization perspective, if there is a need to use the Manhattan distance for a clustering analysis, the median is a better choice for the centroid than use of the mean [2].
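In R, both distance measures are available through the built-in dist() function. A brief sketch reproducing the preceding example:

p <- c(1, 1)
q <- c(4, 5)
sum(abs(p - q))                          # Manhattan distance: 7
dist(rbind(p, q), method = "manhattan")  # 7, via the built-in dist()
dist(rbind(p, q), method = "euclidean")  # 5, the straight-line distance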

K-means clustering is applicable to objects that can be described by attributes that are numerical with a meaningful distance measure. From Chapter 3, interval and ratio attribute types can certainly be used. However, k-means does not handle categorical variables well. For example, suppose a clustering analysis is to be conducted on new car sales. Among other attributes, such as the sale price, the color of the car is considered important. Although one could assign numerical values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful to consider that yellow is as close to red as yellow is to green from a clustering perspective. In such cases, it may be necessary to use an alternative clustering methodology. Such methods are described in the next section.

 

4.3 Additional Algorithms

The k-means clustering method is easily applied to numeric data where the concept of distance can naturally be applied. However, it may be necessary or desirable to use an alternative clustering algorithm. As discussed at the end of the previous section, k-means does not handle categorical data. In such cases, k-modes [3] is a commonly used method for clustering categorical data based on the number of differences in the respective components of the attributes. For example, if each object has four attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3. In R, the kmodes() function is implemented in the klaR package.
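A hedged usage sketch follows, assuming a data frame cars of categorical attributes (the dataset name is hypothetical):

library(klaR)
# cluster the categorical attributes into 3 groups; the distance between
# two objects is the number of attributes on which they differ
cl <- kmodes(cars, modes = 3)
cl$cluster  # cluster membership of each object
cl$modes    # the representative mode of each cluster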

Because k-means and k-modes divide the entire dataset into distinct groups, both approaches are considered partitioning methods. A third partitioning method is known as Partitioning around Medoids (PAM) [4]. In general, a medoid is a representative object in a set of objects. In clustering, the medoids are the objects in each cluster that minimize the sum of the distances from the medoid to the other objects in the cluster. The advantage of using PAM is that the “center” of each cluster is an actual object in the dataset. PAM is implemented in R by the pam() function included in the cluster R package. The fpc R package includes a function pamk(), which uses the pam() function to find the optimal value for k.
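A brief sketch applying PAM to the student grade matrix kmdata from Section 4.2.3:

library(cluster)
pam_result <- pam(kmdata, k = 3)  # the 3 medoids are actual students
pam_result$medoids

library(fpc)
pamk_result <- pamk(kmdata)       # searches for a reasonable value of k
pamk_result$nc                    # the suggested number of clusters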

Other clustering methods include hierarchical agglomerative clustering and density clustering methods. In hierarchical agglomerative clustering, each object is initially placed in its own cluster. The clusters are then combined with the most similar cluster. This process is repeated until one cluster, which includes all the objects, exists. The R stats package includes the hclust() function for performing hierarchical agglomerative clustering. In density-based clustering methods, the clusters are identified by the concentration of points. The fpc R package includes a function, dbscan(), to perform density-based clustering analysis. Density-based clustering can be useful to identify irregularly shaped clusters.
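Hedged usage sketches of both methods follow, again using the student grade matrix kmdata; the dbscan() parameters shown are illustrative guesses that would need tuning for a real dataset.

# hierarchical agglomerative clustering: repeatedly merge the most
# similar clusters, then cut the resulting tree into 3 groups
hc <- hclust(dist(kmdata))
plot(hc)                     # dendrogram of the merge history
groups <- cutree(hc, k = 3)

# density-based clustering; points in no dense region get cluster 0
library(fpc)
db <- dbscan(kmdata, eps = 5, MinPts = 5)
db$cluster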

 

Summary

Clustering analysis groups similar objects based on the objects’ attributes. Clustering is applied in areas such as marketing, economics, biology, and medicine. This chapter presented a detailed explanation of the k-means algorithm and its implementation in R. To use k-means properly, it is important to do the following:

• Properly scale the attribute values to prevent certain attributes from dominating the other attributes.

• Ensure that the concept of distance between the assigned values within an attribute is meaningful.

• Choose the number of clusters, k, such that the sum of the Within Sum of Squares (WSS) of the distances is reasonably minimized. A plot such as the example in Figure 4.5 can be helpful in this respect.

If k-means does not appear to be an appropriate clustering technique for a given dataset, then alternative techniques such as k-modes or PAM should be considered.

Once the clusters are identified, it is often useful to label these clusters in some descriptive way. Especially when dealing with upper management, these labels are useful to easily communicate the findings of the clustering analysis. In clustering, the labels are not preassigned to each object. The labels are subjectively assigned after the clusters have been identified. Chapter 7 considers several methods to perform the classification of objects with predetermined labels. Clustering can be used with other analytical techniques, such as regression. Linear regression and logistic regression are covered in Chapter 6, “Advanced Analytical Theory and Methods: Regression.”

 

Exercises

1. Using the age and height clustering example in section 4.2.5, algebraically illustrate the impact on the measured distance when the height is expressed in meters rather than centimeters. Explain why different clusters will result depending on the choice of units for the patient’s height.

2. Compare and contrast five clustering algorithms, assigned by the instructor or selected by the student.

3. Using the ruspini dataset provided with the cluster package in R, perform a k-means analysis. Document the findings and justify the choice of k. Hint: use data(ruspini) to load the dataset into the R workspace.

 

Bibliography

[1] J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1967.

[2] P.-N. Tan, V. Kumar, and M. Steinbach, Introduction to Data Mining, Upper Saddle River, NJ: Pearson, 2013.

[3] Z. Huang, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining,” 1997. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf. [Accessed 13 March 2014].

[4] L. Kaufman and P. J. Rousseeuw, “Partitioning Around Medoids (Program PAM),” in Finding Groups in Data: An Introduction to Cluster Analysis, Hoboken, NJ: John Wiley & Sons, Inc., 2008, pp. 68-125.

