1. In K-means clustering, what does the “K” represent?
- A) The number of features
- B) The number of clusters
- C) The number of iterations
- D) The number of nearest neighbors
Answer: B) The number of clusters
Explanation: “K” represents the number of clusters that the K-means algorithm will create from the dataset.
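For a concrete illustration, here is a minimal scikit-learn sketch on a made-up toy dataset; `n_clusters` is the K in question:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups of points
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# K = n_clusters: the number of clusters K-means will produce
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # one cluster label (0 or 1) per data point
```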
2. Which of the following is the objective of the K-means algorithm?
- A) Minimize the within-cluster variance
- B) Maximize the distance between clusters
- C) Minimize the number of clusters
- D) Maximize the number of features in each cluster
Answer: A) Minimize the within-cluster variance
Explanation: The goal of K-means is to partition the data into K clusters that minimize the within-cluster sum of squared distances (often called inertia) between each data point and its assigned cluster centroid.
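scikit-learn exposes this objective as the fitted model's `inertia_` attribute; the sketch below (toy data, an assumed K of 2) compares it with a manually computed sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squared distances K-means minimizes
manual = sum(np.sum((X[km.labels_ == k] - c) ** 2)
             for k, c in enumerate(km.cluster_centers_))
print(km.inertia_, manual)  # the two values should match (up to float error)
```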
3. What is the first step in the K-means clustering algorithm?
- A) Assign data points to the nearest cluster centroid
- B) Calculate the distance between all points and centroids
- C) Randomly initialize K centroids
- D) Calculate the mean of all data points
Answer: C) Randomly initialize K centroids
Explanation: The K-means algorithm begins by randomly selecting K initial centroids (cluster centers) from the dataset.
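A short sketch of how initialization can be controlled in scikit-learn; note that the library defaults to k-means++ seeding, which spreads the initial centroids apart, rather than purely random selection:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Purely random initial centroids (closest to the textbook description)
km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# k-means++ seeding spreads the initial centroids apart (sklearn's default)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km_random.inertia_, km_pp.inertia_)
```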
4. What happens if the value of “K” is set too high in K-means clustering?
- A) The model overfits, creating very small clusters.
- B) The model underfits, creating fewer clusters.
- C) The algorithm will not converge.
- D) The clusters will become more distinct.
Answer: A) The model overfits, creating very small clusters.
Explanation: Setting “K” too high can lead to overfitting, where many small, less meaningful clusters are formed, thus reducing the interpretability of the clustering solution.
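One way to see this is that the within-cluster sum of squares keeps shrinking as K grows, all the way to zero when every point becomes its own cluster; a rough sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=1)

# Inertia always decreases as K grows, so a very large K "overfits":
# many tiny clusters, near-zero inertia, little interpretability.
for k in (3, 10, 30, 60):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))  # K = 60 gives one point per cluster, inertia 0
```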
5. Which of the following is a limitation of the K-means algorithm?
- A) It works only with numerical data.
- B) It is sensitive to the initial placement of the centroids.
- C) It does not scale well with large datasets.
- D) It automatically determines the number of clusters.
Answer: B) It is sensitive to the initial placement of the centroids.
Explanation: K-means can converge to local optima depending on the initial positions of the centroids, which can lead to different results for different initializations.
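A quick sketch of this sensitivity: forcing a single run (`n_init=1`) with purely random starts can land in different local optima, visible as different final inertia values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=7)

# With a single run (n_init=1), different random starts can converge to
# different local optima, visible as different final inertia values.
for seed in range(5):
    km = KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))

# Mitigation: restart several times (n_init > 1) and/or use k-means++ seeding.
```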
6. Which of the following methods can be used to select the optimal number of clusters (K) for K-means clustering?
- A) Elbow method
- B) Silhouette score
- C) Gap statistic
- D) All of the above
Answer: D) All of the above
Explanation: The Elbow method, Silhouette score, and Gap statistic are all commonly used methods to determine the optimal value of K in K-means clustering.
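As an illustration, the sketch below computes the inertia (for the Elbow method) and the Silhouette score over a range of K values on synthetic data; the Gap statistic is not built into scikit-learn and would need a separate implementation or package:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=3)

# Elbow method: watch where the drop in inertia levels off.
# Silhouette score: higher is better (ranges from -1 to 1).
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```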
7. Which distance metric is typically used in K-means clustering to calculate the distance between points and centroids?
- A) Manhattan distance
- B) Euclidean distance
- C) Minkowski distance
- D) Cosine similarity
Answer: B) Euclidean distance
Explanation: K-means typically uses (squared) Euclidean distance to measure how close each data point is to the centroids. Because the centroid update takes the mean of the assigned points, which is exactly the minimizer of squared Euclidean distance, this metric is the natural choice for continuous numerical data.
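A small NumPy sketch of the distance computation that the assignment step relies on (toy points and centroids):

```python
import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
centroids = np.array([[2.0, 2.0], [8.0, 8.0]])

# Euclidean distance from every point to every centroid, via broadcasting
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(dists)            # shape (n_points, n_centroids)
print(dists.argmin(1))  # index of the nearest centroid for each point
```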
8. What is the role of centroids in K-means clustering?
- A) They represent the center of a cluster.
- B) They are data points that are assigned to clusters.
- C) They measure the distance between clusters.
- D) They are the final outputs of the clustering process.
Answer: A) They represent the center of a cluster.
Explanation: Centroids are the central points that represent the mean position of all the data points within a cluster in K-means clustering.
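In scikit-learn, the fitted centroids are available as `cluster_centers_`, one row per cluster; a minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One centroid per cluster, each being the mean of its assigned points
print(km.cluster_centers_.shape)  # (3, 2): K rows, one per cluster
```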
9. In which of the following situations is K-means clustering likely to perform poorly?
- A) When the clusters are globular and well-separated
- B) When the data has outliers or noise
- C) When the data has only a few features
- D) When the number of clusters is very large
Answer: B) When the data has outliers or noise
Explanation: K-means is sensitive to outliers because they can significantly affect the position of the centroids, leading to suboptimal clustering results.
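A tiny numerical illustration of why: the centroid is a mean, and a single extreme point can drag that mean far from the bulk of the cluster:

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.0]])
outlier = np.array([[50.0, 50.0]])

# The centroid is a mean, so one extreme point drags it far away
print(cluster.mean(axis=0))                        # ~[1.05, 0.98]
print(np.vstack([cluster, outlier]).mean(axis=0))  # ~[10.84, 10.78]
```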
10. In K-means clustering, what does the “assignment step” involve?
- A) Recalculating the centroids
- B) Assigning each data point to the nearest centroid
- C) Determining the optimal number of clusters
- D) Initializing the centroids randomly
Answer: B) Assigning each data point to the nearest centroid
Explanation: In the assignment step, each data point is assigned to the closest centroid, forming the clusters.
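A minimal NumPy sketch of the assignment step on toy data (two fixed centroids assumed for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = np.array([[1.0, 1.5], [8.5, 8.2]])

# Assignment step: each point goes to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 0 1 1]
```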
11. What happens after the assignment step in K-means clustering?
- A) The algorithm stops and outputs the final clusters.
- B) The centroids are updated by calculating the mean of the assigned points.
- C) The data points are rearranged in a different order.
- D) The algorithm checks for convergence and stops if clusters are not changing.
Answer: B) The centroids are updated by calculating the mean of the assigned points.
Explanation: After assigning points to the closest centroids, the centroids are recalculated by taking the mean of all points assigned to each centroid.
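Continuing the toy example, a sketch of the update step that follows the assignment:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels = np.array([0, 0, 1, 1])  # result of the assignment step

# Update step: each centroid becomes the mean of the points assigned to it
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # [[1.25 1.5 ], [8.5  8.25]]
```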
12. Which of the following is a valid way to handle categorical data in K-means clustering?
- A) Convert the data into numerical format using encoding techniques like one-hot encoding.
- B) Use K-means directly on categorical data without any transformation.
- C) Apply distance metrics like Euclidean to categorical data.
- D) K-means cannot be applied to categorical data.
Answer: A) Convert the data into numerical format using encoding techniques like one-hot encoding.
Explanation: K-means requires numerical data, so categorical data must be transformed (e.g., through one-hot encoding) before applying K-means clustering.
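A minimal sketch using pandas one-hot encoding on a made-up categorical table before clustering; for data that is mostly categorical, algorithms designed for it (such as k-modes) are often a better fit:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "size":  ["S", "L", "M", "S"]})

# One-hot encode the categorical columns so every feature is numeric
X = pd.get_dummies(df).astype(float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```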
13. Which of the following is a major advantage of K-means clustering?
- A) It can handle large datasets efficiently.
- B) It automatically handles outliers.
- C) It works well with both numerical and categorical data.
- D) It guarantees a global optimal solution.
Answer: A) It can handle large datasets efficiently.
Explanation: K-means is computationally efficient and can handle large datasets, making it a popular choice for clustering tasks on big data.
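For genuinely large datasets, scikit-learn also offers `MiniBatchKMeans`, which updates centroids on small random batches; a rough sketch:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset; MiniBatchKMeans trades a little accuracy
# for much lower memory and compute per iteration.
X, _ = make_blobs(n_samples=100_000, centers=8, random_state=0)
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3,
                      random_state=0).fit(X)
print(round(mbk.inertia_, 1))
```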
14. In K-means clustering, what happens during the “update step”?
- A) The centroids are randomly initialized.
- B) Each point is reassigned to a new cluster.
- C) The centroids are recalculated based on the mean of assigned points.
- D) The number of clusters is increased or decreased.
Answer: C) The centroids are recalculated based on the mean of assigned points.
Explanation: In the update step, the centroids are recalculated by finding the mean of all points assigned to each centroid.
15. What is the “Elbow Method” used for in K-means clustering?
- A) To calculate the distance between clusters
- B) To determine the optimal number of clusters (K)
- C) To optimize the placement of centroids
- D) To identify the outliers in the dataset
Answer: B) To determine the optimal number of clusters (K)
Explanation: The Elbow Method involves plotting the sum of squared distances within clusters for different values of K and identifying the “elbow” point, where the rate of decrease slows, indicating the optimal number of clusters.
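A sketch of an elbow plot on synthetic data (matplotlib assumed), plotting inertia against K:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=2)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# The "elbow" is where the curve bends and larger K gives little improvement
plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.show()
```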