1. Which of the following is a characteristic of clustering?
- A) It is a type of supervised learning.
- B) It is a type of unsupervised learning.
- C) It requires labeled data.
- D) It involves predictions based on historical data.
Answer: B) It is a type of unsupervised learning.
Explanation: Clustering is an unsupervised learning technique used to group similar data points together without predefined labels.
2. What does the K-means algorithm aim to minimize?
- A) The number of clusters
- B) The variance within each cluster
- C) The distance between clusters
- D) The total number of data points
Answer: B) The variance within each cluster
Explanation: K-means aims to minimize the sum of squared distances (variance) between the data points and their assigned cluster centers (centroids).
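This objective can be computed directly. Below is a minimal numpy sketch (not from the question set, just an illustration) of the within-cluster sum of squared distances, often called the inertia, for a hand-assigned clustering:

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squared distances -- the quantity K-means minimizes."""
    return sum(
        np.sum((X[labels == k] - c) ** 2)
        for k, c in enumerate(centroids)
    )

# Two tight 1-D clusters around 0 and 10, with their optimal centroids.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])

print(inertia(X, labels, centroids))  # 1.0 (each point contributes 0.5**2)
```

Any other assignment or centroid placement for this data yields a larger inertia, which is exactly what the algorithm exploits.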
3. In the K-means clustering algorithm, what does “K” represent?
- A) The number of features in the dataset
- B) The number of nearest neighbors
- C) The number of clusters to form
- D) The number of iterations
Answer: C) The number of clusters to form
Explanation: “K” represents the number of clusters the algorithm will try to create from the data.
4. What is the main disadvantage of K-means clustering?
- A) It requires labeled data for training.
- B) It is sensitive to the initial placement of the centroids.
- C) It works only with numeric data.
- D) It cannot handle large datasets.
Answer: B) It is sensitive to the initial placement of the centroids.
Explanation: K-means can produce different results depending on the initial random placement of centroids, and it may converge to a local minimum.
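The local-minimum behavior is easy to reproduce. The sketch below (a plain Lloyd's iteration written for illustration, with hand-picked starting centroids) runs K-means twice on the four corners of a long, thin rectangle; one start finds the natural left/right split, the other gets stuck in a much worse top/bottom split:

```python
import numpy as np

def lloyd(X, centroids, iters=20):
    """Plain Lloyd's iterations: assign points to the nearest centroid, recompute means."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    inertia = sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))
    return labels, inertia

# Corners of a long, thin rectangle: the natural K=2 split is left pair vs right pair.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]], dtype=float)

_, good = lloyd(X, np.array([[0.0, 0.5], [10.0, 0.5]]))  # converges to left/right split
_, bad = lloyd(X, np.array([[5.0, 0.0], [5.0, 1.0]]))    # stuck in top/bottom local minimum
print(good, bad)  # 1.0 100.0
```

Both runs converge, but to very different objective values; this is why practical implementations restart from several random initializations (or use k-means++ seeding).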
5. Which of the following clustering algorithms does NOT require the number of clusters to be specified beforehand?
- A) K-means clustering
- B) DBSCAN
- C) Hierarchical clustering
- D) Both B and C
Answer: D) Both B and C
Explanation: DBSCAN and hierarchical clustering do not require the number of clusters to be specified in advance. DBSCAN uses density-based clustering, while hierarchical clustering produces a tree-like structure (dendrogram) of clusters.
6. In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), what does the term “epsilon” (ε) refer to?
- A) The maximum number of clusters that can be formed
- B) The distance threshold for considering points as neighbors
- C) The number of iterations to run
- D) The density of the data
Answer: B) The distance threshold for considering points as neighbors
Explanation: Epsilon (ε) is the maximum distance between two points for them to be considered neighbors in DBSCAN.
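As a small illustration (names here are my own, not from a library), the ε-neighborhood of a point is just the set of points within distance ε, and a point is a "core point" when that neighborhood has at least `min_samples` members:

```python
import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (its eps-neighborhood)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.where(d <= eps)[0]

X = np.array([[0.0], [0.5], [1.0], [10.0]])
neighbors = region_query(X, 1, eps=0.6)  # neighbors of the point at 0.5
print(neighbors.tolist())  # [0, 1, 2] -- the far point at 10.0 is excluded

# With min_samples = 3, the point at 0.5 qualifies as a core point.
print(len(neighbors) >= 3)  # True
```

Note that, following the usual DBSCAN convention, a point counts as its own neighbor.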
7. What is the main goal of hierarchical clustering?
- A) To partition the data into a predefined number of clusters
- B) To group data into a tree-like structure of nested clusters
- C) To minimize the within-cluster variance
- D) To maximize the distance between clusters
Answer: B) To group data into a tree-like structure of nested clusters
Explanation: Hierarchical clustering creates a hierarchy of clusters by successively merging or splitting them based on similarity, resulting in a dendrogram.
8. Which of the following clustering algorithms is best suited for detecting outliers?
- A) K-means clustering
- B) DBSCAN
- C) K-medoids
- D) Agglomerative clustering
Answer: B) DBSCAN
Explanation: DBSCAN is effective at detecting outliers because it labels points that do not meet the density requirement as noise (outliers).
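The noise-labeling behavior can be seen in a stripped-down DBSCAN. The sketch below is a simplified illustration of the algorithm's core loop (expand clusters from core points, leave low-density points marked as noise), not a production implementation:

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 marks noise (outliers)."""
    n = len(X)
    labels = np.full(n, -1)            # start with everything marked as noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])
        if len(neighbors) < min_samples:
            continue                   # not a core point; stays noise unless claimed later
        labels[i] = cluster
        queue = neighbors
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = np.where(np.linalg.norm(X - X[j], axis=1) <= eps)[0]
                if len(j_neighbors) >= min_samples:
                    queue.extend(j_neighbors)  # core point: keep expanding
        cluster += 1
    return labels

# Two dense 1-D groups plus one far-away outlier.
X = np.array([[0.0], [0.4], [0.8], [5.0], [5.4], [5.8], [20.0]])
print(dbscan(X, eps=0.5, min_samples=2).tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at 20.0 never accumulates enough neighbors, so it keeps the -1 noise label: that is the outlier-detection behavior the question refers to.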
9. In K-means clustering, what happens if “K” is set too high?
- A) The model overfits, creating very small clusters.
- B) The model underfits, creating large clusters.
- C) The algorithm converges more quickly.
- D) The clusters become less informative.
Answer: A) The model overfits, creating very small clusters.
Explanation: Setting “K” too high leads to overfitting, where many small, insignificant clusters are created, reducing the interpretability of the model.
10. Which of the following methods is used to determine the optimal number of clusters in K-means clustering?
- A) Elbow method
- B) Silhouette score
- C) Gap statistic
- D) All of the above
Answer: D) All of the above
Explanation: The Elbow method, Silhouette score, and Gap statistic are all techniques used to help determine the optimal number of clusters in K-means clustering.
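The Elbow method in particular is easy to demonstrate: plot (or print) the final inertia for increasing K and look for the point where improvements flatten out. The sketch below reuses a plain Lloyd's iteration with hand-picked starting centroids so the result is deterministic:

```python
import numpy as np

def kmeans_inertia(X, centroids, iters=20):
    """Run Lloyd's iterations from the given starting centroids; return final inertia."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))

# Two well-separated 1-D groups; try K = 1, 2, 3 with spread-out starting centroids.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
inits = {1: X[[0]], 2: X[[0, 3]], 3: X[[0, 1, 3]]}
inertias = {k: float(kmeans_inertia(X, c)) for k, c in inits.items()}
print(inertias)  # {1: 101.0, 2: 1.0, 3: 0.5} -- the sharp bend ("elbow") is at K=2
```

Inertia always decreases as K grows, so the absolute minimum is useless; the elbow marks where adding clusters stops paying for itself.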
11. What is the main difference between K-means and K-medoids clustering?
- A) K-means uses the mean of the data points in each cluster, while K-medoids uses the actual data points (medoids) as cluster centers.
- B) K-means is faster than K-medoids.
- C) K-medoids can only be used for numeric data, while K-means can handle categorical data.
- D) K-medoids requires a predefined number of clusters, while K-means does not.
Answer: A) K-means uses the mean of the data points in each cluster, while K-medoids uses the actual data points (medoids) as cluster centers.
Explanation: K-medoids, like K-means, divides data into clusters, but instead of using the mean, it selects an actual data point (medoid) as the center of each cluster.
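A small numpy illustration of the distinction: the mean of a cluster need not coincide with any observation, while the medoid is by definition the member point with the smallest total distance to the others.

```python
import numpy as np

# A single cluster: the mean need not be an actual data point, but the medoid always is.
cluster = np.array([[0.0], [1.0], [2.0], [3.0], [14.0]])

mean = cluster.mean(axis=0)                    # [4.0] -- not one of the points
dists = np.abs(cluster - cluster.T)            # pairwise distance matrix (1-D data)
medoid = cluster[dists.sum(axis=1).argmin()]   # point minimizing total distance to the rest

print(float(mean[0]), float(medoid[0]))  # 4.0 2.0
```

Because the medoid is an actual observation and distances need not be squared Euclidean, K-medoids is less sensitive to outliers (here, the point at 14.0 drags the mean to 4.0 but leaves the medoid at 2.0).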
12. What type of data is DBSCAN best suited for?
- A) Data that is linearly separable
- B) Data with noise and varying densities
- C) Large datasets with few outliers
- D) Data with a fixed number of clusters
Answer: B) Data with noise and varying densities
Explanation: DBSCAN handles noisy data well: it does not require the number of clusters in advance, finds arbitrarily shaped clusters, and labels low-density points as outliers. (One caveat: because it uses a single ε threshold, clusters whose densities differ very widely can still be challenging in practice.)
13. Which clustering technique is most appropriate for data that follows a tree-like structure?
- A) K-means
- B) DBSCAN
- C) Hierarchical clustering
- D) Gaussian mixture models
Answer: C) Hierarchical clustering
Explanation: Hierarchical clustering is ideal for tree-like structures, as it builds a hierarchy of clusters that can be visualized as a dendrogram.
14. What is the “Silhouette Score” used for in clustering?
- A) To measure the average size of the clusters
- B) To measure how similar each point is to its own cluster compared to other clusters
- C) To find the optimal number of clusters for K-means
- D) To determine the number of features in the dataset
Answer: B) To measure how similar each point is to its own cluster compared to other clusters
Explanation: The Silhouette Score is a metric used to evaluate the quality of clustering by measuring the cohesion and separation of clusters.
15. In hierarchical clustering, which of the following methods merges clusters based on the shortest distance between any two points in the clusters?
- A) Single linkage
- B) Complete linkage
- C) Average linkage
- D) Ward’s method
Answer: A) Single linkage
Explanation: Single linkage merges clusters based on the shortest distance between any two points in the clusters (also known as nearest point linkage).
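The linkage criteria above reduce to simple operations on the pairwise distance matrix. A small numpy sketch contrasting single linkage (minimum pairwise distance) with complete linkage (maximum pairwise distance):

```python
import numpy as np

def linkage_distance(A, B, method):
    """Distance between clusters A and B under single or complete linkage."""
    d = np.linalg.norm(A[:, None] - B[None, :], axis=2)  # all pairwise distances
    return float(d.min() if method == "single" else d.max())

A = np.array([[0.0], [1.0]])
B = np.array([[4.0], [9.0]])
print(linkage_distance(A, B, "single"))    # 3.0 (closest pair: 1 and 4)
print(linkage_distance(A, B, "complete"))  # 9.0 (farthest pair: 0 and 9)
```

Average linkage would instead take `d.mean()`, and Ward's method merges the pair of clusters whose union increases the total within-cluster variance the least.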