1. What is the primary goal of data cleaning?
a) To remove redundant features
b) To remove noise and correct errors in the dataset
c) To reduce the size of the dataset
d) To increase the dataset’s complexity
Answer: b) To remove noise and correct errors in the dataset
2. Which of the following is NOT typically considered a data quality issue that data cleaning addresses?
a) Missing data
b) Duplicate records
c) Incorrect formatting
d) Data visualization
Answer: d) Data visualization
3. What is one common method used to handle missing values in a dataset?
a) Data transformation
b) Imputation
c) Data mining
d) Data reduction
Answer: b) Imputation
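As a minimal sketch of imputation using pandas (the data below is made up for illustration), missing values can be filled with a summary statistic such as the column median:

```python
import pandas as pd

# Toy column with missing ages (illustrative data only).
ages = pd.Series([25, 30, None, 40, None, 35], dtype="float64")

# Median imputation: replace each NaN with the median of the observed values.
median_age = ages.median()          # 32.5 for this toy column
imputed = ages.fillna(median_age)

print(imputed.tolist())  # [25.0, 30.0, 32.5, 40.0, 32.5, 35.0]
```

The median is often preferred over the mean here because it is less affected by extreme values.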
4. In data cleaning, what does data deduplication refer to?
a) Removing irrelevant features from the dataset
b) Combining datasets from multiple sources
c) Identifying and removing duplicate records
d) Correcting misformatted data
Answer: c) Identifying and removing duplicate records
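A quick deduplication sketch in pandas (hypothetical records): identical rows are detected and dropped, keeping the first occurrence:

```python
import pandas as pd

# Two of these three records are exact duplicates (made-up data).
records = pd.DataFrame({
    "name":  ["Ada", "Ada", "Grace"],
    "email": ["ada@example.com", "ada@example.com", "grace@example.com"],
})

# drop_duplicates keeps the first copy of each identical row by default.
deduped = records.drop_duplicates()
```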
5. Which of the following methods is commonly used to handle outliers in the data cleaning process?
a) One-hot encoding
b) Normalization
c) Z-score transformation
d) Data imputation
Answer: c) Z-score transformation
6. What is data integration in the context of data preprocessing?
a) Combining data from multiple sources into a unified dataset
b) Scaling data to a common range
c) Converting data from one format to another
d) Removing duplicate entries from the dataset
Answer: a) Combining data from multiple sources into a unified dataset
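As a small integration sketch (hypothetical tables), pandas can combine two sources into a unified dataset via a key column:

```python
import pandas as pd

# Two sources sharing a common key, cust_id (made-up data).
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Grace"]})
orders    = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# Left join: every order row is enriched with its customer's name.
unified = customers.merge(orders, on="cust_id", how="left")
```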
7. What is a challenge commonly faced during data integration?
a) Ensuring all data sources have the same format
b) Normalizing numerical values
c) Dealing with missing values
d) Removing outliers
Answer: a) Ensuring all data sources have the same format
8. What does schema integration refer to during data integration?
a) Handling conflicts between data types
b) Merging datasets with the same attributes
c) Combining data based on common attributes
d) Resolving discrepancies in data definitions across different sources
Answer: d) Resolving discrepancies in data definitions across different sources
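A minimal schema-integration sketch (column names and data are hypothetical): two sources describe the same attributes under different names, so both are mapped to one target schema before combining:

```python
import pandas as pd

# Same real-world attributes, different names per source (made-up data).
source_a = pd.DataFrame({"customer_id": [1], "dob": ["1990-01-01"]})
source_b = pd.DataFrame({"CustID": [2], "birth_date": ["1985-06-15"]})

# One mapping from every source-specific name to the target schema.
to_target = {"customer_id": "cust_id", "dob": "birthdate",
             "CustID": "cust_id", "birth_date": "birthdate"}

combined = pd.concat(
    [source_a.rename(columns=to_target), source_b.rename(columns=to_target)],
    ignore_index=True,
)
```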
9. In data cleaning, what does standardization typically involve?
a) Removing rows with missing values
b) Scaling numerical data to a standard range
c) Converting categorical data into numeric form
d) Ensuring consistency in data formatting and units
Answer: d) Ensuring consistency in data formatting and units
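A small sketch of unit standardization (the column names and conversion case are hypothetical): a measurement recorded in mixed units is converted to a single consistent unit:

```python
import pandas as pd

# Heights recorded inconsistently, in centimetres and inches (made-up data).
df = pd.DataFrame({"height": [180.0, 70.0], "unit": ["cm", "in"]})

# Standardize everything to centimetres (1 inch = 2.54 cm).
df["height_cm"] = df.apply(
    lambda r: r["height"] * 2.54 if r["unit"] == "in" else r["height"],
    axis=1,
)
```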
10. When integrating data from multiple sources, which issue is likely to arise?
a) Data transformation
b) Inconsistent data formats
c) Feature selection
d) Data visualization
Answer: b) Inconsistent data formats
11. Which technique can be used to handle categorical data when performing data cleaning?
a) Imputation
b) One-hot encoding
c) Data reduction
d) Normalization
Answer: b) One-hot encoding
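A one-hot encoding sketch with pandas (toy data): each category becomes its own indicator column:

```python
import pandas as pd

# A categorical column with two distinct values (illustrative data).
colors = pd.DataFrame({"color": ["red", "green", "red"]})

# get_dummies expands the column into one indicator column per category.
encoded = pd.get_dummies(colors, columns=["color"])
```

Note that pandas versions differ in the indicator dtype (bool vs. integer), but the column layout is the same.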
12. What is the best approach when data contains outliers that cannot be removed?
a) Impute missing values with the median
b) Apply robust models that are less sensitive to outliers
c) Perform normalization
d) Ignore them, as they do not affect the model
Answer: b) Apply robust models that are less sensitive to outliers
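The intuition behind robustness can be sketched with a toy comparison (made-up numbers): a single outlier moves the mean substantially but leaves the median untouched, which is why median-based statistics and robust estimators tolerate outliers better:

```python
import numpy as np

clean = np.array([10.0, 11.0, 9.0, 10.0])
with_outlier = np.append(clean, 100.0)  # inject one extreme value

# The mean shifts dramatically; the median does not move at all here.
mean_shift = abs(with_outlier.mean() - clean.mean())
median_shift = abs(np.median(with_outlier) - np.median(clean))

print(mean_shift, median_shift)  # 18.0 0.0
```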
13. Which of the following is an important step during the data cleaning process to ensure accurate analysis?
a) Removing all missing data
b) Removing irrelevant features
c) Ensuring data consistency and integrity
d) Reducing the dimensionality of data
Answer: c) Ensuring data consistency and integrity
14. Which of the following is an example of data imputation in data cleaning?
a) Replacing missing values with the mean of the column
b) Removing rows with missing data
c) Merging data from multiple sources
d) Scaling numerical values to a range between 0 and 1
Answer: a) Replacing missing values with the mean of the column
15. What does entity resolution aim to achieve in data integration?
a) Standardizing data formats
b) Identifying and merging records that refer to the same real-world entity
c) Removing duplicates within a single dataset
d) Mapping data to a target schema
Answer: b) Identifying and merging records that refer to the same real-world entity
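As a minimal entity-resolution sketch in pure Python (the records and the matching rule are simplified assumptions; real systems use fuzzy matching and probabilistic linkage), records are normalized and grouped by a shared key so that different spellings of the same person resolve to one entity:

```python
# Three records; the first two refer to the same real-world person
# despite differing in case and whitespace (made-up data).
records = [
    {"name": "Ada Lovelace",  "email": "ADA@Example.com"},
    {"name": "ada lovelace ", "email": "ada@example.com"},
    {"name": "Grace Hopper",  "email": "grace@example.com"},
]

def entity_key(rec):
    # Normalize: lower-case and strip whitespace so trivially
    # different spellings produce the same key.
    return (rec["name"].strip().lower(), rec["email"].strip().lower())

# Group records sharing a normalized key into one entity.
entities = {}
for rec in records:
    entities.setdefault(entity_key(rec), []).append(rec)

print(len(entities))  # 2 distinct real-world entities
```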