1. What is the primary goal of data cleaning?
a) To remove redundant features
b) To remove noise and correct errors in the dataset
c) To reduce the size of the dataset
d) To increase the dataset’s complexity
Answer: b) To remove noise and correct errors in the dataset
2. Which of the following is NOT typically considered a data quality issue that data cleaning addresses?
a) Missing data
b) Duplicate records
c) Incorrect formatting
d) Data visualization
Answer: d) Data visualization
3. What is one common method used to handle missing values in a dataset?
a) Data transformation
b) Imputation
c) Data mining
d) Data reduction
Answer: b) Imputation
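As a minimal sketch of imputation using pandas (the data below is made up for illustration), missing values can be filled with a summary statistic such as the column median:

```python
import pandas as pd

# Toy column with missing ages (illustrative data only).
ages = pd.Series([25, 30, None, 40, None, 35], dtype="float64")

# Median imputation: replace each NaN with the median of the observed values.
median_age = ages.median()          # 32.5 for this toy column
imputed = ages.fillna(median_age)

print(imputed.tolist())  # [25.0, 30.0, 32.5, 40.0, 32.5, 35.0]
```

The median is often preferred over the mean here because it is less affected by extreme values.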
4. In data cleaning, what does data deduplication refer to?
a) Removing irrelevant features from the dataset
b) Combining datasets from multiple sources
c) Identifying and removing duplicate records
d) Correcting misformatted data
Answer: c) Identifying and removing duplicate records
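A quick deduplication sketch in pandas (hypothetical records): identical rows are detected and dropped, keeping the first occurrence:

```python
import pandas as pd

# Two of these three records are exact duplicates (made-up data).
records = pd.DataFrame({
    "name":  ["Ada", "Ada", "Grace"],
    "email": ["ada@example.com", "ada@example.com", "grace@example.com"],
})

# drop_duplicates keeps the first copy of each identical row by default.
deduped = records.drop_duplicates()
```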
5. Which of the following methods is commonly used to handle outliers in the data cleaning process?
a) One-hot encoding
b) Normalization
c) Z-score transformation
d) Data imputation
Answer: c) Z-score transformation
6. What is data integration in the context of data preprocessing?
a) Combining data from multiple sources into a unified dataset
b) Scaling data to a common range
c) Converting data from one format to another
d) Removing duplicate entries from the dataset
Answer: a) Combining data from multiple sources into a unified dataset
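As a small integration sketch (hypothetical tables), pandas can combine two sources into a unified dataset via a key column:

```python
import pandas as pd

# Two sources sharing a common key, cust_id (made-up data).
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Grace"]})
orders    = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})

# Left join: every order row is enriched with its customer's name.
unified = customers.merge(orders, on="cust_id", how="left")
```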
7. What is a challenge commonly faced during data integration?
a) Ensuring all data sources have the same format
b) Normalizing numerical values
c) Dealing with missing values
d) Removing outliers
Answer: a) Ensuring all data sources have the same format
8. What does schema integration refer to during data integration?
a) Handling conflicts between data types
b) Merging datasets with the same attributes
c) Combining data based on common attributes
d) Resolving discrepancies in data definitions across different sources
Answer: d) Resolving discrepancies in data definitions across different sources
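A minimal schema-integration sketch (column names and data are hypothetical): two sources describe the same attributes under different names, so both are mapped to one target schema before combining:

```python
import pandas as pd

# Same real-world attributes, different names per source (made-up data).
source_a = pd.DataFrame({"customer_id": [1], "dob": ["1990-01-01"]})
source_b = pd.DataFrame({"CustID": [2], "birth_date": ["1985-06-15"]})

# One mapping from every source-specific name to the target schema.
to_target = {"customer_id": "cust_id", "dob": "birthdate",
             "CustID": "cust_id", "birth_date": "birthdate"}

combined = pd.concat(
    [source_a.rename(columns=to_target), source_b.rename(columns=to_target)],
    ignore_index=True,
)
```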
9. In data cleaning, what does standardization typically involve?
a) Removing rows with missing values
b) Scaling numerical data to a standard range
c) Converting categorical data into numeric form
d) Ensuring consistency in data formatting and units
Answer: d) Ensuring consistency in data formatting and units
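A small sketch of unit standardization (the column names and conversion case are hypothetical): a measurement recorded in mixed units is converted to a single consistent unit:

```python
import pandas as pd

# Heights recorded inconsistently, in centimetres and inches (made-up data).
df = pd.DataFrame({"height": [180.0, 70.0], "unit": ["cm", "in"]})

# Standardize everything to centimetres (1 inch = 2.54 cm).
df["height_cm"] = df.apply(
    lambda r: r["height"] * 2.54 if r["unit"] == "in" else r["height"],
    axis=1,
)
```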
10. When integrating data from multiple sources, which issue is likely to arise?
a) Data transformation
b) Inconsistent data formats
c) Feature selection
d) Data visualization
Answer: b) Inconsistent data formats
11. Which technique can be used to handle categorical data when performing data cleaning?
a) Imputation
b) One-hot encoding
c) Data reduction
d) Normalization
Answer: b) One-hot encoding
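A one-hot encoding sketch with pandas (toy data): each category becomes its own indicator column:

```python
import pandas as pd

# A categorical column with two distinct values (illustrative data).
colors = pd.DataFrame({"color": ["red", "green", "red"]})

# get_dummies expands the column into one indicator column per category.
encoded = pd.get_dummies(colors, columns=["color"])
```

Note that pandas versions differ in the indicator dtype (bool vs. integer), but the column layout is the same.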
12. What is the best approach when data contains outliers that cannot be removed?
a) Impute missing values with the median
b) Apply robust models that are less sensitive to outliers
c) Perform normalization
d) Ignore them, as they do not affect the model
Answer: b) Apply robust models that are less sensitive to outliers
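The intuition behind robustness can be sketched with a toy comparison (made-up numbers): a single outlier moves the mean substantially but leaves the median untouched, which is why median-based statistics and robust estimators tolerate outliers better:

```python
import numpy as np

clean = np.array([10.0, 11.0, 9.0, 10.0])
with_outlier = np.append(clean, 100.0)  # inject one extreme value

# The mean shifts dramatically; the median does not move at all here.
mean_shift = abs(with_outlier.mean() - clean.mean())
median_shift = abs(np.median(with_outlier) - np.median(clean))

print(mean_shift, median_shift)  # 18.0 0.0
```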
13. Which of the following is an important step during the data cleaning process to ensure accurate analysis?
a) Removing all missing data
b) Removing irrelevant features
c) Ensuring data consistency and integrity
d) Reducing the dimensionality of data
Answer: c) Ensuring data consistency and integrity
14. Which of the following is an example of data imputation in data cleaning?
a) Replacing missing values with the mean of the column
b) Removing rows with missing data
c) Merging data from multiple sources
d) Scaling numerical values to a range between 0 and 1
Answer: a) Replacing missing values with the mean of the column
15. What does entity resolution aim to achieve in data integration?
a) Standardizing data formats
b) Identifying and merging records that refer to the same real-world entity
c) Removing duplicates within a single dataset
d) Mapping data to a target schema
Answer: b) Identifying and merging records that refer to the same real-world entity
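As a minimal entity-resolution sketch in pure Python (the records and the matching rule are simplified assumptions; real systems use fuzzy matching and probabilistic linkage), records are normalized and grouped by a shared key so that different spellings of the same person resolve to one entity:

```python
# Three records; the first two refer to the same real-world person
# despite differing in case and whitespace (made-up data).
records = [
    {"name": "Ada Lovelace",  "email": "ADA@Example.com"},
    {"name": "ada lovelace ", "email": "ada@example.com"},
    {"name": "Grace Hopper",  "email": "grace@example.com"},
]

def entity_key(rec):
    # Normalize: lower-case and strip whitespace so trivially
    # different spellings produce the same key.
    return (rec["name"].strip().lower(), rec["email"].strip().lower())

# Group records sharing a normalized key into one entity.
entities = {}
for rec in records:
    entities.setdefault(entity_key(rec), []).append(rec)

print(len(entities))  # 2 distinct real-world entities
```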