Question
A data analyst is assessing a dataset with inconsistent
categorical entries, such as "USA," "U.S.A," "United States," and "US" for the country field. Which of the following is the best approach for handling this inconsistency?Solution
Standardizing categorical entries to a single representation ensures consistency by consolidating multiple formats of the same entity into one standardized label. For example, consolidating "USA," "U.S.A," "United States," and "US" into one uniform label, like "United States," ensures that all data entries are interpreted consistently. This process is essential in data cleaning, as inconsistencies in categorical data can lead to inaccurate analysis, skewed results, and duplications in reporting. A uniform categorical format enables reliable grouping, sorting, and filtering for analysis. The other options are incorrect because: • Option 1 (Filtering duplicates) removes identical rows but doesn’t address inconsistency in a single field. • Option 2 (Using normalization) only applies to numeric scaling, not categorical consistency. • Option 3 (Applying data transformation) would encode inconsistencies rather than correct them. • Option 5 (Converting to uppercase) helps with case sensitivity but does not fully standardize variations.
What is the primary difference between Big Data and traditional data ?
In Python, which method in the Pandas library would you use to replace NaN values in a DataFrame with the median value of each column?
Which of the following Big Data processing models is based on the concept of continuous data flow processing?Â
Which of the following statements about the mean, median, and mode is true when applied to a dataset with a skewed distribution?
Which of the following is an effective method for handling inconsistent data in a merged dataset?
Which of the following testing types focuses on validating that a system meets its functional requirements?
What is the result of the following SQL query?
 SELECT department, COUNT (employee_id)Â
FROM employees GROUPBY department HAVING COUNT (e...If a dataset has a strongly skewed distribution , which measure of central tendency is most appropriate to represent the data?
In time series forecasting, what is the primary role of the ARIMA model ?
Which of the following factors should primarily determine the sample size in a data analysis project?