ASSESS SUPERSTEP
5.0 OBJECTIVES
After studying this chapter, you will understand the following concepts:
• Dealing with errors in data
• Principles of data analysis
• Different ways to correct errors in data
5.1 ASSESS SUPERSTEP
Data quality problems have been estimated to cause a 20% decrease in worker productivity and to explain why 40% of business initiatives fail to achieve their goals. Incorrect data can harm a reputation, misdirect resources, slow down the retrieval of information, and lead to false insights and missed opportunities.
For example, if an organization has the incorrect name or mailing address of a prospective client, their marketing materials could go to the wrong recipient. If sales data is attributed to the wrong SKU or brand, the company might invest in a product line with less than stellar customer demand.
Data profiling is the process of examining, analyzing and reviewing data to collect statistics surrounding the quality and hygiene of the dataset. Data quality refers to the accuracy, consistency, validity and completeness of data. Data profiling may also be known as data archeology, data assessment, data discovery or data quality analysis.
5.2 ERRORS
Errors are the norm, not the exception, when working with data. By now, you’ve probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot safely assume that any of the data we work with is error-free, our mission should be to find and tackle errors in the most efficient way possible.
5.2.1 Accept the Error:
If an error falls within an acceptable standard (e.g., "Navi Mumbai" recorded as "Navi Mum."), it can be accepted, and you can move on to the next data entry. But remember that accepting the error will affect data science techniques and algorithms that perform classification, such as binning, regression, clustering, and decision trees, because these processes will treat the two spellings as distinct values even though they refer to the same place. This is the easy option, but not always the best one.
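As a minimal sketch of this side effect, assuming a hypothetical city column, any counting or grouping operation treats the accepted variant as a separate category:
import pandas as pd

cities = pd.Series(["Navi Mumbai", "Navi Mum.", "Navi Mumbai", "Pune"])

# value_counts(), like binning or clustering, sees "Navi Mumbai" and
# "Navi Mum." as two distinct categories even though they are the same city.
print(cities.value_counts())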
5.2.2 Reject the Error:
Unless the nature of the missing data is "missing completely at random", deletion is a method best avoided in many cases. Deletion takes two forms:
a. Listwise: rows containing missing values are deleted.
In the above case, the entire observations for User C and User D will be ignored under listwise deletion.
b. Pairwise: only the missing values are ignored, and the analysis is carried out on the data that is available. In the above case, two separate samples will be analyzed: one with the combination of User, Device, and Transaction, and the other with the combination of User, OS, and Transaction. No observation is deleted; each sample simply ignores the variable that has a missing value in it.
Both of the above methods suffer from loss of information. Listwise deletion suffers the maximum information loss compared to pairwise deletion. The problem with pairwise deletion, however, is that even though it uses all available cases, the analyses cannot be compared with one another because the sample is different every time.
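A minimal sketch of the two deletion modes in pandas, assuming a hypothetical user table like the one described above:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "User": ["A", "B", "C", "D"],
    "Device": ["mobile", "desktop", np.nan, "tablet"],
    "OS": ["android", "ios", "ios", np.nan],
    "Transaction": [120, 85, 60, 95],
})

# Listwise deletion: drop every row with any missing value
# (User C and User D disappear entirely).
listwise = df.dropna()

# Pairwise deletion: each analysis keeps the rows that are complete
# for the variables it needs, so the two samples below differ.
device_sample = df[["User", "Device", "Transaction"]].dropna()
os_sample = df[["User", "OS", "Transaction"]].dropna()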
Use the reject-the-error option if you can afford to lose a bit of data, and only if the missing values amount to about 2% of the whole dataset or less.
5.2.3 Correct the Error:
Identify the Different Error Types:
We are going to look at a few different types of errors. Let’s take the example of a sample of people described by a number of different variables:
Can you point out a few inconsistencies? Write a few down and check your answers below!
1. First, there are empty cells for the "country" and "date of birth" variables. These are called missing attributes.
2. If you look at the "Country" column, you see a cell that contains 24. “24” is definitely not a country! This is known as a lexical error.
3. Next, you may notice in the "Height" column that there is an entry with a different unit of measure. Indeed, Rodney's height is recorded in feet and inches while the rest are recorded in meters. This is an irregularity error because the units of measure are not uniform.
4. Mark has two email addresses. This is not necessarily a problem, but if you forget about it and code an analysis program on the assumption that each person has only one email address, your program will probably crash! This is called a formatting error.
5. Look at the "date of birth" variable. There is also a formatting error here as Rob’s date of birth is not recorded in the same format as the others.
6. Samuel appears on two different rows. But how can we be sure this is the same Samuel? By his email address, of course! This is called a duplication error. But look closer: Samuel's two rows give different values for the "height" variable, 1.67m and 1.45m. This is called a contradiction error.
7. Honey is apparently 9'1". This height diverges greatly from the normal heights of human beings. This value is, therefore, referred to as an outlier.
The term outlier can indicate two different things: an atypical value and an aberration.
Deal With These Errors:
When it comes to cleansing data sets, there is no set rule. Everything you do depends on how you plan to use your data. No two data analysts will cleanse the same data set the same way—not if their objectives are different!
So there’s no set rule, but I can give you a few pointers:
1. Missing attributes will be addressed in the following chapter.
2. For the invalid country, it's possible to supply a list of authorized countries in advance, then eliminate all of the values that are not found on this list (hint: 24 will not be found). Such a list is often referred to as a dictionary (see the sketch after this list).
3. For irregularity errors, it’s more complicated! You can, for example, set a fixed format (here: a decimal number followed by the letter “m” for “meter”) and eliminate values that don’t adhere to it. But we can do better, by first detecting what unit the value is expressed in (meters or centimeters) then converting everything to the same unit.
4. For the formatting error of the duplicate email address, it all depends on what you want to do. If you won’t be looking at emails in your future analysis, there’s no need to correct this error. If, on the other hand, you want to know the proportion of people whose address ends in, for example @example.com, or @supermail.eu, etc., then you can choose between:
1. Taking the first email address and forgetting the second one.
2. Keeping all email addresses.
5. Let’s move on to the Date of Birth variable. There are many different formats; each country has its own custom when it comes to writing dates (India and North America, for example, do not use the same format). Add to this the problem of time zones! In our case, the simplest solution would be to eliminate dates that are not in the desired format month/day/year.
6. Duplicates: detect them with a unique key (here, the email address), then merge or drop the redundant rows, resolving any contradictions such as Samuel's two heights.
7. Outliers: flag values that diverge greatly from the plausible range (a human height of 9'1", for example) and decide whether they are atypical but real, or simply aberrations.
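Below is a minimal sketch of pointers 2, 3, 5, and 7, assuming hypothetical column values and an illustrative list of authorized countries:
import re
import pandas as pd

AUTHORIZED_COUNTRIES = {"India", "France", "USA"}  # the "dictionary"

def clean_country(value):
    # Pointer 2: keep only values found in the dictionary;
    # anything else (e.g. "24") becomes a missing value.
    return value if value in AUTHORIZED_COUNTRIES else None

def clean_height(value):
    # Pointer 3: detect the unit, convert everything to meters,
    # and reject irregular formats (e.g. feet and inches).
    text = str(value).strip()
    meters = re.fullmatch(r"(\d+(?:\.\d+)?)\s*m", text)
    if meters:
        return float(meters.group(1))
    centimeters = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if centimeters:
        return float(centimeters.group(1)) / 100
    return None

def clean_dob(value):
    # Pointer 5: eliminate dates not in month/day/year format
    # (invalid dates become NaT, pandas' missing-date marker).
    return pd.to_datetime(value, format="%m/%d/%Y", errors="coerce")

def is_outlier_height(meters):
    # Pointer 7: heights outside a plausible human range are outliers.
    return meters is not None and not (0.5 <= meters <= 2.5)
Each rule turns an invalid value into a missing value, which the techniques of section 5.4 can then handle.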
5.2.4 Create a Default Value:
NaN is the default missing-value marker for reasons of computational speed and convenience. It is a sentinel value: dummy data, or a flag value, that can easily be detected and worked with using functions in pandas.
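As a minimal sketch, the sentinel is easy to detect and to replace with an explicit default (the -1 flag value here is purely illustrative):
import pandas as pd
import numpy as np

s = pd.Series([3.2, np.nan, 7.5])
print(s.isna())      # the NaN sentinel is easily detected
print(s.fillna(-1))  # ...and replaced with a default/flag value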
5.3 ANALYSIS OF DATA
One of the causes of data quality issues is in source data that is housed in a patchwork of operational systems and enterprise applications. Each of these data sources can have scattered or misplaced values, outdated and duplicate records, and inconsistent (or undefined) data standards and formats across customers, products, transactions, financials and more.
Data quality problems can also arise when an enterprise consolidates data during a merger or acquisition. But perhaps the largest contributor to data quality issues is that the data are being entered, edited, maintained, manipulated and reported on by people.
To maintain the accuracy and value of the business-critical operational information that impacts strategic decision-making, businesses should implement a data quality strategy that embeds data quality techniques into their business processes, enterprise applications, and data integration.
5.3.1 Completeness:
Completeness is defined as expected comprehensiveness. Data can be complete even if optional data is missing. As long as the data meets the expectations then the data is considered complete.
For example, a customer’s first name and last name are mandatory but middle name is optional; so a record can be considered complete even if a middle name is not available.
Questions you can ask yourself: Is all the requisite information available? Do any data values have missing elements? Or are they in an unusable state?
5.3.2 Consistency:
Consistency means that data across all systems reflect the same information and are in sync with each other across the enterprise.
Examples of inconsistencies:
• A business unit status is closed but there are sales for that business unit.
• Employee status is terminated but pay status is active.
Questions you can ask yourself: Are data values the same across the data sets? Are there any distinct occurrences of the same data instances that provide conflicting information?
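A minimal cross-system check, sketched with hypothetical employee and payroll tables, surfaces the second inconsistency above:
import pandas as pd

hr = pd.DataFrame({"emp_id": [1, 2, 3],
                   "status": ["active", "terminated", "active"]})
payroll = pd.DataFrame({"emp_id": [1, 2, 3],
                        "pay_status": ["active", "active", "inactive"]})

merged = hr.merge(payroll, on="emp_id")
# Terminated employees whose pay status is still active contradict each other.
conflicts = merged[(merged["status"] == "terminated")
                   & (merged["pay_status"] == "active")]
print(conflicts)  # emp_id 2 is terminated but still being paid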
5.3.3 Timeliness:
Timeliness refers to whether information is available when it is expected and needed. Timeliness of data is very important. This is reflected in:
• Companies that are required to publish their quarterly results within a given time frame
• Customer service providing up-to-date information to customers
• Credit system checking in real-time on the credit card account activity
Timeliness depends on user expectation. Online availability of data might be required for a room-allocation system in hospitality, while nightly data could be perfectly acceptable for a billing system.
5.3.4 Conformity:
Conformity means the data follows the set of standard data definitions, such as data type, size, and format; for example, a customer's date of birth is in the format "mm/dd/yyyy".
Questions you can ask yourself: Do data values comply with the specified formats? If so, do all the data values comply with those formats?
Maintaining conformance to specific formats is important.
5.3.5 Accuracy:
Accuracy is the degree to which data correctly reflects the real-world object or event being described. Examples:
• The sales figure recorded for a business unit is its real value.
• The address of an employee in the employee database is the employee's real address.
Questions you can ask yourself: Do data objects accurately represent the "real world" values they are expected to model? Are there incorrect spellings of product or person names or addresses, or even untimely or out-of-date data?
These issues can impact operational and advanced analytics applications.
5.3.6 Integrity:
Integrity means the validity of data across relationships: it ensures that all data in a database can be traced and connected to other data.
For example, in a customer database there should be valid customers, valid addresses, and a valid relationship between them. If there is address data without a related customer, that data is not valid and is considered an orphaned record.
Questions you can ask yourself: Are any data missing important relationship linkages? The inability to link related records together may actually introduce duplication across your systems.
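A minimal sketch of an integrity check, using hypothetical customer and address tables, where an anti-join surfaces the orphaned records:
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
addresses = pd.DataFrame({"customer_id": [2, 3, 9],
                          "city": ["Pune", "Mumbai", "Delhi"]})

# Address rows whose customer_id has no parent customer are orphaned.
orphans = addresses[~addresses["customer_id"].isin(customers["customer_id"])]
print(orphans)  # customer_id 9 has no matching customer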
5.4 PRACTICAL ACTIONS
In Unit 2, you were introduced to the Python package pandas. The package provides several automatic error-management features.
5.4.1 Missing Values in Pandas:
Following are four basic processing concepts.
1. Drop the Columns Where All Elements Are Missing Values
2. Drop the Columns Where Any of the Elements Is Missing Values
3. Keep Only the Rows That Contain a Maximum of Two Missing Values
4. Fill All Missing Values with the Mean, Median, Mode, or Minimum
5.4.1.1. Drop the Columns Where All Elements Are Missing Values:
Step 1: Importing the necessary libraries:
import os
import pandas as pd
Step 2: Changing the working directory:
os.chdir(r"D:\Pandas")  # a raw string avoids treating the backslash as an escape
Pandas provides various data structures and operations for manipulating numerical data and time series. However, there can be cases where some data might be missing. In Pandas missing data is represented by two values:
• None: None is a Python singleton object that is often used for missing data in Python code.
• NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable indicators of missing or null values, as the snippet below shows. To drop null values from a DataFrame, use the dropna() function, which can drop rows or columns with null values in several different ways.
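A quick illustration of that equivalence:
import pandas as pd
import numpy as np

s = pd.Series([1.0, None, np.nan, 4.0])
print(s.isna())  # True for both the None and the np.nan entries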
Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
• axis: takes an int or string value selecting rows/columns. Input can be 0 or 'index' for rows, and 1 or 'columns' for columns.
• how: takes one of two string values ('any' or 'all'). 'any' drops the row/column if ANY value is null, and 'all' drops it only if ALL values are null.
• thresh: takes an integer value specifying the minimum number of non-null values required to keep the row/column.
• subset: a list of labels along the other axis that limits which rows/columns are considered when looking for null values.
• inplace: a boolean; if True, the changes are made in the data frame itself.
Let's take the following dataframe as an example. Here, column C contains only NaN values. Let's drop this column. For this, use the following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df # it will print the data frame
df.dropna(axis=1, how='all') # this code will delete the columns with all null values.
Here, axis=1 means columns, and how='all' means drop only the columns in which all values are NaN.
5.4.1.2. Drop the Columns Where Any of the Elements Is Missing Values:
Let's consider the same dataframe again:
Here, columns A, B, and C each contain at least one NaN value, so all three will be dropped. For this, use the following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df # it will print the data frame
df.dropna(axis=1, how='any') # this code will delete the columns with any null values.
Here, axis=1 means columns, and how='any' means drop the columns with one or more NaN values.
5.4.1.3. Keep Only the Rows That Contain a Maximum of Two Missing Values:
Let's consider the same dataframe again:
Here, row 2 has only one non-null value (three of its four entries are NaN), so it will be dropped. For this, use the following code.
Code:
# importing pandas as pd
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df
df.dropna(thresh=2)
# this code will delete the rows with more than two null values
Here, thresh=2 means a row must have at least two non-null values to be kept; with four columns, that allows a maximum of two NaN values per row.
5.4.1.4. Fill All Missing Values with the Mean, Median, Mode, or Minimum:
Another approach to handling missing values is to impute or estimate them. Missing value imputation has a long history in statistics and has been thoroughly researched. In essence, imputation uses information and relationships among the non-missing predictors to provide an estimate to fill in the missing value. The goal of these techniques is to ensure that the statistical distributions are tractable and of good enough quality to support subsequent hypothesis testing. The primary approach in this scenario is to use multiple imputations; several variations of the data set are created with different estimates of the missing values. The variations of the data sets are then used as inputs to models and the test statistic replicates are computed for each imputed data set. From these replicate statistics, appropriate hypothesis tests can be constructed and used for decision making.
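As a simplified sketch of the multiple-imputation idea (random hot-deck draws rather than the full model-based machinery), one can build several versions of a variable, each with different random fills:
import pandas as pd
import numpy as np

rng = np.random.default_rng(seed=0)
s = pd.Series([10.0, np.nan, 7.0, np.nan, 15.0])

imputed_versions = []
for _ in range(3):
    version = s.copy()
    # Fill each missing entry with a random draw from the observed values.
    draws = rng.choice(s.dropna().to_numpy(), size=s.isna().sum())
    version[s.isna()] = draws
    imputed_versions.append(version)

# Each version feeds the same analysis; the spread of results across
# versions reflects the uncertainty introduced by the missing data.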
A simple guess for a missing value is the mean, median, or mode (the most frequently occurring value) of that variable.
Replacing NaN values with mean:
In pandas, .fillna() can be used to replace NaN values with a specified value.
Here, we can see NaN values in all the columns. Let's fill each with its column's mean. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28],
                   [55, np.nan, 8, 12], [15, 14, np.nan, 8],
                   [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])
df
df.fillna(df.mean())
Output:
Here, the mean of the Apple column = (10 + 7 + 55 + 15 + 7)/5 = 18.8, so its NaN value is replaced by 18.8. Similarly, the NaN values in the Orange column are replaced with 8.25, in the Banana column with 13.8, and in the Pear column with 18.
Replacing NaN values with median:
Let’s take an example:
Here, we can see NaN values in all the columns. Let's fill each with its column's median. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28],
                   [55, np.nan, 8, 12], [15, 14, np.nan, 8],
                   [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])
df
df.fillna(df.median())
Output:
Here, the median of the Apple column's values (7, 7, 10, 15, 55) is 10, so its NaN value is replaced by 10. Similarly, the NaN values in the Orange column are replaced with 9, in the Banana column with 9, and in the Pear column with 12.
Replacing NaN values with mode:
Let's take an example. Here, we can see NaN values in all the columns. Let's fill each with its column's mode. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 8, 28],
                   [55, np.nan, 8, 12], [15, 14, np.nan, 12],
                   [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])
df
# mode() can return several values, so take the first one for each column
for column in df.columns:
    df[column] = df[column].fillna(df[column].mode()[0])
df
Output:
Here, the mode of the Apple column's values (10, 7, 55, 15, 7) is 7, so its NaN value is replaced by 7. Similarly, the NaN values in the Orange column are replaced with 14, in the Banana column with 8, and in the Pear column with 12.
Replacing NaN values with min:
Let’s take an example:
Here, we can see NaN values in all the columns. Let's fill each with its column's minimum value. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28],
                   [55, np.nan, 8, 12], [15, 14, np.nan, 8],
                   [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])
df
df.fillna(df.min())
Output:
Here, the minimum of the Apple column's values (10, 7, 55, 15, 7) is 7, so its NaN value is replaced by 7. Similarly, the NaN values in the Orange column are replaced with 1, in the Banana column with 1, and in the Pear column with 2.
5.5 LET US SUM UP
This chapter focused on dealing with errors in data. There are three main ways to handle an error: accept it, reject it, or correct it. Accepting the error is the easy option but can distort later analysis. Rejecting the error, that is, deleting data, should be used only if you can afford the loss, roughly when no more than about 2% of the dataset is affected. Correcting the error is supported by the different practical error-correction methods discussed in this chapter.
Principles of data analysis were also discussed.
Practical solutions for handling missing values were also covered: drop the columns where all elements are missing values, drop the columns where any of the elements is a missing value, keep only the rows that contain a maximum of two missing values, and fill all missing values with the mean, median, mode, minimum, or maximum of the particular numeric column.
5.6 LIST OF REFERENCES
• Python for Data Science For Dummies, John Paul Mueller and Luca Massaron, Wiley, ISBN-13: 978-8126524938
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, Wes McKinney, Shroff/O'Reilly, ISBN-13: 978-9352136414
• Data Science from Scratch: First Principles with Python, Second Edition, Joel Grus, Shroff/O'Reilly, ISBN-13: 978-9352138326
• Data Science from Scratch, Joel Grus, O'Reilly, ISBN-13: 978-1491901427
• Data Science Strategy For Dummies, Ulrika Jagare, Wiley, ISBN-13: 978-8126533367
• Pandas for Everyone: Python Data Analysis, Daniel Y. Chen, Pearson Education, ISBN-13: 978-9352869169
• Practical Data Science with R (Manning), Nina Zumel and John Mount, Dreamtech Press, ISBN-13: 978-9351194378
5.7 UNIT END QUESTIONS
1. Explain what an error in data is.
2. Explain the different ways to deal with errors.
3. Explain the principles of data analysis.
4. How will you handle missing values in Pandas? Explain.
5. Write a Python program to drop the columns where all elements are missing values.
6. Write a Python program to drop the columns where any of the elements is a missing value.
7. Write a Python program to keep only the rows that contain a maximum of two missing values.
8. Write a Python program to fill all missing values with the mean of the particular column.
9. Write a Python program to fill all missing values with the median of the particular column.
10. Write a Python program to fill all missing values with the mode of the particular column.
11. Write a Python program to fill all missing values with the minimum of the particular column.
12. Write a Python program to fill all missing values with the maximum of the particular column.