
Monday, February 14, 2022

DATA SCIENCE TECHNOLOGY STACK


TRANSFORM SUPERSTEP

8.0 OBJECTIVES

The objective of this chapter is to learn data transformation techniques, feature extraction techniques, missing data handling, and various techniques to categorise data into suitable groups.

 

8.1 INTRODUCTION

The Transform Superstep allows us to take data from the data vault and answer the questions raised by the investigation.

It applies standard data science techniques and methods to gain insight and knowledge about the data, which can then be transformed into actionable decisions. These results can be explained to non-data scientists.

The Transform Superstep uses the data vault from the Process step as its source data.

 

8.2 DIMENSION CONSOLIDATION

The data vault consists of five categories of data (Time, Person, Object, Location, and Event), with linked relationships and additional characteristics in satellite hubs.

To perform dimension consolidation, you start with a given relationship in the data vault and construct a sun model for that relationship.

 


8.3 SUN MODEL

The sun model technique is used by data scientists to perform consistent dimension consolidation. It allows us to explain the data relationships to the business without going into technical detail.

 

8.3.1 Person-to-Time Sun Model:

The Person-to-Time sun model explains the relationship between the Person and Time categories in the data vault. The sun model is constructed to show all the characteristics from the two data vault hub categories you plan to extract, and it explains how you will create two dimensions and a fact via the Transform step.

 


You will create two dimensions (Person and Time) with one fact (PersonBornAtTime), as shown in the figure below.

 


8.3.2 Person-to-Object Sun Model:

The Person-to-Object sun model explains the relationship between the Person and Object categories in the data vault. It shows all the characteristics from the two data vault hub categories and explains how you will create two dimensions and a fact via the Transform step.

 


8.3.3 Person-to-Location Sun Model:

The Person-to-Location sun model explains the relationship between the Person and Location categories in the data vault. It shows all the characteristics from the two data vault hub categories and explains how you will create two dimensions and a fact via the Transform step.

 


8.3.4 Person-to-Event Sun Model:

The Person-to-Event sun model explains the relationship between the Person and Event categories in the data vault.

 


8.3.5 Sun Model to Transform Step:

You must build three items: dimension Person, dimension Time, and fact PersonBornAtTime. Open your Python editor and create a file named Transform-Gunnarsson-Sun-Model.py.

import sys

import os

from datetime import datetime

from pytz import timezone

import pandas as pd

import sqlite3 as sq

import uuid

pd.options.mode.chained_assignment = None

############################################################

####

if sys.platform == 'linux' or sys.platform == 'darwin':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'

print('################################')

print('Working Base :',Base, ' using ', sys.platform)

print('################################')

############################################################

####

Company='01-Vermeulen'

############################################################

####

sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'

if not os.path.exists(sDataBaseDir):

           os.makedirs(sDataBaseDir)

sDatabaseName=sDataBaseDir + '/Vermeulen.db'

conn1 = sq.connect(sDatabaseName)

############################################################

####

sDataWarehousetDir=Base + '/99-DW'

if not os.path.exists(sDataWarehousetDir):

        os.makedirs(sDataWarehousetDir)

sDatabaseName=sDataWarehousetDir + '/datawarehouse.db'

conn2 = sq.connect(sDatabaseName)

print('\n#################################')

print('Time Dimension')

BirthZone = 'Atlantic/Reykjavik'

BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")

############################################################

####

IDTimeNumber=str(uuid.uuid4())

TimeLine=[('TimeID', [IDTimeNumber]),

('UTCDate', [BirthDateZoneStr]),

('LocalTime', [BirthDateLocal]),

('TimeZone', [BirthZone])]

# pd.DataFrame.from_items was removed in newer pandas; a dict preserves the column order
TimeFrame = pd.DataFrame(dict(TimeLine))

############################################################

####

DimTime=TimeFrame

DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)

sTable = 'Dim-Time'

print('\n#################################')

print('Storing :',sDatabaseName,'\n Table:',sTable)

print('\n#################################')

DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")

DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")

 

 

print('\n#################################')

print('Dimension Person')

print('\n#################################')

FirstName = 'Guðmundur'

LastName = 'Gunnarsson'

############################################################

###

IDPersonNumber=str(uuid.uuid4())

PersonLine=[('PersonID', [IDPersonNumber]),

('FirstName', [FirstName]),

('LastName', [LastName]),

('Zone', ['UTC']),

('DateTimeValue', [BirthDateZoneStr])]

PersonFrame = pd.DataFrame(dict(PersonLine))

############################################################

####

DimPerson=PersonFrame

DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)

############################################################

####

sTable = 'Dim-Person'

print('\n#################################')

print('Storing :',sDatabaseName,'\n Table:',sTable)

print('\n#################################')

DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")

DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")

print('\n#################################')

print('Fact - Person - time')

print('\n#################################')

IDFactNumber=str(uuid.uuid4())

PersonTimeLine=[('IDNumber', [IDFactNumber]),
('IDPersonNumber', [IDPersonNumber]),
('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame(dict(PersonTimeLine))

############################################################

####

FctPersonTime=PersonTimeFrame
FctPersonTimeIndex=FctPersonTime.set_index(['IDNumber'],inplace=False)

############################################################

####

sTable = 'Fact-Person-Time'

print('\n#################################')

print('Storing:',sDatabaseName,'\n Table:',sTable)

print('\n#################################')

FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")

FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")

Save and run Transform-Gunnarsson-Sun-Model.py; it writes the two dimensions and the fact to both the Vermeulen database and the data warehouse.

 

8.4 TRANSFORMING WITH DATA SCIENCE

8.4.1 Missing Value Treatment:

We must describe the missing-value treatment in the transformation. The missing-value treatment must be acceptable to the business community.

 

8.4.2 Why Missing Value Treatment Is Required:

Missing data in the training data set can reduce the power/fit of a model, or can lead to a biased model, because we have not analysed the behaviour and relationship with other variables correctly. It can lead to wrong predictions or classifications.
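As a minimal sketch of common treatments (the small DataFrame and its 'Age' column below are purely illustrative), you can either drop incomplete records or impute them with a summary statistic:

import numpy as np
import pandas as pd

# Hypothetical sample with missing ages
RawData = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cara', 'Dan'],
                        'Age': [34, np.nan, 29, np.nan]})

DroppedData = RawData.dropna(subset=['Age'])   # Option 1: drop rows with missing Age
ImputedData = RawData.copy()
ImputedData['Age'] = ImputedData['Age'].fillna(ImputedData['Age'].mean())   # Option 2: mean imputation

print(DroppedData)
print(ImputedData)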

 

8.4.3 Why Data Has Missing Values:

Following are some common reasons for missing data:

• Data fields were renamed during upgrades

• Mappings were incomplete during the migration processes from old systems to new systems

• Wrong table name was provided during loading

• Data was not available

• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR).

• Poor data science. People and projects make mistakes during the data science process.

 

8.5 COMMON FEATURE EXTRACTION TECHNIQUES

The following are common feature extraction techniques that help us to enhance an existing data warehouse by applying data science to the data in the warehouse.

 

8.5.1 Binning:

The binning technique is used to reduce the complexity of data sets and to enable the data scientist to evaluate the data with an organized grouping technique.

Binning is a good way for you to turn continuous data into a data set that has specific features that you can evaluate for patterns. For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five years together).

import numpy

data = numpy.random.random(100)

bins = numpy.linspace(0, 1, 10)

digitized = numpy.digitize(data, bins)

bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]

print(bin_means)

#The second is to use the histogram function.

bin_means2 = (numpy.histogram(data, bins, weights=data)[0] /

numpy.histogram(data, bins)[0])

print(bin_means2)

 

8.5.2 Averaging:

The use of averaging enables you to reduce the amount of records you require to report any activity that demands a more indicative, rather than a precise, total.

Example:

Create a model that enables you to calculate the average position for ten sample points. First, set up the ecosystem.

import numpy as np

import pandas as pd

#Create two series to model the latitude and longitude ranges.

LatitudeData = pd.Series(np.array(range(-90,91,1)))

LongitudeData = pd.Series(np.array(range(-180,181,1)))

#Select 10 samples for each range:
LatitudeSet=LatitudeData.sample(10)

LongitudeSet=LongitudeData.sample(10)

#Calculate the average of each data set

LatitudeAverage = np.average(LatitudeSet)

LongitudeAverage = np.average(LongitudeSet)

#See the results

print('Latitude')

print(LatitudeSet)

print('Latitude (Avg):',LatitudeAverage)

print('##############')

print('Longitude')

print(LongitudeSet)

print('Longitude (Avg):', LongitudeAverage)

 


 

8.6 HYPOTHESIS TESTING

Hypothesis testing must be known to any data scientist. You cannot progress until you have thoroughly mastered this technique. Hypothesis testing is a statistical test to check whether a hypothesis is true based on the available data. Based on the testing, data scientists choose to accept or reject (not accept) the hypothesis. Hypothesis testing is necessary to check whether an event is an important occurrence or just happenstance: when an event occurs, it can be part of a trend, or it can have occurred at random.

 

8.6.1 T-Test:

 

The t-test is one of many tests used for hypothesis testing in statistics. A t-test is a popular statistical test for making inferences about single means, or about two means or variances, to check whether two groups' means are statistically different from each other when n (the sample size) < 30 and the standard deviation is unknown.

 

The one-sample t-test determines whether the sample mean is statistically different from a known or hypothesised population mean. The one-sample t-test is a parametric test.

 

H0: The mean age of the given sample is 30.

H1: The mean age of the given sample is not 30.

#pip3 install scipy

#pip3 install numpy

from scipy.stats import ttest_1samp

import numpy as np

ages = np.genfromtxt('ages.csv')

print(ages)

ages_mean = np.mean(ages)

print("Mean age:",ages_mean)

print("Test 1: m=30")

tset, pval = ttest_1samp(ages, 30)

print('p-values - ',pval)

if pval< 0.05:

           print("we reject null hypothesis")

else:

          print("we fail to reject null hypothesis")

 


8.6.2 Chi-Square Test:

A chi-square (χ²) test is used to check whether two categorical variables are independent of, or associated with, each other.

import numpy as np

import pandas as pd

import scipy.stats as stats

np.random.seed(10)

stud_grade = np.random.choice(a=["O","A","B","C","D"],

p=[0.20, 0.20 ,0.20, 0.20, 0.20], size=100)

stud_gen = np.random.choice(a=["Male","Female"], p=[0.5, 0.5],

size=100)

mscpart1 = pd.DataFrame({"Grades":stud_grade, "Gender":stud_gen})

print(mscpart1)

stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)

# pd.crosstab sorts labels alphabetically, so relabel in that order
stud_tab.columns = ["Female", "Male", "row_totals"]
stud_tab.index = ["A", "B", "C", "D", "O", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2]

print(observed)

expected = np.outer(stud_tab["row_totals"][0:5],

stud_tab.loc["col_totals"][0:2]) / 100

print(expected)

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print('Calculated : ',chi_squared_stat)

crit = stats.chi2.ppf(q=0.95, df=4)

print('Table Value : ',crit)

# Reject H0 (independence) when the calculated statistic exceeds the critical value
if chi_squared_stat >= crit:
    print('H0 is Rejected (Grades and Gender are associated)')
else:
    print('H0 is Accepted (no evidence of association)')

 

8.7 OVERFITTING & UNDERFITTING

Overfitting and underfitting are the major problems faced by data scientists when they retrieve data insights from the training data sets they are using. They refer to deficiencies from which the model's performance might suffer.

Overfitting occurs when the model or the algorithm fits the data too well. When a model is trained with so much data, it starts learning from the noise and inaccurate data entries in the data set. The model is then unable to categorize the data correctly, because of too much detail and noise.

Underfitting occurs when the model or the algorithm cannot capture the underlying trend of the data; intuitively, it does not fit the data well enough. It is often the result of an excessively simple model, and it destroys the accuracy of our model.

 


8.7.1 Polynomial Features:

The polynomic formula is the following:

(a1x + b1)(a2x + b2) = a1a2x^2 + (a1b2 + a2b1)x + b1b2.

The polynomial feature extraction can use a chain of polynomic formulas to create a hyperplane that will subdivide any data sets into the correct cluster groups. The higher the polynomic complexity, the more precise the result that can be achieved.

Example:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge

from sklearn.preprocessing import PolynomialFeatures

from sklearn.pipeline import make_pipeline

 

def f(x):

            """ function to approximate by polynomial interpolation"""

           return x * np.sin(x)

# generate points used to plot

x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them

x = np.linspace(0, 10, 100)

rng = np.random.RandomState(0)

rng.shuffle(x)

x = np.sort(x[:20])

y = f(x)

# create matrix versions of these arrays

X = x[:,np.newaxis]

X_plot = x_plot[:,np.newaxis]

colors = ['teal', 'yellowgreen', 'gold']

lw = 2

plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw,

label="Ground Truth")

plt.scatter(x, y, color='navy', s=30, marker='o', label="training points")

for count, degree in enumerate([3, 4, 5]):

                           model = make_pipeline(PolynomialFeatures(degree), Ridge())

                           model.fit(X, y)

                           y_plot = model.predict(X_plot)

                           plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw,

label="Degree %d" % degree)

plt.legend(loc='lower left')

plt.show()

 

8.7.2 Common Data-Fitting Issue:

These higher order polynomic formulas are, however, more prone to overfitting, while lower order formulas are more likely to underfit. It is a delicate balance between two extremes that support good data science.

Example:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_val_score

 

def true_fun(X):

                return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30

degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))

y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))

for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))

plt.show()

 

8.8 PRECISION-RECALL

Precision-recall is a useful measure of prediction success when the classes are extremely imbalanced. In information retrieval,

• Precision is a measure of result relevancy.

• Recall is a measure of how many truly relevant results are returned.

 

8.8.1 Precision-Recall Curve:

The precision-recall curve shows the trade-off between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

A system with high recalls but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labelled correctly.

Precision (P) is defined as the number of true positives (Tp) over the number of true positives (Tp) plus the number of false positives (Fp); that is, P = Tp / (Tp + Fp). Recall (R) is the number of true positives over the number of true positives plus the number of false negatives (Fn); that is, R = Tp / (Tp + Fn).

 


8.8.2 Sensitivity & Specificity:

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity (also called the true positive rate, the recall, or probability of detection) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition). Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

 

8.8.3 F1-Measure:

The F1-score is a measure that combines precision and recall as their harmonic mean: F1 = 2 * (precision * recall) / (precision + recall).

 

Note: The precision may not decrease with recall.

The following sklearn functions are useful when calculating these measures (a short sketch follows the list):

• sklearn.metrics.average_precision_score

• sklearn.metrics.recall_score

• sklearn.metrics.precision_score

• sklearn.metrics.f1_score
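As a minimal sketch of these functions (the true labels, predicted labels, and scores below are made up for illustration):

from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                      # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                      # hypothetical predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]     # hypothetical classifier scores

print('Precision :', precision_score(y_true, y_pred))
print('Recall    :', recall_score(y_true, y_pred))
print('F1        :', f1_score(y_true, y_pred))
print('Average precision:', average_precision_score(y_true, y_score))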

 

8.8.4 Receiver Operating Characteristic (ROC) Analysis Curves:

A receiver operating characteristic (ROC) analysis curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection.

You will find the ROC analysis curves useful for evaluating whether your classification or feature engineering is good enough to determine the value of the insights you are finding. This helps with repeatable results against a real-world data set. So, if you suggest that your customers should take a specific action as a result of your findings, ROC analysis curves will support your advice and insights, and also convey the quality of the insights at the given parameters.
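A minimal sketch using scikit-learn's roc_curve and roc_auc_score (the labels and scores are again made up for illustration):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                      # hypothetical labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]     # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('Area under the ROC curve:', roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc='lower right')
plt.show()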

 

8.9 CROSS-VALIDATION TEST

Cross-validation is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mostly used in settings where the goal is the prediction. Knowing how to calculate a test such as this enables you to validate the application of your model on real-world, i.e., independent data sets.

 

Example:

import numpy as np

from sklearn.model_selection import cross_val_score

from sklearn import datasets, svm

import matplotlib.pyplot as plt

 

digits = datasets.load_digits()

X = digits.data

y = digits.target

 

Let’s pick three different kernels and compare how they will perform.

 

kernels=['linear', 'poly', 'rbf']

for kernel in kernels:
    svc = svm.SVC(kernel=kernel)
    C_s = np.logspace(-15, 0, 15)
    scores = list()
    scores_std = list()
    for C in C_s:
        svc.C = C
        this_scores = cross_val_score(svc, X, y, n_jobs=1)
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))

 

Now plot the results. The block below remains indented inside the outer kernel loop, so one figure is produced per kernel.

 

Title="Kernel:>" + kernel

fig=plt.figure(1, figsize=(4.2, 6))

plt.clf()

fig.suptitle(Title, fontsize=20)

plt.semilogx(C_s, scores)

plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')

plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')

locs, labels = plt.yticks()

plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))

plt.ylabel('Cross-Validation Score')

plt.xlabel('Parameter C')

plt.ylim(0, 1.1)

plt.show()

 

Well done. You can now perform cross-validation of your results.

 

8.10 UNIVARIATE ANALYSIS

This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships, and the main purpose of the analysis is to describe the data and find the patterns that exist within it. An example of univariate data is height.

 


Suppose that the heights of seven students in a class are recorded; there is only one variable, height, and it does not deal with any cause or relationship. The description of patterns found in this type of data can be made by drawing conclusions using central tendency measures (mean, median, and mode), the dispersion or spread of the data (range, minimum, maximum, quartiles, variance, and standard deviation), and by using frequency distribution tables, histograms, pie charts, frequency polygons, and bar charts.
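A minimal sketch of these summary measures for seven illustrative heights (the values are invented):

import pandas as pd

Heights = pd.Series([152, 158, 160, 160, 165, 171, 174])   # hypothetical heights in cm

print('Mean   :', Heights.mean())
print('Median :', Heights.median())
print('Mode   :', Heights.mode().tolist())
print('Range  :', Heights.max() - Heights.min())
print('Std Dev:', Heights.std())
print(Heights.quantile([0.25, 0.50, 0.75]))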

 

8.11 BIVARIATE ANALYSIS

This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.


Suppose that temperature and ice cream sales are the two variables of a bivariate data set. Here, the relationship is visible from the data: temperature and sales are directly proportional to each other, and thus related, because as the temperature increases, the sales also increase. Thus, bivariate data analysis involves comparisons, relationships, causes, and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, and one of these variables is independent while the other is dependent.
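A minimal sketch of this temperature-versus-sales relationship (the figures are invented for illustration), using the correlation coefficient and a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical summer observations
Sales = pd.DataFrame({'Temperature': [20, 22, 25, 27, 30, 32, 35],
                      'IceCreamSales': [180, 210, 260, 290, 350, 400, 460]})

print(Sales.corr())      # correlation matrix of the two variables
Sales.plot.scatter(x='Temperature', y='IceCreamSales')
plt.show()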

 

8.12 MULTIVARIATE ANALYSIS

When the data involves three or more variables, it is categorized as multivariate. As an example of this type of data, suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and relationships between the variables could then be examined.

It is similar to bivariate analysis but contains more than one dependent variable. The way to perform analysis on this data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis, and multivariate analysis of variance (MANOVA).

 

8.13 LINEAR REGRESSION

Linear regression is a statistical modelling technique that endeavours to model the relationship between an explanatory variable and a dependent variable, by fitting the observed data points on a linear equation, for example, modelling the body mass index (BMI) of individuals by using their weight.

Linear regression is often used in business, government, and other scenarios. Some common practical applications of linear regression in the real world include the following:

• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home's living area. Such a model helps set or evaluate the list price of a home on the market. The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.

• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. For example, restaurant chains can appropriately prepare for the predicted type and quantity of food that customers will consume based upon the weather, the day of the week, whether an item is offered as a special, the time of day, and the reservation volume. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.

• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumour sizes. Input variables might include duration of a single radiation treatment, frequency of radiation treatment, and patient attributes such as age or weight.

 

8.13.1 Simple Linear Regression:

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables. A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form (without error):

Y = a + bX,

Where, X = explanatory variable

Y = dependent variable

b = slope of the line

a = intercept (the value of y when x = 0)

A linear regression model (that is, with an error term) can be expressed as Y = a + bX + e, where e is the random error (residual) term.
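A minimal sketch of fitting such a line with scikit-learn, where invented heights play the role of X and invented weights the role of Y:

import numpy as np
from sklearn.linear_model import LinearRegression

HeightCM = np.array([150, 160, 165, 170, 175, 180, 185]).reshape(-1, 1)   # hypothetical X
WeightKG = np.array([52, 58, 62, 66, 71, 77, 82])                         # hypothetical Y

model = LinearRegression().fit(HeightCM, WeightKG)
print('Slope b    :', model.coef_[0])
print('Intercept a:', model.intercept_)
print('Predicted weight at 172 cm:', model.predict([[172]])[0])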

 

8.13.2 RANSAC Linear Regression:

RANSAC is an acronym for RANdom SAmple Consensus. The algorithm fits a regression model on the subset of data it judges to be inliers while removing outliers. This naturally improves the fit of the model, due to the removal of some data points. An advantage of RANSAC is its ability to do robust estimation of the model parameters, i.e., it can estimate the parameters with a high degree of accuracy even when a significant number of outliers are present in the data set. Because it is so robust, the process will find a solution.

 

The process used to determine inliers and outliers is described below, followed by a short code sketch.

1. The algorithm randomly selects a subset of the samples to be treated as inliers for the model.

2. All data is used to fit the model, and samples that fall within a certain tolerance are relabelled as inliers.

3. The model is refitted with the new inliers.

4. The error of the fitted model versus the inliers is calculated.

5. Terminate, or go back to step 1 if a given criterion on iterations or performance is not met.
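A minimal sketch under the assumption that scikit-learn's RANSACRegressor (with its default linear base estimator) stands in for the steps above; the data and injected outliers are synthetic:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.RandomState(0)
X = np.arange(100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=2.0, size=100)
y[::10] += 100                      # inject some large outliers

ransac = RANSACRegressor(random_state=0)
ransac.fit(X, y)
print('Estimated slope    :', ransac.estimator_.coef_[0])
print('Estimated intercept:', ransac.estimator_.intercept_)
print('Inliers found      :', ransac.inlier_mask_.sum())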

 

8.13.3 Hough Transform:

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes, by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

With the help of the Hough transform, this regression improves the resolution of the RANSAC technique, which is extremely useful in robotics and robot vision, where the robot requires the regression of the changes between two data frames or data sets in order to move through an environment.

 

8.14 LOGISTIC REGRESSION

In linear regression modelling, the outcome variable is a continuous variable. When the outcome variable is categorical in nature, logistic regression can be used to predict the likelihood of an outcome based on the input variables. Although logistic regression can be applied to an outcome variable that represents multiple values, we will examine the case in which the outcome variable represents two values, such as true/false, pass/fail, or yes/no.

For example, a logistic regression model can be built to determine if a person will or will not purchase a new automobile in the next 12 months. The training set could include input variables for a person's age, income, and gender as well as the age of an existing automobile. The training set would also include the outcome variable on whether the person purchased a new automobile over a 12-month period. The logistic regression model provides the likelihood or probability of a person making a purchase in the next 12 months.

The logistic regression model is applied to a variety of situations in both the public and the private sector. Some common ways that the logistic regression model is used include the following:

• Medical: Develop a model to determine the likelihood of a patient's successful response to a specific medical treatment or procedure. Input variables could include age, weight, blood pressure, and cholesterol levels.

• Finance: Using a loan applicant's credit history and the details on the loan, determine the probability that an applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.

• Marketing: Determine a wireless customer's probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts. With such insight, target the high-probability customers with appropriate offers to prevent churn.

• Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.

 

8.14.1 Simple Logistic Regression:

 

Simple logistic regression can be used when you have one nominal variable with two values (male/female, dead/alive, etc.) and one measurement variable. The nominal variable is the dependent variable, and the measurement variable is the independent variable. Logistic regression is also known as logit regression or the logit model. Logistic regression works with binary data, where either the event happens (1) or the event does not happen (0).


Simple logistic regression is analogous to linear regression, except that the dependent variable is nominal, not a measurement. One goal is to see whether the probability of getting a particular value of the nominal variable is associated with the measurement variable; the other goal is to predict the probability of getting a particular value of the nominal variable, given the measurement variable.


Logistic regression is based on the logistic function f(y), as given in the equation below: f(y) = e^y / (1 + e^y), where y is a linear combination of the input variables.
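A minimal sketch of the automobile-purchase example with scikit-learn's LogisticRegression (the car ages and outcomes are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical age of the existing automobile (years) and purchase outcome (1 = bought a new car)
CarAge = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
Bought = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(CarAge, Bought)
print('Probability of purchase when the car is 6 years old:',
      clf.predict_proba([[6]])[0, 1])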

 


8.14.2 Multinomial Logistic Regression:

Multinomial logistic regression (often just called 'multinomial regression') is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression to allow for a dependent variable with more than two categories. As with other types of regression, multinomial logistic regression can have nominal and/or continuous independent variables and can have interactions between independent variables to predict the dependent variable. Multinomial Logistic Regression is the regression analysis to conduct when the dependent variable is nominal with more than two levels.

For example, you could use multinomial logistic regression to understand which type of drink consumers prefer based on location in the UK and age (i.e., the dependent variable would be "type of drink", with four categories – Coffee, Soft Drink, Tea and Water – and your independent variables would be the nominal variable, "location in UK", assessed using three categories – London, South UK and North UK – and the continuous variable, "age", measured in years). Alternately, you could use multinomial logistic regression to understand whether factors such as employment duration within the firm, total employment duration, qualifications and gender affect a person's job position (i.e., the dependent variable would be "job position", with three categories – junior management, middle management and senior management – and the independent variables would be the continuous variables, "employment duration within the firm" and "total employment duration", both measured in years, the nominal variables, "qualifications", with four categories – no degree, undergraduate degree, master's degree and PhD – "gender", which has two categories: "males" and "females").

 

8.14.3 Ordinal Logistic Regression:

Ordinal logistic regression (often just called 'ordinal regression') is used to predict an ordinal dependent variable given one or more independent variables. It can be considered as either a generalisation of multiple linear regression or as a generalisation of binomial logistic regression, but this guide will concentrate on the latter. As with other types of regression, ordinal regression can also use interactions between independent variables to predict the dependent variable.

For example, you could use ordinal regression to predict the belief that "tax is too high" (your ordinal dependent variable, measured on a 4- point Likert item from "Strongly Disagree" to "Strongly Agree"), based on two independent variables: "age" and "income". Alternately, you could use ordinal regression to determine whether a number of independent variables, such as "age", "gender", "level of physical activity" (amongst others), predict the ordinal dependent variable, "obesity", where obesity is measured using three ordered categories: "normal", "overweight" and "obese".

 

8.15 CLUSTERING TECHNIQUES

In general, clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabelled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. The structure of the data describes the objects of interest and determines how best to group the objects. Clustering is a method often used for exploratory analysis of the data. In clustering, there are no predictions made. Rather, clustering methods find the similarities between objects according to the object attributes and group the similar objects into clusters. Clustering techniques are utilized in marketing, economics, and various branches of science.

Clustering is often used as a lead-in to classification. Once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Some specific applications of clustering are image processing, medicine, and customer segmentation.

• Image Processing: Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video. For each frame, the task is to determine which pixels are most similar to each other. The attributes of each pixel can include brightness, color, and location, the x and y coordinates in the frame. With security video images, for example, successive frames are examined to identify any changes to the clusters. These newly identified clusters may indicate unauthorized access to a facility.

• Medical: Patient attributes such as age, height, weight, systolic and diastolic blood pressure, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals, as well as in the field of human genetics.

• Customer Segmentation: Marketing and sales groups use k-means to better identify customers who have similar behaviours and spending patterns. For example, a wireless provider may look at the following customer attributes: monthly bill, number of text messages, data volume consumed, minutes used during various daily periods, and years as a customer. The wireless company could then look at the naturally occurring clusters and consider tactics to increase sales or reduce the customer churn rate, the proportion of customers who end their relationship with a particular company.

 

8.15.1 Hierarchical Clustering:

Hierarchical clustering is a method of cluster analysis whereby you build a hierarchy of clusters. This works well for data sets that are complex and have distinct characteristics for separated clusters of data. Also called hierarchical cluster analysis (HCA), it is an unsupervised clustering algorithm that creates clusters with a predominant ordering from top to bottom.

For example: All files and folders on our hard disk are organized in a hierarchy.

The algorithm groups similar objects into groups called clusters. The endpoint is a set of clusters or groups, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

This clustering technique is divided into two types:

1. Agglomerative Hierarchical Clustering

2. Divisive Hierarchical Clustering

 

Agglomerative Hierarchical Clustering:

The Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). It's a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

 

How does it work?

1. Make each data point a single-point cluster → forms N clusters

2. Take the two closest data points and make them one cluster → forms N-1 clusters

3. Take the two closest clusters and make them one cluster → Forms N-2 clusters.

4. Repeat step 3 until you are left with only one cluster (a short sketch follows these steps).
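A minimal sketch with scikit-learn's AgglomerativeClustering on a handful of made-up 2-D points:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 2-D points forming two loose groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
print('Cluster labels:', model.fit_predict(X))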

 

Divisive Hierarchical Clustering:

Divisive clustering, or DIANA (DIvisive ANAlysis), is a top-down clustering method in which we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. This clustering approach is exactly opposite to agglomerative clustering.

 


8.15.2 Partitional Clustering:

A partitional clustering is simply a division of the set of data objects into nonoverlapping subsets (clusters), such that each data object is in exactly one subset. Partitional clustering decomposes a data set into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K (N ≥ K) partitions of the data, with each partition representing a cluster. That is, it classifies the data into K groups by satisfying the following requirements: (1) each group contains at least one point, and (2) each point belongs to exactly one group. Notice that for fuzzy partitioning, a point can belong to more than one group.

Many partitional clustering algorithms try to minimize an objective function. For example, in K-means and K-medoids the function (also referred to as the distortion function) is the sum of squared distances between each point and the centre of its assigned cluster: J = sum over clusters k = 1..K of sum over points xi in cluster Ck of ||xi - ck||^2, where ck is the centre (or medoid) of cluster Ck.
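A minimal K-means sketch on invented customer attributes; scikit-learn exposes the distortion value as inertia_:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [monthly bill, data volume in GB]
X = np.array([[20, 1], [25, 2], [22, 1.5],
              [60, 10], [65, 12], [62, 11],
              [120, 30], [115, 28], [118, 32]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print('Cluster labels :', km.labels_)
print('Cluster centres:', km.cluster_centers_)
print('Distortion (sum of squared distances):', km.inertia_)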

 

8.16 ANOVA

The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst uses the ANOVA test results in an f-test to generate additional data that aligns with the proposed regression models. The ANOVA test allows a comparison of more than two groups at the same time to determine whether a relationship exists between them.

Example: A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of 100 customers each. Each group is different in terms of its demographic attributes. We would like to determine whether these five respond differently to the campaign. This would help us optimize the right campaign for the right demographic group, increase the response rate, and reduce the cost of the campaign.

The analysis of variance works by comparing the variance between the groups to that within the group. The core of this technique lies in assessing whether all the groups are in fact part of one larger population or a completely different population with different characteristics.

The test statistic for ANOVA is F = MST / MSE, where MST is the mean sum of squares due to treatment (the between-group variance) and MSE is the mean sum of squares due to error (the within-group variance).
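A hedged one-way ANOVA sketch with scipy.stats.f_oneway, using invented response rates for three of the demographic groups:

from scipy.stats import f_oneway

# Hypothetical campaign response rates (%) for three groups of customers
GroupA = [12, 15, 14, 16, 13]
GroupB = [22, 25, 24, 23, 26]
GroupC = [13, 14, 12, 15, 16]

f_stat, p_value = f_oneway(GroupA, GroupB, GroupC)
print('F statistic:', f_stat)
print('p-value    :', p_value)
if p_value < 0.05:
    print('At least one group mean differs significantly.')
else:
    print('No significant difference between the group means.')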

 


There are two types of ANOVA: one-way (or unidirectional) and two-way. One-way or two-way refers to the number of independent variables in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole response variable. It determines whether all the samples are the same. The one-way ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independents. For example, a two-way ANOVA allows a company to compare worker productivity based on two independent variables, such as salary and skill set. It is utilized to observe the interaction between the two factors and tests the effect of two factors at the same time.

 

8.17 DECISION TREES

A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions and consequences. Given input X = {x1, x2, ..., xn}, the goal is to predict a response or output variable Y. Each member of the set {x1, x2, ..., xn} is called an input variable. The prediction can be achieved by constructing a decision tree with test points and branches. At each test point, a decision is made to pick a specific branch and traverse down the tree. Eventually, a final point is reached, and a prediction can be made. Because of their flexibility and easy visualization, decision trees are commonly deployed in data mining applications for classification purposes.

The input values of a decision tree can be categorical or continuous. A decision tree employs a structure of test points (called nodes) and branches, which represent the decision being made. A node without further branches is called a leaf node. The leaf nodes return class labels and, in some implementations, they return probability scores. A decision tree can be converted into a set of decision rules. In the following example rule, income and mortgage_amount are input variables, and the response is the output variable default, with a probability score.

IF income < 50,000 AND mortgage_amount > 100K
THEN default = True WITH PROBABILITY 75%
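A minimal scikit-learn sketch (the income and mortgage_amount training values are invented) that fits a small classification tree and prints it back as if-then rules of this kind:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [income, mortgage_amount]; 1 = default, 0 = no default
X = np.array([[30000, 150000], [45000, 120000], [48000, 130000],
              [60000,  90000], [80000, 110000], [90000,  60000]])
y = np.array([1, 1, 1, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=['income', 'mortgage_amount']))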

 

Decision trees have two varieties: classification trees and regression trees. Classification trees usually apply to output variables that are categorical (often binary) in nature, such as yes or no, purchase or not purchase, and so on. Regression trees, on the other hand, can apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood that a subscription will be purchased.

Example:

 



The above figure shows an example of using a decision tree to predict whether customers will buy a product. The term branch refers to the outcome of a decision and is visualized as a line connecting two nodes. If a decision is numerical, the "greater than" branch is usually placed on the right, and the "less than" branch is placed on the left. Depending on the nature of the variable, one of the branches may need to include an "equal to" component.

Internal nodes are the decision or test points. Each internal node refers to an input variable or an attribute. The top internal node is called the root. The decision tree in the above figure is a binary tree in that each internal node has no more than two branches. The branching of a node is referred to as a split.

The depth of a node is the minimum number of steps required to reach the node from the root. In above figure for example, nodes Income and Age have a depth of one, and the four nodes on the bottom of the tree have a depth of two. Leaf nodes are at the end of the last branches on the tree. They represent class labels—the outcome of all the prior decisions. The path from the root to a leaf node contains a series of decisions made at various internal nodes.

The decision tree in the above figure shows that females with income less than or equal to $45,000 and males 40 years old or younger are classified as people who would purchase the product. In traversing this tree, age does not matter for females, and income does not matter for males.

 

Where decision tree is used?

• Decision trees are widely used in practice.

• To classify animals, questions (like cold-blooded or warm-blooded, mammal or not mammal) are answered to arrive at a certain classification.

• A checklist of symptoms during a doctor's evaluation of a patient.

• The artificial intelligence engine of a video game commonly uses decision trees to control the autonomous actions of a character in response to various scenarios.

• Retailers can use decision trees to segment customers or predict response rates to marketing and promotions.

• Financial institutions can use decision trees to help decide if a loan application should be approved or denied. In the case of loan approval, computers can use the logical if-then statements to predict whether the customer will default on the loan.

 

8.18 UNIT END QUESTIONS

1. Explain the transform superstep.

2. Explain the Sun model for TPOLE.

3. Explain Person-to-Time Sun Model.

4. Explain Person-to-Object Sun Model.

5. Why does data have missing values? Why do missing values need treatment? What methods treat missing values?

6. What is feature engineering? What are the common feature extraction techniques?

7. What is Binning? Explain with example.

8. Explain averaging and Latent Dirichlet Allocation with respect to the transform step of data science.

9. Explain hypothesis testing, t-test and chi-square test with respect to data science.

10. Explain overfitting and underfitting. Discuss the common fitting issues.

11. Explain precision recall, precision recall curve, sensitivity, specificity and F1 measure.

12. Explain Univariate Analysis.

13. Explain Bivariate Analysis.

14. What is Linear Regression? Give some common application of linear regression in the real world.

15. What is Simple Linear Regression? Explain.

16. Write a note on RANSAC Linear Regression.

17. Write a note on Logistic Regression.

18. Write a note on Simple Logistic Regression.

19. Write a note on Multinomial Logistic Regression.

20. Write a note on Ordinal Logistic Regression.

21. Explain clustering techniques.

22. Explain Receiver Operating Characteristic (ROC) Analysis Curves and cross validation test.

23. Write a note on ANOVA.

24. Write a note on Decision Trees.

 
