Industries Needs: March 2022

Tuesday, March 22, 2022

Data Science and Big Data Analytics


Discovering, Analyzing, Visualizing and Presenting Data


Advanced Analytical Theory and Methods: Classification

 

Key Concepts

1. Classification learning

2. Naïve Bayes

3. Decision tree

4. ROC curve

5. Confusion matrix

In addition to analytical methods such as clustering (Chapter 4, “Advanced Analytical Theory and Methods: Clustering”), association rule learning (Chapter 5, “Advanced Analytical Theory and Methods: Association Rules”), and modeling techniques like regression (Chapter 6, “Advanced Analytical Theory and Methods: Regression”), classification is another fundamental learning method that appears in applications related to data mining. In classification learning, a classifier is presented with a set of examples that are already classified and, from these examples, the classifier learns how to assign class labels to unseen examples. In other words, the primary task performed by classifiers is to assign class labels to new observations. Logistic regression from the previous chapter is one of the popular classification methods. The set of labels for classifiers is predetermined, unlike in clustering, which discovers the structure without a training set and allows the data scientist optionally to create and assign labels to the clusters.

Most classification methods are supervised, in that they start with a training set of prelabeled observations to learn how likely the attributes of these observations may contribute to the classification of future unlabeled observations. For example, existing marketing, sales, and customer demographic data can be used to develop a classifier to assign a “purchase” or “no purchase” label to potential future customers.

Classification is widely used for prediction purposes. For example, by building a classifier on the transcripts of United States Congressional floor debates, it can be determined whether the speeches represent support or opposition to proposed legislation [1]. Classification can help health care professionals diagnose heart disease patients [2]. Based on an e-mail’s content, e-mail providers also use classification to decide whether the incoming e-mail messages are spam [3].

This chapter mainly focuses on two fundamental classification methods: decision trees and naïve Bayes.

 

7.1 Decision Trees

A decision tree (also called prediction tree) uses a tree structure to specify sequences of decisions and consequences. Given input X= (x1, x2,…xn), the goal is to predict a response or output variable Y. Each member of the set (x1, x2,…xn) is called an input variable. The prediction can be achieved by constructing a decision tree with test points and branches. At each test point, a decision is made to pick a specific branch and traverse down the tree. Eventually, a final point is reached, and a prediction can be made. Each test point in a decision tree involves testing a particular input variable (or attribute), and each branch represents the decision being made. Due to their flexibility and ease of visualization, decision trees are commonly deployed in data mining applications for classification purposes.

The input values of a decision tree can be categorical or continuous. A decision tree employs a structure of test points (called nodes) and branches, which represent the decision being made. A node without further branches is called a leaf node. The leaf nodes return class labels and, in some implementations, they return the probability scores. A decision tree can be converted into a set of decision rules. In the following example rule, income and mortgage_amount are input variables, and the response is the output variable default with a probability score.

IF income < $50,000 AND mortgage_amount > $100K

THEN default = True WITH PROBABILITY 75%

Decision trees have two varieties: classification trees and regression trees. Classification trees usually apply to output variables that are categorical—often binary—in nature, such as yes or no, purchase or not purchase, and so on. Regression trees, on the other hand, can apply to output variables that are numeric or continuous, such as the predicted price of a consumer good or the likelihood a subscription will be purchased.

Decision trees can be applied to a variety of situations. They can be easily represented in a visual way, and the corresponding decision rules are quite straightforward. Additionally, because the result is a series of logical if-then statements, there is no underlying assumption of a linear (or nonlinear) relationship between the input variables and the response variable.

7.1.1 Overview of a Decision Tree

Figure 7.1 shows an example of using a decision tree to predict whether customers will buy a product. The term branch refers to the outcome of a decision and is visualized as a line connecting two nodes. If a decision is numerical, the “greater than” branch is usually placed on the right, and the “less than” branch is placed on the left. Depending on the nature of the variable, one of the branches may need to include an “equal to” component.

 


Figure 7.1 Example of a decision tree

Internal nodes are the decision or test points. Each internal node refers to an input variable or an attribute. The top internal node is called the root. The decision tree in Figure 7.1 is a binary tree in that each internal node has no more than two branches. The branching of a node is referred to as a split.

Sometimes decision trees may have more than two branches stemming from a node. For example, if an input variable Weather is categorical and has three choices—Sunny, Rainy, and Snowy—the corresponding node Weather in the decision tree may have three branches labeled as Sunny, Rainy, and Snowy, respectively.

The depth of a node is the minimum number of steps required to reach the node from the root. In Figure 7.1 for example, nodes Income and Age have a depth of one, and the four nodes on the bottom of the tree have a depth of two.

Leaf nodes are at the end of the last branches on the tree. They represent class labels—the outcome of all the prior decisions. The path from the root to a leaf node contains a series of decisions made at various internal nodes.

In Figure 7.1, the root node splits into two branches with a Gender test. The right branch contains all those records with the variable Gender equal to Male, and the left branch contains all those records with the variable Gender equal to Female to create the depth 1 internal nodes. Each internal node effectively acts as the root of a subtree, and a best test for each node is determined independently of the other internal nodes. The left-hand side (LHS) internal node splits on a question based on the Income variable to create leaf nodes at depth 2, whereas the right-hand side (RHS) splits on a question on the Age variable.

The decision tree in Figure 7.1 shows that females with income less than or equal to $45,000 and males 40 years old or younger are classified as people who would purchase the product. In traversing this tree, age does not matter for females, and income does not matter for males.

Decision trees are widely used in practice. For example, to classify animals, questions (like cold-blooded or warm-blooded, mammal or not mammal) are answered to arrive at a certain classification. Another example is a checklist of symptoms during a doctor’s evaluation of a patient. The artificial intelligence engine of a video game commonly uses decision trees to control the autonomous actions of a character in response to various scenarios. Retailers can use decision trees to segment customers or predict response rates to marketing and promotions. Financial institutions can use decision trees to help decide if a loan application should be approved or denied. In the case of loan approval, computers can use the logical if-then statements to predict whether the customer will default on the loan. For customers with a clear (strong) outcome, no human interaction is required; for observations that may not generate a clear response, a human is needed for the decision.

By limiting the number of splits, a short tree can be created. Short trees are often used as components (also called weak learners or base learners) in ensemble methods. Ensemble methods use multiple predictive models to vote, and decisions can be made based on the combination of the votes. Some popular ensemble methods include random forest [4], bagging, and boosting [5]. Section 7.4 discusses these ensemble methods more.

The simplest short tree is called a decision stump, which is a decision tree with the root immediately connected to the leaf nodes. A decision stump makes a prediction based on the value of just a single input variable. Figure 7.2 shows a decision stump to classify two species of an iris flower based on the petal width. The figure shows that, if the petal width is smaller than 1.75 centimeters, it’s Iris versicolor; otherwise, it’s Iris virginica.

 


Figure 7.2 Example of a decision stump
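
As a hedged illustration (not part of the original text), a stump like the one in Figure 7.2 can be grown with the rpart package on R's built-in iris data, restricted to the two species in the figure; the 1.75 cm split is the one rpart typically finds for these data, though the exact output depends on the data and settings.

library(rpart)

# Decision stump: a single split on Petal.Width for two iris species (illustrative sketch)
iris2 <- droplevels(subset(iris, Species != "setosa"))
stump <- rpart(Species ~ Petal.Width, data = iris2,
               method = "class",
               control = rpart.control(maxdepth = 1))
print(stump)   # expected split near Petal.Width < 1.75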

To illustrate how a decision tree works, consider the case of a bank that wants to market its term deposit products (such as Certificates of Deposit) to the appropriate customers. Given the demographics of clients and their reactions to previous campaign phone calls, the bank’s goal is to predict which clients would subscribe to a term deposit. The dataset used here is based on the original dataset collected from a Portuguese bank on directed marketing campaigns as stated in the work by Moro et al. [6]. Figure 7.3 shows a subset of the modified bank marketing dataset. This dataset includes 2,000 instances randomly drawn from the original dataset, and each instance corresponds to a customer. To make the example simple, the subset only keeps the following categorical variables: (1) job, (2) marital status, (3) education level, (4) if the credit is in default, (5) if there is a housing loan, (6) if the customer currently has a personal loan, (7) contact type, (8) result of the previous marketing campaign contact (poutcome), and finally (9) if the client actually subscribed to the term deposit. Attributes (1) through (8) are input variables, and (9) is considered the outcome. The outcome subscribed is either yes (meaning the customer will subscribe to the term deposit) or no (meaning the customer won’t subscribe). All the variables listed earlier are categorical.

 


Figure 7.3 A subset of the bank marketing dataset

A summary of the dataset shows the following statistics. For ease of display, the summary only includes the top six most frequently occurring values for each attribute. The rest are displayed as (Other).

        job           marital          education      default
 blue-collar:435   divorced: 228   primary  : 335   no :1961
 management :423   married :1201   secondary:1010   yes:  39
 technician :339   single  : 571   tertiary : 564
 admin.     :235   unknown :  91
 services   :168
 retired    : 92
 (Other)    :308

   housing       loan          contact          month        poutcome
 no : 916    no :1717    cellular :1287    may    :581    failure: 210
 yes:1084    yes: 283    telephone: 136    jul    :340    other  :  79
                         unknown  : 577    aug    :278    success:  58
                                           jun    :232    unknown:1653
                                           nov    :183
                                           apr    :118
                                           (Other):268

 subscribed
 no :1789
 yes: 211

 

Attribute job includes the following values.

        admin.   blue-collar   entrepreneur   housemaid
           235           435             70          63
    management       retired  self-employed    services
           423            92             69         168
       student    technician     unemployed     unknown
            36           339             60          10

Figure 7.4 shows a decision tree built over the bank marketing dataset. The root of the tree shows that the overall fraction of the clients who have not subscribed to the term deposit is 1,789 out of the total population of 2,000.

 


Figure 7.4 Using a decision tree to predict if a client will subscribe to a term deposit

At each split, the decision tree algorithm picks the most informative attribute out of the remaining attributes. The extent to which an attribute is informative is determined by measures such as entropy and information gain, as detailed in Section 7.1.2.

At the first split, the decision tree algorithm chooses the poutcome attribute. There are two nodes at depth=1. The left node is a leaf node representing a group for which the outcome of the previous marketing campaign contact is a failure, other, or unknown. For this group, 1,763 out of 1,942 clients have not subscribed to the term deposit.

The right node represents the rest of the population, for which the outcome of the previous marketing campaign contact is a success. For the population of this node, 32 out of 58 clients have subscribed to the term deposit.

This node further splits into two nodes based on the education level. If the education level is either secondary or tertiary, then 26 out of 50 of the clients have not subscribed to the term deposit. If the education level is primary or unknown, then 8 out of 8 times the clients have subscribed.

The left node at depth 2 further splits based on attribute job. If the occupation is admin, blue collar, management, retired, services, or technician, then 26 out of 45 clients have not subscribed. If the occupation is self-employed, student, or unemployed, then 5 out of 5 times the clients have subscribed.

7.1.2 The General Algorithm

The general algorithm grows a decision tree by recursively choosing the most informative attribute from the remaining attributes and splitting the training records on it. Tree construction stops when any of the following criteria is met:

• All the leaf nodes in the tree satisfy the minimum purity threshold.

• The tree cannot be further split with the preset minimum purity threshold.

• Any other stopping criterion is satisfied (such as the maximum depth of the tree).

The first step in constructing a decision tree is to choose the most informative attribute. A common way to identify the most informative attribute is to use entropy-based methods, which are used by decision tree learning algorithms such as ID3 (or Iterative Dichotomiser 3) [7] and C4.5 [8]. The entropy methods select the most informative attribute based on two basic measures:

• Entropy, which measures the impurity of an attribute

• Information gain, which measures the purity of an attribute

 
Given a class X and its labels x ∈ X, let P(x) be the probability of each label. The entropy of X, denoted Hx, is defined as shown in Equation 7.1.

7.1   Hx = −Σx P(x) log2 P(x)

As an example of a binary random variable, consider tossing a coin with known, not necessarily fair, probabilities of coming up heads or tails. The corresponding entropy graph is shown in Figure 7.5. Let x=1 represent heads and x=0 represent tails. The entropy of the unknown result of the next toss is maximized when the coin is fair. That is, when heads and tails have equal probability P(x=1)=P(x=0)=0.5, entropy Hx = −(0.5 × log2 0.5 + 0.5 × log2 0.5) = 1. On the other hand, if the coin is not fair, the probabilities of heads and tails would not be equal and there would be less uncertainty. As an extreme case, when the probability of tossing a head is equal to 0 or 1, the entropy is minimized to 0. Therefore, the entropy for a completely pure variable is 0 and is 1 for a set with equal occurrences for both the classes (head and tail, or yes and no).

 


Figure 7.5 Entropy of coin flips, where X=1 represents heads
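
As a small illustrative aside (not from the original text), a curve like the one in Figure 7.5 can be reproduced with a few lines of base R:

# Entropy of a biased coin as a function of P(X = 1) (illustrative sketch)
p <- seq(0.001, 0.999, by = 0.001)
H <- -(p * log2(p) + (1 - p) * log2(1 - p))
plot(p, H, type = "l", xlab = "P(X = 1)", ylab = "Entropy")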

 
Given an attribute X with values x and an output variable Y with values y, the conditional entropy HY|X is the remaining entropy of Y given X, as shown in Equation 7.2.

7.2   HY|X = Σx P(x) HY|X=x = −Σx P(x) Σy P(y|x) log2 P(y|x)

Consider the bank marketing scenario: if the attribute contact is chosen, X = {cellular, telephone, unknown}. The conditional entropy of contact considers all three values.

Table 7.1 lists the probabilities related to the contact attribute. The top row of the table displays the probabilities of each value of the attribute. The next two rows contain the probabilities of the class labels conditioned on the contact.

Table 7.1 Conditional Entropy Example

 


Information gain compares the degree of purity of the parent node before a split with the degree of purity of the child node after a split. At each split, an attribute with the greatest information gain is considered the most informative attribute. Information gain indicates the purity of an attribute.

The result of information gain for all the input variables is shown in Table 7.2. Attribute poutcome has the most information gain and is the most informative variable. Therefore, poutcome is chosen for the first split of the decision tree, as shown in Figure 7.4. The values of information gain in Table 7.2 are small in magnitude, but the relative difference matters. The algorithm splits on the attribute with the largest information gain at each round.

Table 7.2 Calculating Information Gain of Input Variables for the First Split

 


Detecting Significant Splits

Quite often it is necessary to measure the significance of a split in a decision tree, especially when the information gain is small, like in Table 7.2.

Let NA and NB be the number of class A and class B records in the parent node. Let NAL represent the number of class A records going to the left child node, NBL represent the number of class B records going to the left child node, NAR represent the number of class A records going to the right child node, and NBR represent the number of class B records going to the right child node.

Let PL and PR denote the proportion of data going to the left and right node, respectively.

PL = (NAL + NBL) / (NA + NB)

PR = (NAR + NBR) / (NA + NB)

The following measure computes the significance of a split. In other words, it measures how much the split deviates from what would be expected in the random data.

 


If K is small, the information gain from the split is not significant. If K is big, it would suggest the information gain from the split is significant.

 


After each split, the algorithm looks at all the records at a leaf node, and the information gain of each candidate attribute is calculated again over these records. The next split is on the attribute with the highest information gain. A record can only belong to one leaf node after all the splits, but depending on the implementation, an attribute may appear in more than one split of the tree. This process of partitioning the records and finding the most informative attribute is repeated until the nodes are pure enough, or there is insufficient information gain by splitting on more attributes. Alternatively, one can stop the growth of the tree when all the nodes at a leaf node belong to a certain class (for example, subscribed = yes) or all the records have identical attribute values.
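
The entropy and information gain calculations described above can be sketched in R as follows; this is an illustrative helper rather than code from the text, and it assumes a data frame (here called bankdata) with categorical columns such as poutcome and subscribed, as in the bank marketing example.

# Entropy of a vector of class labels: H = -sum p * log2(p)
entropy <- function(labels) {
  p <- table(labels) / length(labels)          # class proportions
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# Information gain of splitting on one categorical attribute
info_gain <- function(data, attribute, target) {
  base <- entropy(data[[target]])              # entropy before the split
  splits <- split(data[[target]], data[[attribute]])
  weights <- sapply(splits, length) / nrow(data)
  cond <- sum(weights * sapply(splits, entropy))   # conditional entropy
  base - cond
}

# Example usage (assumes bankdata has been loaded):
# info_gain(bankdata, "poutcome", "subscribed")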

In the previous bank marketing example, to keep it simple, the dataset only includes categorical variables. Assume the dataset now includes a continuous variable called duration, representing the number of seconds the last phone conversation with the bank lasted as part of the previous marketing campaign. A continuous variable needs to be divided into a disjoint set of regions with the highest information gain. A brute-force method is to consider every value of the continuous variable in the training data as a candidate split position. This brute-force method is computationally inefficient. To reduce the complexity, the training records can be sorted based on the duration, and the candidate splits can be identified by taking the midpoints between two adjacent sorted values. For example, if the duration consists of sorted values {140, 160, 180, 200}, the candidate splits are 150, 170, and 190.
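
The midpoint idea can be sketched in a few lines of R; this is an illustrative snippet, not code from the text, and duration is assumed to be a numeric vector.

# Candidate split points for a continuous attribute: midpoints of
# adjacent sorted distinct values (illustrative sketch)
duration <- c(180, 140, 200, 160)            # example values from the text
v <- sort(unique(duration))                  # 140 160 180 200
candidates <- (head(v, -1) + tail(v, -1)) / 2
candidates                                   # 150 170 190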

Figure 7.6 shows what the decision tree may look like when considering the duration attribute. The root splits into two partitions: those clients with duration < 456 seconds, and those with duration ≥ 456 seconds. Note that for aesthetic purposes, labels for the job and contact attributes in the figure are abbreviated.

 



Figure 7.6 Decision tree with attribute duration

With the decision tree in Figure 7.6, it becomes trivial to predict if a new client is going to subscribe to the term deposit. For example, given the record of a new client shown in Table 7.3, the prediction is that this client will subscribe to the term deposit. The traversed paths in the decision tree are as follows.

• duration ≥ 456

• contact = cll (cellular)

• duration < 700

• job = ent (entrepreneur), rtr (retired)

Table 7.3 Record of a New Client

 



7.1.3 Decision Tree Algorithms

Multiple algorithms exist to implement decision trees, and the methods of tree construction vary with different algorithms. Some popular algorithms include ID3 [7], C4.5 [8], and CART [9].

ID3 Algorithm

ID3 (or Iterative Dichotomiser 3) [7] is one of the first decision tree algorithms, and it was developed by John Ross Quinlan. Let A be a set of categorical input variables, P be the output variable (or the predicted class), and T be the training set. The ID3 algorithm is shown here.

ID3 (A, P, T)
   if T = ∅
      return ∅
   if all records in T have the same value for P
      return a single node with that value
   if A = ∅
      return a single node with the most frequent value of P in T
   Compute information gain for each attribute in A relative to T
   Pick attribute D with the largest gain
   Let {d1, d2, …, dm} be the values of attribute D
   Partition T into {T1, T2, …, Tm} according to the values of D
   return a tree with root D and branches labeled d1, d2, …, dm
      going respectively to trees ID3(A−{D}, P, T1),
      ID3(A−{D}, P, T2), …, ID3(A−{D}, P, Tm)

C4.5

The C4.5 algorithm [8] introduces a number of improvements over the original ID3 algorithm. The C4.5 algorithm can handle missing data. If the training records contain unknown attribute values, C4.5 evaluates the gain for an attribute by considering only the records where that attribute is defined.

Both categorical and continuous attributes are supported by C4.5. Values of a continuous variable are sorted and partitioned. For the corresponding records of each partition, the gain is calculated, and the partition that maximizes the gain is chosen for the next split.

The ID3 algorithm may construct a deep and complex tree, which would cause overfitting (Section 7.1.4). The C4.5 algorithm addresses the overfitting problem in ID3 by using a bottom-up technique called pruning to simplify the tree by removing the least visited nodes and branches.

CART

CART (or Classification And Regression Trees) [9] is often used as a generic acronym for the decision tree, although it is a specific implementation.

CART can handle continuous attributes. Whereas C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index defined in Equation 7.5.

 
7.5   GiniX = 1 − Σx P(x)²

Whereas C4.5 employs stopping rules, CART constructs a sequence of subtrees, uses cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the lowest cost.


Decision trees use greedy algorithms, in that they always choose the option that seems the best available at that moment. At each step, the algorithm selects which attribute to use for splitting the remaining records. This selection may not be the best overall, but it is guaranteed to be the best at that step. This characteristic reinforces the efficiency of decision trees. However, once a bad split is taken, it is propagated through the rest of the tree. To address this problem, an ensemble technique (such as random forest) may randomize the splitting or even randomize data and come up with a multiple tree structure. These trees then vote for each class, and the class with the most votes is chosen as the predicted class.

7.1.4 Evaluating a Decision Tree

There are a few ways to evaluate a decision tree. First, evaluate whether the splits of the tree make sense. Conduct sanity checks by validating the decision rules with domain experts, and determine if the decision rules are sound.

Next, look at the depth and nodes of the tree. Having too many layers and obtaining nodes with few members might be signs of overfitting. In overfitting, the model fits the training set well, but it performs poorly on the new samples in the testing set. Figure 7.7 illustrates the performance of an overfit model. The x-axis represents the amount of data, and the y-axis represents the errors. The blue curve is the training set, and the red curve is the testing set. The left side of the gray vertical line shows that the model predicts well on the testing set. But on the right side of the gray line, the model performs worse and worse on the testing set as more and more unseen data is introduced.

 


Figure 7.7 An overfit model describes the training data well but predicts poorly on unseen data

For decision tree learning, overfitting can be caused by either the lack of training data or the biased data in the training set. Two approaches [10] can help avoid overfitting in decision tree learning.

• Stop growing the tree early before it reaches the point where all the training data is perfectly classified.

• Grow the full tree, and then post-prune the tree with methods such as reduced-error pruning and rule-based post-pruning, as sketched in the R example after this list.
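
As a hedged sketch of the grow-then-prune approach, the following lines use the rpart package (introduced in Section 7.1.5) to grow a deliberately large tree and prune it back with the complexity parameter; the object names full_tree and training_data are hypothetical.

library(rpart)

# Grow a deliberately large tree (training_data is a hypothetical data frame)
full_tree <- rpart(subscribed ~ ., data = training_data, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# Pick the complexity parameter with the lowest cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]

# Post-prune the tree back to that complexity
pruned_tree <- prune(full_tree, cp = best_cp)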

Last, many standard diagnostics tools that apply to classifiers can help evaluate overfitting. These tools are further discussed in Section 7.3.

Decision trees are computationally inexpensive, and it is easy to classify the data. The outputs are easy to interpret as a fixed sequence of simple tests. Many decision tree algorithms are able to show the importance of each input variable. Basic measures, such as information gain, are provided by most statistical software packages.

Decision trees are able to handle both numerical and categorical attributes and are robust with redundant or correlated variables. Decision trees can handle categorical attributes with many distinct values, such as country codes for telephone numbers. Decision trees can also handle variables that have a nonlinear effect on the outcome, so they work better than linear models (for example, linear regression and logistic regression) for highly nonlinear problems. Decision trees naturally handle variable interactions. Every node in the tree depends on the preceding nodes in the tree.

In a decision tree, the decision regions are rectangular surfaces. Figure 7.8 shows an example of five rectangular decision surfaces (A, B, C, D, and E) defined by four values (λ1, λ2, λ3, λ4) of two attributes (x1 and x2). The corresponding decision tree is on the right side of the figure. A decision surface corresponds to a leaf node of the tree, and it can be reached by traversing from the root of the tree, following a series of decisions according to the value of an attribute. The decision surface can only be axis-aligned for the decision tree.

 


Figure 7.8 Decision surfaces can only be axis-aligned

The structure of a decision tree is sensitive to small variations in the training data. Although the dataset is the same, constructing two decision trees based on two different subsets may result in very different trees. If a tree is too deep, overfitting may occur, because each split reduces the training data for subsequent splits.

Decision trees are not a good choice if the dataset contains many irrelevant variables. This is different from the notion that they are robust with redundant variables and correlated variables. If the dataset contains redundant variables, the resulting decision tree ignores all but one of these variables because the algorithm cannot detect information gain by including more redundant variables. On the other hand, if the dataset contains irrelevant variables and if these variables are accidentally chosen as splits in the tree, the tree may grow too large and may end up with less data at every split, where overfitting is likely to occur. To address this problem, feature selection can be introduced in the data preprocessing phase to eliminate the irrelevant variables.

Although decision trees are able to handle correlated variables, decision trees are not well suited when most of the variables in the training set are correlated, since overfitting is likely to occur. To overcome the issue of instability and potential overfitting of deep trees, one can combine the decisions of several randomized shallow decision trees—the basic idea of another classifier called random forest [4]—or use ensemble methods to combine several weak learners for better classification. These methods have been shown to improve predictive power compared to a single decision tree.

For binary decisions, a decision tree works better if the training dataset consists of records with an even probability of each result. In other words, the root of the tree has a 50% chance of either classification. This can be achieved by randomly selecting training records from each possible classification in equal numbers. Such balancing counteracts the likelihood that a tree will stump out early by passing purity tests because of bias in the training data.
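
One hedged way to construct such a balanced training set in R is shown below; the data frame bankdata and its subscribed column are assumed stand-ins for the bank marketing data.

# Balanced sampling sketch: draw equal numbers of "yes" and "no" records
set.seed(42)
yes_rows <- which(bankdata$subscribed == "yes")
no_rows  <- which(bankdata$subscribed == "no")
n <- min(length(yes_rows), length(no_rows))
balanced <- bankdata[c(sample(yes_rows, n), sample(no_rows, n)), ]
table(balanced$subscribed)   # equal counts of yes and no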

When using methods such as logistic regression on a dataset with many variables, decision trees can help determine which variables are the most useful to select based on information gain. Then these variables can be selected for the logistic regression. Decision trees can also be used to prune redundant variables.

7.1.5 Decision Trees in R

In R, the rpart package is used for modeling decision trees, and the optional package rpart.plot enables the plotting of a tree. The rest of this section shows an example of how to use decision trees in R with rpart.plot to predict whether to play golf given factors such as weather outlook, temperature, humidity, and wind.

In R, first set the working directory and initialize the packages.

setwd("c:/")
install.packages("rpart.plot")   # install package rpart.plot
library("rpart")                 # load libraries
library("rpart.plot")

The working directory contains a comma-separated-value (CSV) file named DTdata.csv. The file has a header row, followed by 10 rows of training data.

Play,Outlook,Temperature,Humidity,Wind

yes,rainy,cool,normal,FALSE

no,rainy,cool,normal,TRUE

yes,overcast,hot,high,FALSE

no,sunny,mild,high,FALSE

yes,rainy,cool,normal,FALSE

yes,sunny,cool,normal,FALSE

yes,rainy,cool,normal,FALSE

yes,sunny,hot,normal,FALSE

yes,overcast,mild,high,TRUE

no,sunny,mild,high,TRUE

The CSV file contains five attributes: Play, Outlook, Temperature, Humidity, and Wind. Play would be the output variable (or the predicted class), and Outlook, Temperature, Humidity, and Wind would be the input variables. In R, read the data from the CSV file in the working directory and display the content.

 

play_decision <- read.table("DTdata.csv", header=TRUE, sep=",")

play_decision

Play Outlook Temperature Humidity Wind

1 yes rainy cool normal FALSE

2 no rainy cool normal TRUE

3 yes overcast hot high FALSE

4 no sunny mild high FALSE

5 yes rainy cool normal FALSE

6 yes sunny cool normal FALSE

7 yes rainy cool normal FALSE

8 yes sunny hot normal FALSE

9 yes overcast mild high TRUE

10 no sunny mild high TRUE

Display a summary of play_decision.

summary(play_decision)

  Play    Outlook    Temperature  Humidity     Wind
 no :3   overcast:2   cool:5      high  :4   Mode :logical
 yes:7   rainy   :4   hot :2      normal:6   FALSE:7
         sunny   :4   mild:3                 TRUE :3
                                             NA's :0

The rpart function builds a model of recursive partitioning and regression trees [9]. The following code snippet shows how to use the rpart function to construct a decision tree.

fit <- rpart(Play ~ Outlook + Temperature + Humidity + Wind,
             method="class",
             data=play_decision,
             control=rpart.control(minsplit=1),
             parms=list(split="information"))

The rpart function call here uses five arguments. The first, Play ~ Outlook + Temperature + Humidity + Wind, is the model formula indicating that attribute Play can be predicted based on attributes Outlook, Temperature, Humidity, and Wind. The second argument, method, is set to "class", telling R it is building a classification tree. The third argument, data, specifies the dataframe containing those attributes mentioned in the formula. The fourth argument, control, is optional and controls the tree growth. In the preceding example, control=rpart.control(minsplit=1) requires that each node have at least one observation before attempting a split. The minsplit=1 makes sense for the small dataset, but for larger datasets minsplit could be set to 10% of the dataset size to combat overfitting. Besides minsplit, other parameters are available to control the construction of the decision tree. For example, rpart.control(maxdepth=10, cp=0.001) limits the depth of the tree to no more than 10, and a split must decrease the overall lack of fit by a factor of 0.001 before being attempted. The last argument (parms) specifies the purity measure being used for the splits. The value of split can be either information (for using the information gain) or gini (for using the Gini index).

Enter summary(fit) to produce a summary of the model built from rpart.

The output includes a summary of every node in the constructed decision tree. If a node is a leaf, the output includes both the predicted class label (yes or no for Play) and the class probabilities—P(Play). The leaf nodes include node numbers 4, 5, 6, and 7. If a node is internal, the output in addition displays the number of observations that lead to each child node and the improvement that each attribute may bring for the next split. These internal nodes include numbers 1, 2, and 3.

summary(fit)

Call:

rpart(formula = Play ~ Outlook + Temperature + Humidity + Wind,

    data = play_decision, method = "class",


   control = rpart.control(minsplit = 1))

n= 10

     CP nsplit rel error xerror xstd

1 0.3333333 0 1 1.000000 0.4830459

2 0.0100000 3 0 1.666667 0.5270463

Variable importance

Wind Outlook Temperature

    51    29    20

Node number 1: 10 observations, complexity param=0.3333333

predicted class=yes expected loss=0.3 P(node) =1

class counts: 3    7

probabilities: 0.300 0.700

left son=2 (3 obs) right son=3 (7 obs)

Primary splits:

         Temperature splits as RRL, improve=1.3282860, (0 missing)

         Wind < 0.5 to the right, improve=1.3282860, (0 missing)

         Outlook splits as RLL, improve=0.8161371, (0 missing)

         Humidity splits as LR, improve=0.6326870, (0 missing)

Surrogate splits:

    Wind < 0.5 to the right, agree=0.8, adj=0.333, (0 split)

Node number 2: 3 observations, complexity param=0.3333333

predicted class=no expected loss=0.3333333 P(node) =0.3

class counts: 2    1

probabilities: 0.667 0.333

left son=4 (2 obs) right son=5 (1 obs)

Primary splits:

   Outlook splits as R-L, improve=1.9095430, (0 missing)

   Wind < 0.5 to the left, improve=0.5232481, (0 missing)

Node number 3: 7 observations, complexity param=0.3333333

predicted class=yes expected loss=0.1428571 P(node) =0.7

class counts: 1   6

probabilities: 0.143 0.857

left son=6 (1 obs) right son=7 (6 obs)

Primary splits:

    Wind < 0.5 to the right, improve=2.8708140, (0 missing)

    Outlook splits as RLR, improve=0.6214736, (0 missing)

    Temperature splits as LR-, improve=0.3688021, (0 missing)

    Humidity splits as RL, improve=0.1674470, (0 missing)

Node number 4: 2 observations

predicted class=no expected loss=0 P(node) =0.2

class counts: 2 0

probabilities: 1.000 0.000

Node number 5: 1 observations

predicted class=yes expected loss=0 P(node) =0.1

class counts: 0 1

probabilities: 0.000 1.000

Node number 6: 1 observations

predicted class=no expected loss=0 P(node) =0.1

class counts: 1 0

probabilities: 1.000 0.000

Node number 7: 6 observations

predicted class=yes expected loss=0 P(node) =0.6

class counts: 0 6

probabilities: 0.000 1.000

The output produced by the summary is difficult to read and comprehend. The rpart.plot() function from the rpart.plot package can visually represent the output in a decision tree. Enter the following command to see the help file of rpart.plot: ?rpart.plot

Enter the following R code to plot the tree based on the model being built. The resulting tree is shown in Figure 7.9. Each node of the tree is labeled as either yes or no referring to the Play action of whether to play outside. Note that, by default, R has converted the values of Wind (True/False) into numbers.

rpart.plot(fit, type=4, extra=1)

 


Figure 7.9 A decision tree built from DTdata.csv

The decisions in Figure 7.9 are abbreviated. Use the following command to spell out the full names and display the classification rate at each node.

rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE,

     varlen=0, faclen=0)

The decision tree can be used to predict outcomes for new datasets. Consider a testing set that contains the following record.

Outlook=“rainy”, Temperature=“mild”, Humidity=“high”, Wind=FALSE

The goal is to predict the play decision of this record. The following code loads the data into R as a data frame newdata. Note that the training set does not contain this case.

newdata <- data.frame(Outlook="rainy", Temperature="mild",
                      Humidity="high", Wind=FALSE)

newdata

Outlook Temperature Humidity Wind

1 rainy mild high FALSE

Next, use the predict function to generate predictions from a fitted rpart object. The format of the predict function follows.

predict(object, newdata = list(),
        type = c("vector", "prob", "class", "matrix"))

Parameter type is a character string denoting the type of the predicted value. Set it to either prob or class to predict using a decision tree model and receive the result as either the class probabilities or just the class. The output that follows shows a probability of 1 for Play=no and a probability of 0 for Play=yes. Therefore, in both cases, the decision tree predicts that the play decision of the testing set is not to play.

predict(fit, newdata=newdata, type="prob")

no yes

1 1 0

predict(fit, newdata=newdata, type="class")

1

no

Levels: no yes

  

Wednesday, March 16, 2022

Data Science and Big Data Analytics

 Discovering, Analyzing, Visualizing and Presenting Data


Advanced Analytical Theory and Methods: Regression

6.2 Logistic Regression

In linear regression modeling, the outcome variable is a continuous variable. As seen in the earlier Income example, linear regression can be used to model the relationship of age and education to income. Suppose a person’s actual income was not of interest, but rather whether someone was wealthy or poor. In such a case, when the outcome variable is categorical in nature, logistic regression can be used to predict the likelihood of an outcome based on the input variables. Although logistic regression can be applied to an outcome variable that represents multiple values, the following discussion examines the case in which the outcome variable represents two values such as true/false, pass/fail, or yes/no.

For example, a logistic regression model can be built to determine if a person will or will not purchase a new automobile in the next 12 months. The training set could include input variables for a person’s age, income, and gender as well as the age of an existing automobile. The training set would also include the outcome variable on whether the person purchased a new automobile over a 12-month period. The logistic regression model provides the likelihood or probability of a person making a purchase in the next 12 months. After examining a few more use cases for logistic regression, the remaining portion of this chapter examines how to build and evaluate a logistic regression model.

6.2.1 Use Cases

The logistic regression model is applied to a variety of situations in both the public and the private sector. Some common ways that the logistic regression model is used include the following:

Medical: Develop a model to determine the likelihood of a patient’s successful response to a specific medical treatment or procedure. Input variables could include age, weight, blood pressure, and cholesterol levels.

Finance: Using a loan applicant’s credit history and the details on the loan, determine the probability that an applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.

Marketing: Determine a wireless customer’s probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts. With such insight, target the high-probability customers with appropriate offers to prevent churn.

Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.

6.2.2 Model Description

Logistic regression is based on the logistic function f(y), as given in Equation 6.7.

6.7   f(y) = e^y / (1 + e^y),   where −∞ < y < ∞

Figure 6.14 The logistic function

Because the range of f(y) is (0, 1), the logistic function appears to be an appropriate function to model the probability of a particular outcome occurring. As the value of y increases, the probability of the outcome occurring increases. In any proposed model, to predict the likelihood of an outcome, y needs to be a function of the input variables. In logistic regression, y is expressed as a linear function of the input variables. In other words, the formula shown in Equation 6.8 applies.

 
6.8   y = β0 + β1x1 + β2x2 + … + βp−1xp−1

With p denoting the probability of the outcome, combining Equations 6.7 and 6.8 expresses p as a function of the input variables (Equation 6.9), and solving for y yields Equation 6.10: ln(p/(1 − p)) = y = β0 + β1x1 + β2x2 + … + βp−1xp−1.

The quantity ln(p/(1 − p)) in Equation 6.10 is known as the log odds ratio, or the logit of p. Techniques such as Maximum Likelihood Estimation (MLE) are used to estimate the model parameters. MLE determines the values of the model parameters that maximize the chances of observing the given dataset. However, the specifics of implementing MLE are beyond the scope of this book.

The following example helps to clarify the logistic regression model. The mechanics of using R to fit a logistic regression model are covered in the next section on evaluating the fitted model. In this section, the discussion focuses on interpreting the fitted model.

Customer Churn Example

A wireless telecommunications company wants to estimate the probability that a customer will churn (switch to a different company) in the next six months. With a reasonably accurate prediction of a person’s likelihood of churning, the sales and marketing groups can attempt to retain the customer by offering various incentives. Data on 8,000 current and prior customers was obtained. The variables collected for each customer follow:

• Age (years)

• Married (true/false)

• Duration as a customer (years)

• Churned_contacts (count)—Number of the customer’s contacts that have churned (count)

• Churned (true/false)—Whether the customer churned

After analyzing the data and fitting a logistic regression model, Age and Churned_contacts were selected as the best predictor variables. Equation 6.11 provides the estimated model parameters.

6.11       y = 3.50 – 0.16 * Age + 0.38 * Churned_contacts

Using the fitted model from Equation 6.11, Table 6.1 provides the probability of a customer churning based on the customer’s age and the number of churned contacts. The computed values of y are also provided in the table. Recalling the previous discussion of the logistic function, as the value of y increases, so does the probability of churning.

Table 6.1 Estimated Churn Probabilities

 


Based on the fitted model, there is a 93% chance that a 20-year-old customer with six churned contacts will also churn. (See the last row of Table 6.1.) Examining the sign and values of the estimated coefficients in Equation 6.11, it is observed that as the value of Age increases, the value of y decreases. Thus, the negative Age coefficient indicates that the probability of churning decreases for an older customer. On the other hand, based on the positive sign of the Churned_contacts coefficient, the value of y and subsequently the probability of churning increases as the number of churned contacts increases.
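
To make the arithmetic behind Table 6.1 concrete, the following R lines reproduce the 93% figure for a 20-year-old customer with six churned contacts using the fitted coefficients from Equation 6.11 (an illustrative calculation, not output from the text).

# Probability of churn from Equation 6.11 (illustrative)
age <- 20
churned_contacts <- 6
y <- 3.50 - 0.16 * age + 0.38 * churned_contacts   # y = 2.58
p <- exp(y) / (1 + exp(y))                         # logistic function
round(p, 2)                                        # approximately 0.93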

6.2.3 Diagnostics

The churn example illustrates how to interpret a fitted logistic regression model. Using R, this section examines the steps to develop a logistic regression model and evaluate the model’s effectiveness. For this example, the churn_input data frame is structured as follows:

head(churn_input)

  ID Churned Age Married Cust_years Churned_contacts
1  1       0  61       1          3                1
2  2       0  50       1          3                2
3  3       0  47       1          2                0
4  4       0  50       1          3                3
5  5       0  29       1          1                3
6  6       0  43       1          4                3

A Churned value of 1 indicates that the customer churned. A Churned value of 0 indicates that the customer remained as a subscriber. Out of the 8,000 customer records in this dataset, 1,743 customers (approximately 22%) churned.

sum(churn_input$Churned)

 

[1] 1743

Using the Generalized Linear Model function, glm(), in R and the specified family/link, a logistic regression model can be applied to the variables in the dataset and examined as follows:

Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years +
                       Churned_contacts, data=churn_input,
                       family=binomial(link="logit"))

summary(Churn_logistic1)

Coefficients:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.415201   0.163734  20.858   <2e-16 ***
Age              -0.156643   0.004088 -38.320   <2e-16 ***
Married           0.066432   0.068302   0.973    0.331
Cust_years        0.017857   0.030497   0.586    0.558
Churned_contacts  0.382324   0.027313  13.998   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As in the linear regression case, there are tests to determine if the coefficients are significantly different from zero. Such significant coefficients correspond to small values of Pr(>|z|), which denote the p-value for the hypothesis test to determine if the estimated model parameter is significantly different from zero. Rerunning this analysis without the Cust_years variable, which had the largest corresponding p-value, yields the following:

Churn_logistic2 <- glm(Churned ~ Age + Married + Churned_contacts,
                       data=churn_input, family=binomial(link="logit"))

summary(Churn_logistic2)

Coefficients:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.472062   0.132107  26.282   <2e-16 ***
Age              -0.156635   0.004088 -38.318   <2e-16 ***
Married           0.066430   0.068299   0.973    0.331
Churned_contacts  0.381909   0.027302  13.988   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value for the Married coefficient remains quite large, the Married variable is dropped from the model. The following R code provides the third and final model, which includes only the Age and Churned_contacts variables:

Churn_logistic3 <- glm(Churned ~ Age + Churned_contacts,
                       data=churn_input, family=binomial(link="logit"))

summary(Churn_logistic3)

Call:

glm(formula = Churned ~ Age + Churned_contacts,
    family = binomial(link = "logit"), data = churn_input)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.4599 -0.5214 -0.1960 -0.0736 3.3671

Coefficients:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.502716   0.128430   27.27   <2e-16 ***
Age              -0.156551   0.004085  -38.32   <2e-16 ***
Churned_contacts  0.381857   0.027297   13.99   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 8387.3 on 7999 degrees of freedom

Residual deviance: 5359.2 on 7997 degrees of freedom

AIC: 5365.2

Number of Fisher Scoring iterations: 6

For this final model, the entire summary output is provided. The output offers several values that can be used to evaluate the fitted model. It should be noted that the model parameter estimates correspond to the values provided in Equation 6.11 that were used to construct Table 6.1.

Deviance and the Pseudo-R2

In logistic regression, deviance is defined to be −2 logL, where L is the maximized value of the likelihood function that was used to obtain the parameter estimates. In the R output, two deviance values are provided. The null deviance is the value where the likelihood function is based only on the intercept term (y = β0). The residual deviance is the value where the likelihood function is based on the parameters in the specified logistic model, shown in Equation 6.12.

6.12     y=β0 + β1 * Age + β2 * Churned_contacts

A metric analogous to R 2 in linear regression can be computed as shown in Equation 6.13.

6.13   pseudo-R² = 1 − (residual deviance / null deviance) = (null deviance − residual deviance) / null deviance

The pseudo-R2 is a measure of how well the fitted model explains the data as compared to the default model of no predictor variables and only an intercept term. A pseudo-R2 value near 1 indicates a good fit over the simple null model.
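
Using the null and residual deviances reported in the summary of Churn_logistic3, the pseudo-R² can be computed directly in R (a quick illustrative check, not part of the original output).

# Pseudo-R^2 from the reported deviances of Churn_logistic3
null_dev <- 8387.3
res_dev  <- 5359.2
pseudo_R2 <- 1 - res_dev / null_dev
pseudo_R2   # approximately 0.361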

Deviance and the Log-Likelihood Ratio Test

In the pseudo-R² calculation, the −2 multipliers simply divide out. So, it may appear that including such a multiplier does not provide a benefit. However, the multiplier in the deviance definition is based on the log-likelihood ratio test statistic shown in Equation 6.14:

 
6.14   T = −2 log(L_null / L_alt) = null deviance − residual deviance

Under the null hypothesis, T approximately follows a chi-squared (χ²) distribution with p degrees of freedom,
where p is the number of parameters in the fitted model.

So, in a hypothesis test, a large value of T would indicate that the fitted model is significantly better than the null model that uses only the intercept term.

In the churn example, the log-likelihood ratio statistic would be this:

T = 8387.3 − 5359.2 = 3028.1 with 2 degrees of freedom and a corresponding p-value that is essentially zero.
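
The p-value for this test statistic can be checked in R with the chi-squared distribution function; this one-liner is an illustrative check rather than output from the text.

# Chi-squared tail probability for T = 3028.1 with 2 degrees of freedom
pchisq(3028.1, df = 2, lower.tail = FALSE)   # essentially zero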

So far, the log-likelihood ratio test discussion has focused on comparing a fitted model to the default model of using only the intercept. However, the log-likelihood ratio test can also compare one fitted model to another. For example, consider the logistic regression model when the categorical variable Married is included with Age and Churned_contacts in the list of input variables. The partial R output for such a model is provided here:

summary(Churn_logistic2)

Call:

glm(formula = Churned ~ Age + Married + Churned_contacts,
    family = binomial(link = "logit"), data = churn_input)

Coefficients:

                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.472062   0.132107  26.282   <2e-16 ***
Age              -0.156635   0.004088 -38.318   <2e-16 ***
Married           0.066430   0.068299   0.973    0.331
Churned_contacts  0.381909   0.027302  13.988   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 8387.3 on 7999 degrees of freedom

Residual deviance: 5358.3 on 7996 degrees of freedom

The residual deviances from each model can be used to perform a hypothesis test of H0 : βMarried = 0 against HA : βMarried ≠ 0 using the base model that includes the Age and Churned_contacts variables. The test statistic follows:

                               T = 5359.2 − 5358.3 = 0.9 with 7997 − 7996 = 1 degree of freedom

Using R, the corresponding p-value is calculated as follows:

pchisq(0.9, 1, lower=FALSE)

 

[1] 0.3427817

Thus, at a 66% or higher confidence level, the null hypothesis,H0 : βmarried = 0 , would not be rejected. Thus, it seems reasonable to exclude the variable Married from the logistic regression model.

In general, this log-likelihood ratio test is particularly useful for forward and backward step-wise methods to add variables to or remove them from the proposed logistic regression model.

Receiver Operating Characteristic (ROC) Curve

Logistic regression is often used as a classifier to assign class labels to a person, item, or transaction based on the predicted probability provided by the model. In the Churn example, a customer can be classified with the label called Churn if the logistic model predicts a high probability that the customer will churn. Otherwise, a Remain label is assigned to the customer. Commonly, 0.5 is used as the default probability threshold to distinguish between any two class labels. However, any threshold value can be used depending on the preference to avoid false positives (for example, to predict Churn when actually the customer will Remain) or false negatives (for example, to predict Remain when the customer will actually Churn).

In general, for two class labels, C and ¬C, where “¬C” denotes “not C,” some working definitions and formulas follow:

• True Positive: predict C, when actually C

• True Negative: predict ¬C, when actually ¬C

• False Positive: predict C, when actually ¬C

• False Negative: predict ¬C, when actually C

 
True Positive Rate (TPR) = # true positives / (# true positives + # false negatives)

False Positive Rate (FPR) = # false positives / (# false positives + # true negatives)

The plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) is known as the Receiver Operating Characteristic (ROC) curve. Using the ROCR package, the following R commands generate the ROC curve for the Churn example:

library(ROCR)

pred = predict(Churn_logistic3, type="response")
predObj = prediction(pred, churn_input$Churned)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]], 4)))

The usefulness of this plot in Figure 6.15 is that the preferred outcome of a classifier is to have a low FPR and a high TPR. So, when moving from left to right on the FPR axis, a good model/classifier has the TPR rapidly approach values near 1, with only a small change in FPR. The closer the ROC curve tracks along the vertical axis and approaches the upper-left corner of the plot, near the point (0,1), the better the model/classifier performs. Thus, a useful metric is to compute the area under the ROC curve (AUC). By examining the axes, it can be seen that the theoretical maximum for the area is 1.

 


Figure 6.15 ROC curve for the churn example

To illustrate how the FPR and TPR values are dependent on the threshold value used for the classifier, the plot in Figure 6.16 was constructed using the following R code:

# extract the alpha (threshold), FPR, and TPR values from rocObj
alpha <- round(as.numeric(unlist(rocObj@alpha.values)), 4)
fpr <- round(as.numeric(unlist(rocObj@x.values)), 4)
tpr <- round(as.numeric(unlist(rocObj@y.values)), 4)

# adjust margins and plot TPR and FPR
par(mar = c(5,5,2,5))
plot(alpha, tpr, xlab="Threshold", xlim=c(0,1),
     ylab="True positive rate", type="l")

par(new=TRUE)
plot(alpha, fpr, xlab="", ylab="", axes=F, xlim=c(0,1), type="l")
axis(side=4)
mtext(side=4, line=3, "False positive rate")
text(0.18, 0.18, "FPR")
text(0.58, 0.58, "TPR")

 


Figure 6.16 The effect of the threshold value in the churn example

For a threshold value of 0, every item is classified as a positive outcome. Thus, the TPR value is 1. However, all the negatives are also classified as a positive, and the FPR value is also 1. As the threshold value increases, more and more negative class labels are assigned. Thus, the FPR and TPR values decrease. When the threshold reaches 1, no positive labels are assigned, and the FPR and TPR values are both 0.

For the purposes of a classifier, a commonly used threshold value is 0.5. A positive label is assigned for any probability of 0.5 or greater. Otherwise, a negative label is assigned. As the following R code illustrates, in the analysis of the Churn dataset, the 0.5 threshold corresponds to a TPR value of 0.56 and a FPR value of 0.08.

i <- which(round(alpha, 2) == 0.5)
paste("Threshold=", (alpha[i]), " TPR=", tpr[i], " FPR=", fpr[i])

[1] "Threshold= 0.5004 TPR= 0.5571 FPR= 0.0793"

Thus, 56% of customers who will churn are properly classified with the Churn label, and 8% of the customers who will remain as customers are improperly labeled as Churn. If identifying only 56% of the churners is not acceptable, then the threshold could be lowered. For example, suppose it was decided to classify with a Churn label any customer with a probability of churning greater than 0.15. Then the following R code indicates that the corresponding TPR and FPR values are 0.91 and 0.29, respectively. Thus, 91% of the customers who will churn are properly identified, but at a cost of misclassifying 29% of the customers who will remain.

i <- which(round(alpha, 2) == 0.15)
paste("Threshold=", (alpha[i]), " TPR=", tpr[i], " FPR=", fpr[i])

[1] "Threshold= 0.1543 TPR= 0.9116 FPR= 0.2869"
[2] "Threshold= 0.1518 TPR= 0.9122 FPR= 0.2875"
[3] "Threshold= 0.1479 TPR= 0.9145 FPR= 0.2942"
[4] "Threshold= 0.1455 TPR= 0.9174 FPR= 0.2981"

The ROC curve is useful for evaluating other classifiers and will be utilized again in Chapter 7, “Advanced Analytical Theory and Methods: Classification.”

Histogram of the Probabilities

It can be useful to visualize the observed responses against the estimated probabilities provided by the logistic regression. Figure 6.17 provides overlaid histograms for the customers who churned and for the customers who remained as customers. With a properly fitting logistic model, the customers who remained tend to have a low estimated probability of churning. Conversely, the customers who churned tend to have a high estimated probability of churning. This histogram plot helps visualize the number of items to be properly classified or misclassified. In the Churn example, an ideal histogram plot would have the remaining customers grouped at the left side of the plot, the customers who churned at the right side of the plot, and no overlap of these two groups.
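
A plot along the lines of Figure 6.17 can be produced with base R histograms; the code below is a hedged sketch that reuses the pred probabilities and the Churned labels from the earlier ROC example, and the colors and breaks are arbitrary choices.

# Overlaid histograms of predicted churn probabilities (illustrative sketch)
churn_probs  <- pred[churn_input$Churned == 1]
remain_probs <- pred[churn_input$Churned == 0]

hist(remain_probs, breaks = 20, col = rgb(0, 0, 1, 0.5),
     xlim = c(0, 1), xlab = "Estimated churn probability",
     main = "Observed outcomes vs. estimated probabilities")
hist(churn_probs, breaks = 20, col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("top", legend = c("Remained", "Churned"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))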

 


Figure 6.17 Customer counts versus estimated churn probability
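A plot similar to Figure 6.17 could be produced with the R sketch below. The object names are assumptions for illustration only: churn_logit stands for the fitted logistic model, and churn$Churned holds the observed 0/1 outcome; adjust these names to match your session.

# a minimal sketch of an overlaid probability histogram; churn_logit and
# churn$Churned are hypothetical names for the fitted model and outcome
pred <- predict(churn_logit, type="response")      # estimated churn probabilities
hist(pred[churn$Churned == 0], breaks=20, col=rgb(0,0,1,0.5),
     xlim=c(0,1), xlab="Estimated churn probability",
     ylab="Customer count", main="")
hist(pred[churn$Churned == 1], breaks=20, col=rgb(1,0,0,0.5), add=TRUE)
legend("topright", legend=c("Remained","Churned"),
       fill=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)))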

 

6.3 Reasons to Choose and Cautions

Linear regression is suitable when the input variables are continuous or discrete (including categorical data types) and the outcome variable is continuous. If the outcome variable is categorical, logistic regression is a better choice.

Both models assume a linear additive function of the input variables. If such an assumption does not hold true, both regression techniques perform poorly. Furthermore, in linear regression, the assumption of normally distributed error terms with a constant variance is important for many of the statistical inferences drawn from the model. If the various assumptions do not appear to hold, the appropriate transformations need to be applied to the data.
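As an illustrative sketch (not the authors' code), the following R lines show how these assumptions might be checked with residual diagnostic plots and how a log transformation could be applied to a right-skewed outcome; the data frame income_df and its columns are hypothetical names.

# a minimal sketch, assuming a hypothetical data frame income_df
# with columns Income, Age, and Education
income_lm <- lm(Income ~ Age + Education, data=income_df)
plot(income_lm)                       # residual and Q-Q diagnostic plots
# if the residuals are skewed or their spread grows with the fitted
# values, a transformation of the outcome may help
income_log_lm <- lm(log(Income) ~ Age + Education, data=income_df)
plot(income_log_lm)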

Although a collection of input variables may be a good predictor for the outcome variable, the analyst should not infer that the input variables directly cause an outcome. For example, it may be identified that those individuals who have regular dentist visits may have a reduced risk of heart attacks. However, simply sending someone to the dentist almost certainly has no effect on that person’s chance of having a heart attack. It is possible that regular dentist visits may indicate a person’s overall health and dietary choices, which may have a more direct impact on a person’s health. This example illustrates the commonly known expression, “Correlation does not imply causation.”

Use caution when applying an already fitted model to data that falls outside the dataset used to train the model. The linear relationship in a regression model may no longer hold at values outside the training dataset. For example, if income was an input variable and the values of income ranged from $35,000 to $90,000, applying the model to incomes well outside that range could result in inaccurate estimates and predictions.

The income regression example in Section 6.1.2 mentioned the possibility of using categorical variables to represent the 50 U.S. states. In a linear regression model, the state of residence would provide a simple additive term to the income model but no other impact on the coefficients of the other input variables, such as Age and Education. However, if state does influence the other variables’ impact to the income model, an alternative approach would be to build 50 separate linear regression models: one model for each state. Such an approach is an example of the options and decisions that the data scientist must be willing to consider.
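A minimal sketch of the one-model-per-state alternative is shown below; the data frame income_df and its State column are hypothetical names, and each element of the resulting list is a separate linear model fit.

# fit one linear regression per state (income_df and its columns are hypothetical)
state_models <- lapply(split(income_df, income_df$State),
                       function(d) lm(Income ~ Age + Education, data=d))
coef(state_models[["Wyoming"]])       # coefficients for one state's model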

If several of the input variables are highly correlated to each other, the condition is known as multicollinearity. Multicollinearity can often lead to coefficient estimates that are relatively large in absolute magnitude and may be of inappropriate direction (negative or positive sign). When possible, the majority of these correlated variables should be removed from the model or replaced by a new variable that is a function of the correlated variables. For example, in a medical application of regression, height and weight may be considered important input variables, but these variables tend to be correlated. In this case, it may be useful to use the Body Mass Index (BMI), which is a function of a person’s height and weight.

    BMI = weight / height², where weight is in kilograms and height is in meters.
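For example, the correlation between height and weight could be checked directly, or variance inflation factors could be computed with the vif() function from the car package; the sketch below uses a hypothetical data frame named patients.

# a minimal sketch, assuming a hypothetical data frame patients with
# columns outcome, weight_kg, and height_m
cor(patients$weight_kg, patients$height_m)                  # pairwise correlation
library(car)
vif(lm(outcome ~ weight_kg + height_m, data=patients))      # variance inflation factors
# replace the correlated pair with a single derived variable
patients$BMI <- patients$weight_kg / patients$height_m^2
bmi_lm <- lm(outcome ~ BMI, data=patients)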

However, in some cases it may be necessary to use the correlated variables. The next section provides some techniques to address highly correlated variables.

 

6.4 Additional Regression Models

In the case of multicollinearity, it may make sense to place some restrictions on the magnitudes of the estimated coefficients. Ridge regression, which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. Lasso regression is a related modeling technique in which the penalty is proportional to the sum of the absolute values of the coefficients.
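Both penalties are available, for example, in the glmnet package; the sketch below is illustrative only, with x assumed to be a numeric matrix of input variables and y the outcome vector.

# a minimal sketch using the glmnet package; x (input matrix) and y
# (outcome vector) are assumed to already exist
library(glmnet)
ridge_fit <- cv.glmnet(x, y, alpha=0)     # alpha = 0 applies the ridge penalty
lasso_fit <- cv.glmnet(x, y, alpha=1)     # alpha = 1 applies the lasso penalty
coef(ridge_fit, s="lambda.min")           # coefficients at the CV-selected lambda
coef(lasso_fit, s="lambda.min")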

Only binary outcome variables were examined in the use of logistic regression. If the outcome variable can assume more than two states, multinomial logistic regression can be used.
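One common option is the multinom() function from the nnet package; the following sketch is illustrative, with the data frame df, the multi-level factor outcome, and the inputs x1 and x2 all being hypothetical names.

# a minimal sketch of multinomial logistic regression with nnet::multinom();
# df, outcome (a factor with more than two levels), x1, and x2 are hypothetical
library(nnet)
multi_fit <- multinom(outcome ~ x1 + x2, data=df)
summary(multi_fit)
predict(multi_fit, newdata=df, type="probs")     # estimated class probabilities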

 

Summary

This chapter discussed the use of linear regression and logistic regression to model historical data and to predict future outcomes. Using R, examples of each regression technique were presented. Several diagnostics to evaluate the models and the underlying assumptions were covered.

Although regression analysis is relatively straightforward to perform using many existing software packages, considerable care must be taken in performing and interpreting a regression analysis. This chapter highlighted that in a regression analysis, the data scientist needs to do the following:

• Determine the best input variables and their relationship to the outcome variable.

• Understand the underlying assumptions and their impact on the modeling results.

• Transform the variables, as appropriate, to achieve adherence to the model assumptions.

• Decide whether building one comprehensive model is the best choice or consider building many models on partitions of the data.

 

Exercises

1. In the Income linear regression example, consider the distribution of the outcome variable Income. Income values tend to be highly skewed to the right (distribution of value has a large tail to the right). Does such a non-normally distributed outcome variable violate the general assumption of a linear regression model? Provide supporting arguments.

2. In the use of a categorical variable with n possible values, explain the following:

      1. Why only n – 1 binary variables are necessary

      2. Why using n variables would be problematic

3. In the example of using Wyoming as the reference case, discuss the effect on the estimated model parameters, including the intercept, if another state was selected as the reference case.

4. Describe how logistic regression can be used as a classifier.

5. Discuss how the ROC curve can be used to determine an appropriate threshold value for a classifier.

6. If the probability of an event occurring is 0.4, then

      1. What is the odds ratio?

      2. What is the log odds ratio?

7. If β3 = -0.5 is an estimated coefficient in a logistic regression model, what is the effect on the odds ratio for every one-unit increase in the value of x3?

