
Wednesday, March 16, 2022

Data Science and Big Data Analytics

 Discovering, Analyzing, Visualizing and Presenting Data


Advanced Analytical Theory and Methods: Regression

6.2 Logistic Regression

In linear regression modeling, the outcome variable is a continuous variable. As seen in the earlier Income example, linear regression can be used to model the relationship of age and education to income. Suppose a person’s actual income was not of interest, but rather whether someone was wealthy or poor. In such a case, when the outcome variable is categorical in nature, logistic regression can be used to predict the likelihood of an outcome based on the input variables. Although logistic regression can be applied to an outcome variable that represents multiple values, the following discussion examines the case in which the outcome variable represents two values such as true/false, pass/fail, or yes/no.

For example, a logistic regression model can be built to determine if a person will or will not purchase a new automobile in the next 12 months. The training set could include input variables for a person’s age, income, and gender as well as the age of an existing automobile. The training set would also include the outcome variable on whether the person purchased a new automobile over a 12-month period. The logistic regression model provides the likelihood or probability of a person making a purchase in the next 12 months. After examining a few more use cases for logistic regression, the remaining portion of this chapter examines how to build and evaluate a logistic regression model.

6.2.1 Use Cases

The logistic regression model is applied to a variety of situations in both the public and the private sector. Some common ways that the logistic regression model is used include the following:

Medical: Develop a model to determine the likelihood of a patient’s successful response to a specific medical treatment or procedure. Input variables could include age, weight, blood pressure, and cholesterol levels.

Finance: Using a loan applicant’s credit history and the details on the loan, determine the probability that an applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.

Marketing: Determine a wireless customer’s probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts. With such insight, target the high-probability customers with appropriate offers to prevent churn.

Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.

6.2.2 Model Description

Logistic regression is based on the logistic function f(y), as given in Equation 6.7.

6.7       f(y) = e^y / (1 + e^y)      for −∞ < y < ∞

Figure 6.14 The logistic function
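To see the S-shaped curve of Figure 6.14, the logistic function can be plotted directly in R. This is a quick sketch added here for illustration, not code from the original text:

# plot the logistic function f(y) = e^y / (1 + e^y)
y <- seq(-6, 6, by = 0.1)
f_y <- exp(y) / (1 + exp(y))
plot(y, f_y, type = "l", xlab = "y", ylab = "f(y)")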

Because the range of f(y) is (0, 1), the logistic function appears to be an appropriate function to model the probability of a particular outcome occurring. As the value of y increases, the probability of the outcome occurring increases. In any proposed model, to predict the likelihood of an outcome, y needs to be a function of the input variables. In logistic regression, y is expressed as a linear function of the input variables x1, x2, …, xp-1. In other words, the formula shown in Equation 6.8 applies.

6.8       y = β0 + β1*x1 + β2*x2 + … + βp-1*xp-1

Combining Equations 6.7 and 6.8, the probability p of the outcome occurring can be expressed as shown in Equation 6.9.

6.9       p = f(y) = e^y / (1 + e^y)

Solving Equation 6.9 for y yields Equation 6.10.

6.10      ln(p/(1 − p)) = y = β0 + β1*x1 + β2*x2 + … + βp-1*xp-1

The quantity ln(p/(1 − p)) in Equation 6.10 is known as the log odds ratio, or the logit of p. Techniques such as Maximum Likelihood Estimation (MLE) are used to estimate the model parameters. MLE determines the values of the model parameters that maximize the chances of observing the given dataset. However, the specifics of implementing MLE are beyond the scope of this book.

The following example helps to clarify the logistic regression model. The mechanics of using R to fit a logistic regression model are covered in the next section on evaluating the fitted model. In this section, the discussion focuses on interpreting the fitted model.

Customer Churn Example

A wireless telecommunications company wants to estimate the probability that a customer will churn (switch to a different company) in the next six months. With a reasonably accurate prediction of a person’s likelihood of churning, the sales and marketing groups can attempt to retain the customer by offering various incentives. Data on 8,000 current and prior customers was obtained. The variables collected for each customer follow:

• Age (years)

• Married (true/false)

• Duration as a customer (years)

• Churned_contacts (count)—Number of the customer’s contacts that have churned

• Churned (true/false)—Whether the customer churned

After analyzing the data and fitting a logistic regression model, Age and Churned_contacts were selected as the best predictor variables. Equation 6.11 provides the estimated model parameters.

6.11       y = 3.50 – 0.16 * Age + 0.38 * Churned_contacts

Using the fitted model from Equation 6.11, Table 6.1 provides the probability of a customer churning based on the customer’s age and the number of churned contacts. The computed values of y are also provided in the table. Recalling the previous discussion of the logistic function, as the value of y increases, so does the probability of churning.

Table 6.1 Estimated Churn Probabilities

 


Based on the fitted model, there is a 93% chance that a 20-year-old customer with six churned contacts will also churn. (See the last row of Table 6.1.) Examining the sign and values of the estimated coefficients in Equation 6.11, it is observed that as the value of Age increases, the value of y decreases. Thus, the negative Age coefficient indicates that the probability of churning decreases for an older customer. On the other hand, based on the positive sign of the Churned_contacts coefficient, the value of y, and subsequently the probability of churning, increases as the number of churned contacts increases.
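As a quick check on that last row of Table 6.1, the fitted model in Equation 6.11 can be evaluated directly in R. This is a small sketch added for illustration, not part of the original example:

# churn probability for a 20-year-old customer with six churned contacts
y <- 3.50 - 0.16 * 20 + 0.38 * 6    # y = 2.58
p <- exp(y) / (1 + exp(y))          # p is approximately 0.93
p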

6.2.3 Diagnostics

The churn example illustrates how to interpret a fitted logistic regression model. Using R, this section examines the steps to develop a logistic regression model and evaluate the model’s effectiveness. For this example, the churn_input data frame is structured as follows:

head(churn_input)

  ID Churned Age Married Cust_years Churned_contacts
1  1       0  61       1          3                1
2  2       0  50       1          3                2
3  3       0  47       1          2                0
4  4       0  50       1          3                3
5  5       0  29       1          1                3
6  6       0  43       1          4                3

A Churned value of 1 indicates that the customer churned. A Churned value of 0 indicates that the customer remained as a subscriber. Out of the 8,000 customer records in this dataset, 1,743 customers (~22%) churned.

sum(churn_input$Churned)

 

[1] 1743

Using the Generalized Linear Model function, glm(), in R and the specified family/link, a logistic regression model can be applied to the variables in the dataset and examined as follows:

Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years +
                       Churned_contacts, data=churn_input,
                       family=binomial(link="logit"))
summary(Churn_logistic1)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.415201   0.163734  20.858   <2e-16 ***
Age              -0.156643   0.004088 -38.320   <2e-16 ***
Married           0.066432   0.068302   0.973    0.331
Cust_years        0.017857   0.030497   0.586    0.558
Churned_contacts  0.382324   0.027313  13.998   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As in the linear regression case, there are tests to determine if the coefficients are significantly different from zero. Such significant coefficients correspond to small values of Pr(>|z|), which denote the p-value for the hypothesis test to determine if the estimated model parameter is significantly different from zero. Rerunning this analysis without the Cust_years variable, which had the largest corresponding p-value, yields the following:

Churn_logistic2 <- glm(Churned ~ Age + Married + Churned_contacts,
                       data=churn_input, family=binomial(link="logit"))
summary(Churn_logistic2)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.472062   0.132107  26.282   <2e-16 ***
Age              -0.156635   0.004088 -38.318   <2e-16 ***
Married           0.066430   0.068299   0.973    0.331
Churned_contacts  0.381909   0.027302  13.988   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the p-value for the Married coefficient remains quite large, the Married variable is dropped from the model. The following R code provides the third and final model, which includes only the Age and Churned_contacts variables:

Churn_logistic3 <- glm(Churned ~ Age + Churned_contacts,
                       data=churn_input, family=binomial(link="logit"))
summary(Churn_logistic3)

Call:
glm(formula = Churned ~ Age + Churned_contacts,
    family = binomial(link = "logit"), data = churn_input)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4599  -0.5214  -0.1960  -0.0736   3.3671

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.502716   0.128430   27.27   <2e-16 ***
Age              -0.156551   0.004085  -38.32   <2e-16 ***
Churned_contacts  0.381857   0.027297   13.99   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8387.3  on 7999  degrees of freedom
Residual deviance: 5359.2  on 7997  degrees of freedom
AIC: 5365.2

Number of Fisher Scoring iterations: 6

For this final model, the entire summary output is provided. The output offers several values that can be used to evaluate the fitted model. It should be noted that the model parameter estimates correspond to the values provided in Equation 6.11 that were used to construct Table 6.1.
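This correspondence can be verified by extracting the coefficients from the fitted model object. A quick check, assuming the Churn_logistic3 object created above:

# the estimates round to the values used in Equation 6.11:
# intercept 3.50, Age -0.16, Churned_contacts 0.38
coef(Churn_logistic3)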

Deviance and the Pseudo-R2

In logistic regression, deviance is defined to be −2 * logL, where L is the maximized value of the likelihood function that was used to obtain the parameter estimates. In the R output, two deviance values are provided. The null deviance is the value where the likelihood function is based only on the intercept term (y = β0). The residual deviance is the value where the likelihood function is based on the parameters in the specified logistic model, shown in Equation 6.12.

6.12      y = β0 + β1 * Age + β2 * Churned_contacts

A metric analogous to R2 in linear regression can be computed as shown in Equation 6.13.

6.13      pseudo-R2 = 1 − (residual deviance / null deviance) = (null deviance − residual deviance) / null deviance

The pseudo-R2 is a measure of how well the fitted model explains the data as compared to the default model of no predictor variables and only an intercept term. A pseudo-R2 value near 1 indicates a good fit over the simple null model.
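For the churn model, the pseudo-R2 can be computed from the deviances reported in the summary output. A short sketch, not from the original text:

# pseudo-R2 = 1 - (residual deviance / null deviance)
1 - Churn_logistic3$deviance / Churn_logistic3$null.deviance
# equivalently, from the reported values: 1 - 5359.2/8387.3, approximately 0.36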

Deviance and the Log-Likelihood Ratio Test

In the pseudo-R2 calculation, the −2 multipliers simply divide out. So, it may appear that including such a multiplier does not provide a benefit. However, the multiplier in the deviance definition is based on the log-likelihood test statistic shown in Equation 6.14:

6.14      T = (null deviance) − (residual deviance) = −2 * logLnull − (−2 * logLfitted)

where T is approximately chi-squared distributed with p − 1 degrees of freedom and p is the number of parameters in the fitted model.

So, in a hypothesis test, a large value of T would indicate that the fitted model is significantly better than the null model that uses only the intercept term.

In the churn example, the log-likelihood ratio statistic would be this:

T = 8387.3 − 5359.2 = 3028.1 with 2 degrees of freedom and a corresponding p-value that is essentially zero.
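The p-value can be confirmed in R from the chi-squared distribution, in the same way pchisq() is used later in this section. A quick check, not from the original text:

# p-value for T = 3028.1 with 2 degrees of freedom
pchisq(3028.1, 2, lower.tail = FALSE)   # essentially zero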

So far, the log-likelihood ratio test discussion has focused on comparing a fitted model to the default model of using only the intercept. However, the log-likelihood ratio test can also compare one fitted model to another. For example, consider the logistic regression model when the categorical variable Married is included with Age and Churned_contacts in the list of input variables. The partial R output for such a model is provided here:

summary(Churn_logistic2)

Call:
glm(formula = Churned ~ Age + Married + Churned_contacts,
    family = binomial(link = "logit"), data = churn_input)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.472062   0.132107  26.282   <2e-16 ***
Age              -0.156635   0.004088 -38.318   <2e-16 ***
Married           0.066430   0.068299   0.973    0.331
Churned_contacts  0.381909   0.027302  13.988   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8387.3  on 7999  degrees of freedom
Residual deviance: 5358.3  on 7996  degrees of freedom

The residual deviances from each model can be used to perform a hypothesis test of H0: βMarried = 0 against HA: βMarried ≠ 0 using the base model that includes the Age and Churned_contacts variables. The test statistic follows:

                               T = 5359.2 − 5358.3 = 0.9 with 7997 − 7996 = 1 degree of freedom

Using R, the corresponding p-value is calculated as follows:

pchisq(.9 , 1, lower=FALSE)

 

[1] 0.3427817

Thus, at a 66% or higher confidence level, the null hypothesis, H0: βMarried = 0, would not be rejected. It therefore seems reasonable to exclude the variable Married from the logistic regression model.

In general, this log-likelihood ratio test is particularly useful for forward and backward step-wise methods to add variables to or remove them from the proposed logistic regression model.
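As an aside, R's built-in step() function automates this kind of backward elimination, although it ranks candidate models by AIC rather than by an explicit log-likelihood ratio test. A hedged sketch, not part of the original text:

# backward step-wise variable selection starting from the full model
Churn_step <- step(Churn_logistic1, direction = "backward")
summary(Churn_step)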

Receiver Operating Characteristic (ROC) Curve

Logistic regression is often used as a classifier to assign class labels to a person, item, or transaction based on the predicted probability provided by the model. In the Churn example, a customer can be classified with the label called Churn if the logistic model predicts a high probability that the customer will churn. Otherwise, a Remain label is assigned to the customer. Commonly, 0.5 is used as the default probability threshold to distinguish between any two class labels. However, any threshold value can be used depending on the preference to avoid false positives (for example, to predict Churn when actually the customer will Remain) or false negatives (for example, to predict Remain when the customer will actually Churn).

In general, for two class labels, C and ¬C, where “¬C” denotes “not C,” some working definitions and formulas follow:

• True Positive: predict C, when actually C

• True Negative: predict ¬C, when actually ¬C

• False Positive: predict C, when actually ¬C

• False Negative: predict ¬C, when actually C

The True Positive Rate (TPR) and False Positive Rate (FPR) are then defined as follows:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

where TP, FN, FP, and TN denote the counts of true positives, false negatives, false positives, and true negatives, respectively.
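For a single threshold, these rates can be computed directly from the model's predicted probabilities. The following sketch (not from the original text) uses the 0.5 threshold discussed later in this section:

# classify customers using a 0.5 probability threshold and tally the outcomes
pred.prob  <- predict(Churn_logistic3, type = "response")
pred.label <- as.integer(pred.prob >= 0.5)
actual     <- churn_input$Churned
TP <- sum(pred.label == 1 & actual == 1)
FN <- sum(pred.label == 0 & actual == 1)
FP <- sum(pred.label == 1 & actual == 0)
TN <- sum(pred.label == 0 & actual == 0)
TP / (TP + FN)   # True Positive Rate
FP / (FP + TN)   # False Positive Rate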
The plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) is known as the Receiver Operating Characteristic (ROC) curve. Using the ROCR package, the following R commands generate the ROC curve for the Churn example:

library(ROCR)
pred = predict(Churn_logistic3, type="response")
predObj = prediction(pred, churn_input$Churned)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
                          round(aucObj@y.values[[1]], 4)))

The usefulness of the plot in Figure 6.15 is that the preferred outcome of a classifier is to have a low FPR and a high TPR. So, when moving from left to right on the FPR axis, a good model/classifier has the TPR rapidly approach values near 1, with only a small change in FPR. The closer the ROC curve tracks along the vertical axis and approaches the upper-left corner of the plot, near the point (0,1), the better the model/classifier performs. Thus, a useful metric is to compute the area under the ROC curve (AUC). By examining the axes, it can be seen that the theoretical maximum for the area is 1.

 


Figure 6.15 ROC curve for the churn example

To illustrate how the FPR and TPR values are dependent on the threshold value used for the classifier, the plot in Figure 6.16 was constructed using the following R code:

# extract the alpha (threshold), FPR, and TPR values from rocObj
alpha <- round(as.numeric(unlist(rocObj@alpha.values)), 4)
fpr <- round(as.numeric(unlist(rocObj@x.values)), 4)
tpr <- round(as.numeric(unlist(rocObj@y.values)), 4)

# adjust margins and plot TPR and FPR
par(mar = c(5,5,2,5))
plot(alpha, tpr, xlab="Threshold", xlim=c(0,1),
     ylab="True positive rate", type="l")
par(new=TRUE)
plot(alpha, fpr, xlab="", ylab="", axes=FALSE, xlim=c(0,1), type="l")
axis(side=4)
mtext(side=4, line=3, "False positive rate")
text(0.18, 0.18, "FPR")
text(0.58, 0.58, "TPR")

 


Figure 6.16 The effect of the threshold value in the churn example

For a threshold value of 0, every item is classified as a positive outcome. Thus, the TPR value is 1. However, all the negatives are also classified as a positive, and the FPR value is also 1. As the threshold value increases, more and more negative class labels are assigned. Thus, the FPR and TPR values decrease. When the threshold reaches 1, no positive labels are assigned, and the FPR and TPR values are both 0.

For the purposes of a classifier, a commonly used threshold value is 0.5. A positive label is assigned for any probability of 0.5 or greater. Otherwise, a negative label is assigned. As the following R code illustrates, in the analysis of the Churn dataset, the 0.5 threshold corresponds to a TPR value of 0.56 and a FPR value of 0.08.

i <- which(round(alpha, 2) == .5)
paste("Threshold=", (alpha[i]), " TPR=", tpr[i], " FPR=", fpr[i])

[1] "Threshold= 0.5004 TPR= 0.5571 FPR= 0.0793"

Thus, 56% of customers who will churn are properly classified with the Churn label, and 8% of the customers who will remain as customers are improperly labeled as Churn. If identifying only 56% of the churners is not acceptable, then the threshold could be lowered. For example, suppose it was decided to classify with a Churn label any customer with a probability of churning greater than 0.15. Then the following R code indicates that the corresponding TPR and FPR values are 0.91 and 0.29, respectively. Thus, 91% of the customers who will churn are properly identified, but at a cost of misclassifying 29% of the customers who will remain.

i <- which(round(alpha, 2) == .15)
paste("Threshold=", (alpha[i]), " TPR=", tpr[i], " FPR=", fpr[i])

[1] "Threshold= 0.1543 TPR= 0.9116 FPR= 0.2869"
[2] "Threshold= 0.1518 TPR= 0.9122 FPR= 0.2875"
[3] "Threshold= 0.1479 TPR= 0.9145 FPR= 0.2942"
[4] "Threshold= 0.1455 TPR= 0.9174 FPR= 0.2981"

The ROC curve is useful for evaluating other classifiers and will be utilized again in Chapter 7, “Advanced Analytical Theory and Methods: Classification.”

Histogram of the Probabilities

It can be useful to visualize the observed responses against the estimated probabilities provided by the logistic regression. Figure 6.17 provides overlaying histograms for the customers who churned and for the customers who remained as customers. With a properly fitting logistic model, the customers who remained tend to have a low estimated probability of churning. Conversely, the customers who churned tend to have a high estimated probability of churning. This histogram plot helps visualize the number of items to be properly classified or misclassified. In the Churn example, an ideal histogram plot would have the remaining customers grouped at the left side of the plot, the customers who churned at the right side of the plot, and no overlap of these two groups.

 


Figure 6.17 Customer counts versus estimated churn probability
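A plot along the lines of Figure 6.17 can be produced with overlaid base R histograms. This is a minimal sketch (not the code that produced the original figure), reusing the pred probabilities computed earlier with predict():

# overlay histograms of estimated churn probability for the two observed outcomes
hist(pred[churn_input$Churned == 0], breaks = 20, col = rgb(0, 0, 1, 0.5),
     xlim = c(0, 1), xlab = "Estimated churn probability", main = "")
hist(pred[churn_input$Churned == 1], breaks = 20, col = rgb(1, 0, 0, 0.5), add = TRUE)
legend("topright", legend = c("Remained", "Churned"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))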

 

6.3 Reasons to Choose and Cautions

Linear regression is suitable when the input variables are continuous or discrete, including categorical data types, and the outcome variable is continuous. If the outcome variable is categorical, logistic regression is a better choice.

Both models assume a linear additive function of the input variables. If such an assumption does not hold true, both regression techniques perform poorly. Furthermore, in linear regression, the assumption of normally distributed error terms with a constant variance is important for many of the statistical inferences that can be considered. If the various assumptions do not appear to hold, the appropriate transformations need to be applied to the data.

Although a collection of input variables may be a good predictor for the outcome variable, the analyst should not infer that the input variables directly cause an outcome. For example, it may be identified that those individuals who have regular dentist visits may have a reduced risk of heart attacks. However, simply sending someone to the dentist almost certainly has no effect on that person’s chance of having a heart attack. It is possible that regular dentist visits may indicate a person’s overall health and dietary choices, which may have a more direct impact on a person’s health. This example illustrates the commonly known expression, “Correlation does not imply causation.”

Use caution when applying an already fitted model to data that falls outside the dataset used to train the model. The linear relationship in a regression model may no longer hold at values outside the training dataset. For example, if income was an input variable and the values of income ranged from $35,000 to $90,000, applying the model to incomes well outside that range could result in inaccurate estimates and predictions.

The income regression example in Section 6.1.2 mentioned the possibility of using categorical variables to represent the 50 U.S. states. In a linear regression model, the state of residence would provide a simple additive term to the income model but no other impact on the coefficients of the other input variables, such as Age and Education. However, if state does influence the other variables’ impact on the income model, an alternative approach would be to build 50 separate linear regression models: one model for each state. Such an approach is an example of the options and decisions that the data scientist must be willing to consider.
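A sketch of that per-state approach follows, assuming a hypothetical income_input data frame with Income, Age, Education, and State columns (these names are illustrative and not defined in this section):

# fit one linear regression model for each state (hypothetical income_input data frame)
state_models <- lapply(split(income_input, income_input$State),
                       function(d) lm(Income ~ Age + Education, data = d))
summary(state_models[["Wyoming"]])   # inspect the model for one state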

If several of the input variables are highly correlated to each other, the condition is known as multicollinearity. Multicollinearity can often lead to coefficient estimates that are relatively large in absolute magnitude and may be of inappropriate direction (negative or positive sign). When possible, the majority of these correlated variables should be removed from the model or replaced by a new variable that is a function of the correlated variables. For example, in a medical application of regression, height and weight may be considered important input variables, but these variables tend to be correlated. In this case, it may be useful to use the Body Mass Index (BMI), which is a function of a person’s height and weight.

    BMI = weight / height²  where weight is in kilograms and height is in meters.

However, in some cases it may be necessary to use the correlated variables. The next section provides some techniques to address highly correlated variables.
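Detecting the condition in the first place is straightforward: the pairwise correlation matrix of the candidate inputs can be examined, for example using the churn data already loaded. A small sketch, not from the original text:

# pairwise correlations among the numeric input variables
cor(churn_input[, c("Age", "Cust_years", "Churned_contacts")])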

 

6.4 Additional Regression Models

In the case of multicollinearity, it may make sense to place some restrictions on the magnitudes of the estimated coefficients. Ridge regression, which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. Lasso regression is a related modeling technique in which the penalty is proportional to the sum of the absolute values of the coefficients.
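One common way to fit these penalized models in R is the glmnet package, which is not used elsewhere in this chapter. A minimal sketch, reusing the churn data as the inputs, follows:

library(glmnet)   # assumes the glmnet package is installed
x <- as.matrix(churn_input[, c("Age", "Married", "Cust_years", "Churned_contacts")])
y <- churn_input$Churned
ridge_fit <- glmnet(x, y, family = "binomial", alpha = 0)   # alpha = 0: ridge penalty
lasso_fit <- glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1: lasso penalty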

Only binary outcome variables were examined in the use of logistic regression. If the outcome variable can assume more than two states, multinomial logistic regression can be used.
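For example, the multinom() function in the nnet package provides one such implementation, using the same formula interface as glm(). The data frame and variable names below are hypothetical:

library(nnet)   # assumes the nnet package is available
# hypothetical plan_data with a three-level outcome Plan_choice and inputs Age and Income
fit_multi <- multinom(Plan_choice ~ Age + Income, data = plan_data)
summary(fit_multi)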

 

Summary

This chapter discussed the use of linear regression and logistic regression to model historical data and to predict future outcomes. Using R, examples of each regression technique were presented. Several diagnostics to evaluate the models and the underlying assumptions were covered.

Although regression analysis is relatively straightforward to perform using many existing software packages, considerable care must be taken in performing and interpreting a regression analysis. This chapter highlighted that in a regression analysis, the data scientist needs to do the following:

• Determine the best input variables and their relationship to the outcome variable.

• Understand the underlying assumptions and their impact on the modeling results.

• Transform the variables, as appropriate, to achieve adherence to the model assumptions.

• Decide whether building one comprehensive model is the best choice or consider building many models on partitions of the data.

 

Exercises

1. In the Income linear regression example, consider the distribution of the outcome variable Income. Income values tend to be highly skewed to the right (distribution of value has a large tail to the right). Does such a non-normally distributed outcome variable violate the general assumption of a linear regression model? Provide supporting arguments.

2. In the use of a categorical variable with n possible values, explain the following:

      1. Why only n – 1 binary variables are necessary

     2. Why using n variables would be problematic

3. In the example of using Wyoming as the reference case, discuss the effect on the estimated model parameters, including the intercept, if another state was selected as the reference case.

4. Describe how logistic regression can be used as a classifier.

5. Discuss how the ROC curve can be used to determine an appropriate threshold value for a classifier.

6. If the probability of an event occurring is 0.4, then

      1. What is the odds ratio?

      2. What is the log odds ratio?

7. If β3 = −0.5 is an estimated coefficient in a logistic regression model, what is the effect on the odds ratio for every one unit increase in the value of X3?

