
Friday, March 11, 2022

Data Science and Big Data Analytics

 

Discovering, Analyzing, Visualizing and Presenting Data

Advanced Analytical Theory and Methods: Regression

Key Concepts

1. Categorical Variable

2. Linear Regression

3. Logistic Regression

4. Ordinary Least Squares (OLS)

5. Receiver Operating Characteristic (ROC) Curve

6. Residuals

In general, regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Often, the outcome variable is called a dependent variable because the outcome depends on the other variables. These additional variables are sometimes called the input variables or the independent variables. Regression analysis is useful for answering the following kinds of questions:

• What is a person’s expected income?  

• What is the probability that an applicant will default on a loan?

Linear regression is a useful tool for answering the first question, and logistic regression is a popular method for addressing the second. This chapter examines these two regression techniques and explains when one technique is more appropriate than the other.

Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome. With such knowledge and insight, environmental changes can be attempted to produce more favorable values of the input variables. For example, if it is found that the reading level of 10-year-old students is an excellent predictor of the students’ success in high school and a factor in their attending college, then additional emphasis on reading can be considered, implemented, and evaluated to improve students’ reading levels at a younger age.

 

6.1 Linear Regression

Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. A key assumption is that the relationship between an input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to properly transform the input or outcome variables to achieve a linear relationship between the modified input and outcome variables. Possible transformations will be covered in more detail later in the chapter.

The physical sciences have well-known linear models, such as Ohm’s Law, which states that the electrical current flowing through a resistive circuit is linearly proportional to the voltage applied to the circuit. Such a model is considered deterministic in the sense that if the input values are known, the value of the outcome variable is precisely determined. A linear regression model is a probabilistic one that accounts for the randomness that can affect any particular outcome. Based on known input values, a linear regression model provides the expected value of the outcome variable based on the values of the input variables, but some uncertainty may remain in predicting any particular outcome. Thus, linear regression models are useful in physical and social science applications where there may be considerable variation in a particular outcome based on a given set of input values. After presenting possible linear regression use cases, the foundations of linear regression modeling are provided.

6.1.1 Use Cases

Linear regression is often used in business, government, and other scenarios. Some common practical applications of linear regression in the real world include the following:

• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home’s living area. Such a model helps set or evaluate the list price of a home on the market. The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.

• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. For example, restaurant chains can appropriately prepare for the predicted type and quantity of food that customers will consume based upon the weather, the day of the week, whether an item is offered as a special, the time of day, and the reservation volume. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.

• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes. Input variables might include duration of a single radiation treatment, frequency of radiation treatment, and patient attributes such as age or weight.

6.1.2 Model Description

As the name of this technique suggests, the linear regression model assumes that there is a linear relationship between the input variables and the outcome variable. This relationship can be expressed as shown in Equation 6.1.

 

y = β0 + β1x1 + β2x2 +…+ βp-1xp-1 + ϵ

 

where:

1. y is the outcome variable

2. xj are the input variables, for j = 1, 2, …, p – 1

3. β0 is the value of y when each xj equals zero

4. βj is the change in y based on a unit change in xj, for j = 1, 2, …, p – 1

5. ϵ  is a random error term that represents the difference in the linear model and a particular observed value for y

Suppose it is desired to build a linear regression model that estimates a person’s annual income as a function of two variables—age and education—both expressed in years. In this case, income is the outcome variable, and the input variables are age and education. Although it may be an overgeneralization, such a model seems intuitively correct in the sense that people’s income should increase as their skill set and experience expand with age. Also, the employment opportunities and starting salaries would be expected to be greater for those who have attained more education.

However, it is also obvious that there is considerable variation in income levels for a group of people with identical ages and years of education. This variation is represented by ϵ in the model. So, in this example, the model would be expressed as shown in Equation 6.2.

 

Income = β0 + β1Age + β2Education + ϵ

 

In the linear model, the βj's represent the p unknown parameters. The estimates for these unknown parameters are chosen so that, on average, the model provides a reasonable estimate of a person’s income based on age and education. In other words, the fitted model should minimize the overall error between the linear model and the actual observations. Ordinary Least Squares (OLS) is a common technique to estimate the parameters.

To illustrate how OLS works, suppose there is only one input variable, x, for an outcome variable y. Furthermore, n observations of (x,y) are obtained and plotted in Figure 6.1.

 


Figure 6.1 Scatterplot of y versus x

The goal is to find the line that best approximates the relationship between the outcome variable and the input variables. With OLS, the objective is to find the line through these points that minimizes the sum of the squares of the difference between each point and the line in the vertical direction. In other words, find the values of β0 and β1 such that the summation shown in Equation 6.3 is minimized:

∑ [yi – (β0 + β1xi)]2, where the sum is taken over the observations i = 1, 2, …, n
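For the one-input case, the values of β0 and β1 that minimize Equation 6.3 have a well-known closed form. The following minimal sketch in R illustrates the calculation on simulated data (the variables x and y and the chosen coefficients are illustrative and not part of the income dataset used later):

# Sketch: closed-form OLS estimates for one input variable
set.seed(1)
x <- runif(100, 0, 10)
y <- 3 + 2*x + rnorm(100, sd = 2)
b1 <- sum( (x - mean(x)) * (y - mean(y)) ) / sum( (x - mean(x))^2 )
b0 <- mean(y) - b1*mean(x)
c(b0, b1)          # closed-form estimates
coef( lm(y ~ x) )  # lm() returns the same estimates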

 

The n individual distances to be squared and then summed are illustrated in Figure 6.2. The vertical lines represent the distance between each observed y value and the line y = β0 + β1x.

 


Figure 6.2 Scatterplot of y versus x with vertical distances from the observed points to a fitted line

In Figure 3.7 of Chapter 3, “Review of Basic Data Analytic Methods Using R,” the Anscombe’s Quartet example used OLS to fit the linear regression line to each of the four datasets. OLS for multiple input variables is a straightforward extension of the one input variable case provided in Equation 6.3.

The preceding discussion provided the approach to find the best linear fit to a set of observations. However, by making some additional assumptions on the error term, it is possible to provide further capabilities in utilizing the linear regression model. In general, these assumptions are almost always made, so the following model, built upon the earlier described model, is simply called the linear regression model.

Linear Regression Model (with Normally Distributed Errors)

In the previous model description, there were no assumptions made about the error term; no additional assumptions were necessary for OLS to provide estimates of the model parameters. However, in most linear regression analyses, it is common to assume that the error term is a normally distributed random variable with mean equal to zero and constant variance. Thus, the linear regression model is expressed as shown in Equation 6.4.

 

y = β0 + β1x1 + β2x2 +…+ βp-1xp-1 + ϵ

 

where:

1. y  is the outcome variable

2. xj are the input variables, for j = 1, 2, …, p – 1

3. β0 is the value of y when each xj equals zero

4. βj is the change in y based on a unit change in xj, for j = 1, 2, …, p – 1

5. ϵ ~ N(0, σ2) and the error terms are independent of each other

This additional assumption yields the following result about the expected value of y, E(y), for given (x1, x2, … xp-1):

E(y) = E(β0 + β1x1 + … + βp-1xp-1 + ϵ) = β0 + β1x1 + … + βp-1xp-1

because the expected value of the error term is zero.

 



Figure 6.3 Normal distribution about y for a given value of x

For the value of x illustrated in Figure 6.3, one would expect to observe a value of y near 20, but a value of y from 15 to 25 would appear possible based on the illustrated normal distribution. Thus, the regression model estimates the expected value of y for the given value of x. Additionally, the normality assumption on the error term provides some useful properties that can be utilized in performing hypothesis testing on the linear regression model and providing confidence intervals on the parameters and the mean of y given (x1, x2, … xp-1). The application of these statistical techniques is demonstrated by applying R to the earlier linear regression model on income.
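Before turning to the income data, the probabilistic form of the model can be illustrated by simulating observations about a known regression line; the coefficients, error standard deviation, and variable names in this sketch are illustrative only:

# Sketch: simulate y = β0 + β1*x + ϵ with normally distributed errors
set.seed(42)
x <- seq(1, 10, length.out = 200)
eps <- rnorm(length(x), mean = 0, sd = 2)   # ϵ ~ N(0, σ^2) with σ = 2
y <- 5 + 1.5*x + eps                        # illustrative values β0 = 5, β1 = 1.5
plot(x, y)
abline(5, 1.5)                              # the line E(y) = 5 + 1.5x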

Example in R

Returning to the Income example, in addition to the variables age and education, the person’s gender, female or male, is considered an input variable. The following code reads a comma-separated-value (CSV) file of 1,500 people’s incomes, ages, years of education, and gender. The first 10 rows are displayed:

income_input = as.data.frame( read.csv("c:/data/income.csv") )

income_input[1:10,]

ID Income Age Education Gender

1 1 113 69 12 1

2 2   91 52 18 0

3 3 121 65 14 0

4 4   81 58 12 0

5 5   68 31 16 1

6 6   92 51 15 1

7 7   75 53 15 0

8 8   76 56 13 0

9 9   56 42 15 1

10  10 53 33 11 1

Each person in the sample has been assigned an identification number, ID. Income is expressed in thousands of dollars. (For example, 113 denotes $113,000.) As described earlier, Age and Education are expressed in years. For Gender, a 0 denotes female and a 1 denotes male. A summary of the imported data reveals that the incomes vary from $14,000 to $134,000. The ages are between 18 and 70 years. The education experience for each person varies from a minimum of 10 years to a maximum of 20 years.

summary(income_input)

   ID Income Age Education

Min. : 1.0 Min. : 14.00 Min. :18.00 Min. :10.00

1st Qu.: 375.8 1st Qu.: 62.00 1st Qu.:30.00 1st Qu.:12.00

Median : 750.5 Median : 76.00 Median :44.00 Median :15.00

Mean : 750.5 Mean : 75.99 Mean :43.58 Mean :14.68

3rd Qu.:1125.2 3rd Qu.: 91.00 3rd Qu.:57.00 3rd Qu.:16.00

Max. :1500.0 Max. :134.00 Max. :70.00 Max. :20.00

   Gender

Min. :0.00

1st Qu.:0.00

Median :0.00

Mean :0.49

3rd Qu.:1.00

Max. :1.00

As described in Chapter 3, a scatterplot matrix is an informative tool to view the pair-wise relationships of the variables. The basic assumption of a linear regression model is that there is a linear relationship between the outcome variable and the input variables. Using the lattice package in R, the scatterplot matrix in Figure 6.4 is generated with the following R code:

library(lattice)

splom(~income_input[c(2:5)], groups=NULL, data=income_input,

     axis.line.tck = 0,

     axis.text.alpha = 0)


Figure 6.4 Scatterplot matrix of the variables

Because the dependent variable is typically plotted along the y-axis, examine the set of scatterplots along the bottom of the matrix. A strong positive linear trend is observed for Income as a function of Age. Against Education, a slight positive trend may exist, but the trend is not quite as obvious as is the case with the Age variable. Lastly, there is no observed effect on Income based on Gender.

With this qualitative understanding of the relationships between Income and the input variables, it seems reasonable to quantitatively evaluate the linear relationships of these variables. Utilizing the normality assumption applied to the error term, the proposed linear regression model is shown in Equation 6.5.

 

Income = β0 + β1Age + β2 Education + β3 Gender + ϵ

 

Using the linear model function, lm(), in R, the income model can be applied to the data as follows:

results <- lm(Income ~ Age + Education + Gender, income_input)

summary(results)

Call:

lm(formula = Income ~ Age + Education + Gender, data = income_input)

Residuals:

Min 1Q Median 3Q Max

-37.340 -8.101 0.139 7.885 37.271

Coefficients:

        Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.26299 1.95575 3.714 0.000212 ***

Age 0.99520 0.02057 48.373 < 2e-16 ***

Education 1.75788 0.11581 15.179 < 2e-16 ***

Gender -0.93433 0.62388 -1.498 0.134443

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.07 on 1496 degrees of freedom

Multiple R-squared: 0.6364, Adjusted R-squared: 0.6357

F-statistic: 873 on 3 and 1496 DF, p-value: < 2.2e-16

The intercept term, β0, is implicitly included in the model. The lm() function performs the parameter estimation for the parameters βj(j = 0, 1, 2, 3) using ordinary least squares and provides several useful calculations and results that are stored in the variable called results in this example.

After the stated call to lm(), a few statistics on the residuals are displayed in the output. The residuals are the observed values of the error term for each of the n observations and are defined for i = 1, 2, …n, as shown in Equation 6.6.

ei = yi – (b0 + b1xi,1 + b2xi,2 + … + bp-1xi,p-1)

where bj denotes the estimate for parameter βj for j = 0, 1, 2, … p – 1

From the R output, the residuals vary from approximately –37 to +37, with a median close to 0. Recall that the residuals are assumed to be normally distributed with a mean near zero and a constant variance. The normality assumption is examined more carefully later.

The output provides details about the coefficients. The column Estimate provides the OLS estimates of the coefficients in the fitted linear regression model. In general, the (Intercept) corresponds to the estimated response variable when all the input variables equal zero. In this example, the intercept corresponds to an estimated income of $7,263 for a newborn female with no education. It is important to note that the available dataset does not include such a person. The minimum age and education in the dataset are 18 and 10 years, respectively. Thus, misleading results may be obtained when using a linear regression model to estimate outcomes for input values not representative within the dataset used to train the model.

The coefficient for Age is approximately equal to one. This coefficient is interpreted as follows: For every one unit increase in a person’s age, the person’s income is expected to increase by $995. Similarly, for every unit increase in a person’s years of education, the person’s income is expected to increase by about $1,758.

Interpreting the Gender coefficient is slightly different. When Gender is equal to zero, the Gender coefficient contributes nothing to the estimate of the expected income. When Gender is equal to one, the expected Income is decreased by about $934.
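As a quick check on these interpretations, the fitted equation can be applied by hand. For example, using the estimated coefficients above, the expected income of a 41-year-old male with 12 years of education (an illustrative case) is computed as follows:

# Expected income (in $1,000s) for Age = 41, Education = 12, Gender = 1
7.26299 + 0.99520*41 + 1.75788*12 - 0.93433*1
# [1] 68.22642

That is, the fitted model estimates an expected income of roughly $68,226 for such a person.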

Because the coefficient values are only estimates based on the observed incomes in the sample, there is some uncertainty or sampling error for the coefficient estimates. The Std. Error column next to the coefficients provides the sampling error associated with each coefficient and can be used to perform a hypothesis test, using the t-distribution, to determine if each coefficient is statistically different from zero. In other words, if a coefficient is not statistically different from zero, the coefficient and the associated variable in the model should be excluded from the model. In this example, the associated hypothesis tests’ p-values, Pr(>|t|), are very small for the Intercept, Age, and Education parameters. As seen in Chapter 3, a small p-value corresponds to a small probability that such a large t value would be observed under the assumptions of the null hypothesis. In this case, for a given j = 0, 1, 2, …, p – 1, the null and alternate hypotheses follow:

 

H0 : βj = 0  versus  HA : βj ≠ 0

 

For small p-values, as is the case for the Intercept, Age, and Education parameters, the null hypothesis would be rejected. For the Gender parameter, the corresponding p-value is fairly large at 0.13. In other words, at a 90% confidence level, the null hypothesis would not be rejected. So, dropping the variable Gender from the linear regression model should be considered. The following R code provides the modified model results:

results2 <- lm(Income ~ Age + Education, income_input)

summary(results2)

Call:

lm(formula = Income ~ Age + Education, data = income_input)

Residuals:

Min 1Q Median 3Q Max

-36.889 -7.892 0.185 8.200 37.740

Coefficients:

        Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.75822 1.92728 3.507 0.000467 ***

Age 0.99603 0.02057 48.412 < 2e-16 ***

Education 1.75860 0.11586 15.179 < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.08 on 1497 degrees of freedom

Multiple R-squared: 0.6359, Adjusted R-squared: 0.6354

F-statistic: 1307 on 2 and 1497 DF, p-value: < 2.2e-16

Dropping the Gender variable from the model resulted in a minimal change to the estimates of the remaining parameters and their statistical significances.

The last part of the displayed results provides some summary statistics and tests on the linear regression model. The residual standard error is the standard deviation of the observed residuals. This value, along with the associated degrees of freedom, can be used to examine the variance of the assumed normally distributed error terms. R-squared (R²) is a commonly reported metric that measures the variation in the data that is explained by the regression model. Possible values of R² vary from 0 to 1, with values closer to 1 indicating that the model is better at explaining the data than values closer to 0. An R² of exactly 1 indicates that the model explains perfectly the observed data (all the residuals are equal to 0). In general, the R² value can be increased by adding more variables to the model. However, just adding more variables to explain a given dataset but not to improve the explanatory nature of the model is known as overfitting. To address the possibility of overfitting the data, the adjusted R² accounts for the number of parameters included in the linear regression model.
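To make the R² definition concrete, it can be recomputed directly from the residuals; a short sketch, assuming the results2 object and income_input data frame created earlier are still available:

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
SSE <- sum(results2$residuals^2)
SST <- sum( (income_input$Income - mean(income_input$Income))^2 )
1 - SSE/SST   # approximately 0.6359, matching the summary() output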

The F-statistic provides a method for testing the entire regression model. In the previous t-tests, individual tests were conducted to determine the statistical significance of each parameter. The provided F-statistic and corresponding p-value enable the analyst to test the following hypotheses:

H0 : β1 = β2 = … = βp-1 = 0  versus  HA : βj ≠ 0 for at least one j = 1, 2, … p – 1

In this example, the p-value of 2.2e-16 is small, which indicates that the null hypothesis should be rejected.
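Equivalently, the overall F-test can be viewed as comparing the fitted model against an intercept-only model; a sketch using R's anova() function on the objects created earlier:

# Compare the intercept-only model with the fitted model; the reported
# F statistic and p-value correspond to the overall F-test in summary(results2)
null_model <- lm(Income ~ 1, data = income_input)
anova(null_model, results2)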

Categorical Variables

In the previous example, the variable Gender was a simple binary variable that indicated whether a person is female or male. In general, these variables are known as categorical variables. To illustrate how to use categorical variables properly, suppose it was decided in the earlier Income example to include an additional variable, State, to represent the U.S. state where the person resides. Similar to the use of the Gender variable, one possible, but incorrect, approach would be to include a State variable that would take a value of 0 for Alabama, 1 for Alaska, 2 for Arizona, and so on. The problem with this approach is that such a numeric assignment based on an alphabetical ordering of the states does not provide a meaningful measure of the difference in the states. For example, is it useful or proper to consider Arizona to be one unit greater than Alaska and two units greater than Alabama?

In regression, a proper way to implement a categorical variable that can take on m different values is to add m-1 binary variables to the regression model. To illustrate with the Income example, a binary variable for each of 49 states, excluding Wyoming (arbitrarily chosen as the last of 50 states in an alphabetically sorted list), could be added to the model.

results3 <- lm(Income ~ Age + Education +
               Alabama +
               Alaska +
               Arizona +
               .
               .
               .
               WestVirginia +
               Wisconsin,
               income_input)

The input file would have 49 columns added for these variables representing each of the first 49 states. If a person was from Alabama, the Alabama variable would be equal to 1, and the other 48 variables would be set to 0. This process would be applied for the other state variables. So, a person from Wyoming, the one state not explicitly stated in the model, would be identified by setting all 49 state variables equal to 0. In this representation, Wyoming would be considered the reference case, and the regression coefficients of the other state variables would represent the difference in income between Wyoming and a particular state.
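Rather than constructing the 49 binary columns by hand, the same kind of model can be obtained by storing the state as a single factor variable and letting R generate the m – 1 binary variables automatically. A minimal sketch, assuming a character column named State were present in the input data (it is not part of the income.csv file described earlier):

# Sketch: R creates the m-1 dummy variables for a factor automatically.
# The first level is the reference case; relevel() makes Wyoming the reference.
income_input$State <- factor(income_input$State)
income_input$State <- relevel(income_input$State, ref = "Wyoming")
results3 <- lm(Income ~ Age + Education + State, data = income_input)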

Confidence Intervals on the Parameters

Once an acceptable linear regression model is developed, it is often helpful to use it to draw some inferences about the model and the population from which the observations were drawn. Earlier, we saw that t-tests could be used to perform hypothesis tests on the individual model parameters, βj, j = 0, 1, …, p – 1. Alternatively, these t-tests could be expressed in terms of confidence intervals on the parameters. R simplifies the computation of confidence intervals on the parameters with the use of the confint() function. From the Income example, the following R command provides 95% confidence intervals on the intercept and the coefficients for the two variables, Age and Education.

confint(results2, level = .95)

 

2.5 % 97.5 %

(Intercept) 2.9777598 10.538690

Age 0.9556771 1.036392

Education 1.5313393 1.985862

Based on the data, the earlier estimated value of the Education coefficient was 1.76. Using confint(), the corresponding 95% confidence interval is (1.53, 1.99), which provides the amount of uncertainty in the estimate. In other words, in repeated random sampling, the computed confidence interval straddles the true but unknown coefficient 95% of the time. As expected from the earlier t-test results, none of these confidence intervals straddles zero.
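These intervals can also be computed directly from the coefficient estimates and standard errors using the t-distribution. For example, for the Education coefficient (a sketch based on the summary output shown earlier):

# 95% confidence interval: estimate ± t(0.975, df = 1497) * standard error
1.75860 + c(-1, 1) * qt(0.975, df = 1497) * 0.11586
# approximately (1.53, 1.99), matching confint()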

Confidence Interval on the Expected Outcome

In addition to obtaining confidence intervals on the model parameters, it is often desirable to obtain a confidence interval on the expected outcome. In the Income example, the fitted linear regression provides the expected income for a given Age and Education. However, that particular point estimate does not provide information on the amount of uncertainty in that estimate. Using the predict() function in R, a confidence interval on the expected outcome can be obtained for a given set of input variable values.

In this illustration, a data frame is built containing a specific age and education value. Using this set of input variable values, the predict() function provides a 95% confidence interval on the expected Income for a 41-year-old person with 12 years of education.

Age <- 41

Education <- 12

new_pt <- data.frame(Age, Education)

conf_int_pt <- predict(results2, new_pt, level=.95, interval="confidence")

conf_int_pt

 

   fit lwr upr

1 68.69884 67.83102 69.56667

For this set of input values, the expected income is $68,699 with a 95% confidence interval of ($67,831, $69,567).

Prediction Interval on a Particular Outcome

The previous confidence interval was relatively close (+/– approximately $900) to the fitted value. However, this confidence interval should not be considered as representing the uncertainty in estimating a particular person’s income. The predict() function in R also provides the ability to calculate upper and lower bounds on a particular outcome. Such bounds provide what are referred to as prediction intervals. Returning to the Income example, in R the 95% prediction interval on the Income for a 41-year-old person with 12 years of education is obtained as follows:

pred_int_pt <- predict(results2, new_pt, level=.95, interval="prediction")

pred_int_pt

 

  fit lwr upr

1 68.69884 44.98867 92.40902

Again, the expected income is $68,699. However, the 95% prediction interval is ($44,988, $92,409). If the reason for this much wider interval is not obvious, recall that in Figure 6.3, for a particular input variable value, the expected outcome falls on the regression line, but the individual observations are normally distributed about the expected outcome. The confidence interval applies to the expected outcome that falls on the regression line, but the prediction interval applies to an outcome that may appear anywhere within the normal distribution.

Thus, in linear regression, confidence intervals are used to draw inferences on the population’s expected outcome, and prediction intervals are used to draw inferences on the next possible outcome.
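The wider prediction interval arises because the uncertainty about a single new observation combines the uncertainty in the estimated mean with the residual variance of individual observations. A sketch of this relationship using the objects created earlier:

# Standard error of the fitted mean at the new point
fit_se <- predict(results2, new_pt, se.fit = TRUE)$se.fit
# Residual standard error (the estimate of sigma)
sigma_hat <- summary(results2)$sigma
# Approximate standard error for predicting a single new observation
sqrt(fit_se^2 + sigma_hat^2)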

6.1.3 Diagnostics

The use of hypothesis tests, confidence intervals, and prediction intervals is dependent on the model assumptions being true. The following discussion provides some tools and techniques that can be used to validate a fitted linear regression model.

Evaluating the Linearity Assumption

A major assumption in linear regression modeling is that the relationship between the input variables and the outcome variable is linear. The most fundamental way to evaluate such a relationship is to plot the outcome variable against each input variable. In the Income example, such scatterplots were generated in Figure 6.4. If the relationship between Age and Income is represented as illustrated in Figure 6.5, a linear model would not apply. In such a case, it is often useful to do any of the following:

• Transform the outcome variable.

• Transform the input variables.

• Add extra input variables or terms to the regression model.



Figure 6.5 Income as a quadratic function of Age

Common transformations include taking square roots or the logarithm of the variables. Another option is to create a new input variable such as the age squared and add it to the linear regression model to fit a quadratic relationship between an input variable and the outcome.
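For example, a quadratic relationship such as the one in Figure 6.5 could be accommodated by adding a squared term to the model; a sketch using R's I() notation, with results_quad as an illustrative object name:

# Sketch: fit Income as a quadratic function of Age
# I() protects the arithmetic expression inside the model formula
results_quad <- lm(Income ~ Age + I(Age^2) + Education, data = income_input)
summary(results_quad)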

Additional use of transformations will be considered when evaluating the residuals.

Evaluating the Residuals

As stated previously, it is assumed that the error terms in the linear regression model are normally distributed with a mean of zero and a constant variance. If this assumption does not hold, the various inferences that were made with the hypothesis tests, confidence intervals, and prediction intervals are suspect.

To check for constant variance across all y values along the regression line, use a simple plot of the residuals against the fitted outcome values. Recall that the residuals are the difference between the observed outcome variables and the fitted value based on the OLS parameter estimates. Because of the importance of examining the residuals, the lm() function in R automatically calculates and stores the fitted values and the residuals, in the components fitted.values and residuals in the output of the lm() function. Using the Income regression model output stored in results2, Figure 6.6 was generated with the following R code:

with(results2, {
     plot(fitted.values, residuals, ylim=c(-40,40))
     points(c(min(fitted.values), max(fitted.values)),
            c(0,0), type = "l")
})

 


Figure 6.6 Residual plot indicating constant variance

The plot in Figure 6.6 indicates that regardless of income value along the fitted linear regression model, the residuals are observed somewhat evenly on both sides of the reference zero line, and the spread of the residuals is fairly constant from one fitted value to the next. Such a plot would support the mean of zero and the constant variance assumptions on the error terms.

If the residual plot appeared like any of those in Figures 6.7 through 6.10, then some of the earlier discussed transformations or possible input variable additions should be considered and attempted. Figure 6.7 illustrates the existence of a nonlinear trend in the residuals. Figure 6.8 illustrates that the residuals are not centered on zero. Figure 6.9 indicates a linear trend in the residuals across the various outcomes along the linear regression model. This plot may indicate a missing variable or term from the regression model. Figure 6.10 provides an example in which the variance of the error terms is not a constant but increases along the fitted linear regression model.

 


Figure 6.7 Residuals with a nonlinear trend

 


Figure 6.8 Residuals not centered on the zero line

 


Figure 6.9 Residuals with a linear trend

 


Figure 6.10 Residuals with nonconstant variance

Evaluating the Normality Assumption

The residual plots are useful for confirming that the residuals were centered on zero and have a constant variance. However, the normality assumption still has to be validated. As shown in Figure 6.11, the following R code provides a histogram plot of the residuals from results2, the output from the Income example:

hist(results2$residuals, main="")

 


Figure 6.11 Histogram of normally distributed residuals

From the histogram, it is seen that the residuals are centered on zero and appear to be symmetric about zero, as one would expect for a normally distributed random variable. Another option is to examine a Q-Q plot that compares the observed data against the quantiles (Q) of the assumed distribution. In R, the following code generates the Q-Q plot shown in Figure 6.12 for the residuals from the Income example and provides the line that the points should follow for values from a normal distribution.

qqnorm(results2$residuals, ylab="Residuals", main="")
qqline(results2$residuals)

 


Figure 6.12 Q-Q plot of normally distributed residuals

A Q-Q plot as provided in Figure 6.13 would indicate that additional refinement of the model is required to achieve normally distributed error terms.

 



Figure 6.13 Q-Q plot of non-normally distributed residuals

N-Fold Cross-Validation

To prevent overfitting a given dataset, a common practice is to randomly split the entire dataset into a training set and a testing set. Once the model is developed on the training set, the model is evaluated against the testing set. When there is not enough data to create training and testing sets, an N-fold cross-validation technique may be helpful to compare one fitted model against another. In N-fold cross-validation, the following occurs:

• The entire dataset is randomly split into N datasets of approximately equal size.

• A model is trained against N – 1 of these datasets and tested against the remaining dataset. A measure of the model error is obtained.

• This process is repeated a total of N times across the various combinations of N datasets taken N – 1 at a time. Recall that the number of such combinations is

C(N, N – 1) = N! / [(N – 1)! 1!] = N

 


• The observed N model errors are averaged over the N folds.

The averaged error from one model is compared against the averaged error from another model. This technique can also help determine whether adding more variables to an existing model is beneficial or possibly overfitting the data.
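A minimal sketch of N-fold cross-validation in base R, comparing the average squared prediction error of the two income models fitted earlier (the choice of N = 5 and the use of mean squared error are illustrative):

# Sketch: N-fold cross-validation comparing two linear regression models
set.seed(123)
N <- 5
folds <- sample( rep(1:N, length.out = nrow(income_input)) )
cv_mse <- function(formula) {
  errs <- sapply(1:N, function(k) {
    train <- income_input[folds != k, ]
    test  <- income_input[folds == k, ]
    fit   <- lm(formula, data = train)
    mean( (test$Income - predict(fit, test))^2 )
  })
  mean(errs)   # average error over the N folds
}
cv_mse(Income ~ Age + Education + Gender)
cv_mse(Income ~ Age + Education)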

Other Diagnostic Considerations

Although a fitted linear regression model conforms with the preceding diagnostic criteria, it is possible to improve the model by including additional input variables not yet considered. In the previous Income example, only three possible input variables—Age, Education, and Gender—were considered. Dozens of other additional input variables such as Housing or Marital_Status may improve the fitted model. It is important to consider all possible input variables early in the analytic process.

As mentioned earlier, in reviewing the R output from fitting a linear regression model, the adjusted R² applies a penalty to the R² value based on the number of parameters added to the model. Because the R² value will always move closer to one as more variables are added to an existing regression model, the adjusted R² value may actually decrease after adding more variables.
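The adjustment can be verified from the reported values; with n = 1,500 observations and k = 2 input variables in the final model, a sketch of the standard formula gives:

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n <- 1500; k <- 2; r2 <- 0.6359
1 - (1 - r2) * (n - 1) / (n - k - 1)
# approximately 0.6354, matching the summary() output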

The residual plots should be examined for any outliers, observed points that are markedly different from the majority of the points. Outliers can result from bad data collection, data processing errors, or an actual rare occurrence. In the Income example, suppose that an individual with an income of a million dollars was included in the dataset. Such an observation could affect the fitted regression model, as seen in one of the examples of Anscombe’s Quartet.

Finally, the magnitudes and signs of the estimated parameters should be examined to see if they make sense. For example, suppose a negative coefficient for the Education variable in the Income example was obtained. Because it is natural to assume that more years of education lead to higher incomes, either something very unexpected has been discovered, or there is some issue with the model, how the data was collected, or some other factor. In either case, further investigation is warranted.

