Industries Needs: Data Science and Big Data Analytics

Saturday, March 5, 2022

Data Science and Big Data Analytics

 

Discovering, Analyzing, Visualizing and Presenting Data

 

Review of Basic Data Analytic Methods Using R

3.3 Statistical Methods for Evaluation

Visualization is useful for data exploration and presentation, but statistics is crucial because it may exist throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the initial data exploration and data preparation, model building, evaluation of the final models, and assessment of how the new models improve the situation when deployed in the field. In particular, statistics can help answer the following questions for data analytics:

• Model Building and Planning

     ◦ What are the best input variables for the model?

     ◦ Can the model predict the outcome given the input?

• Model Evaluation

    ◦ Is the model accurate?

    ◦ Does the model perform better than an obvious guess?

    ◦ Does the model perform better than another candidate model?

• Model Deployment

    ◦ Is the prediction sound?

    ◦ Does the model have the desired effect (such as reducing the cost)?

This section discusses some useful statistical tools that may answer these questions.

3.3.1 Hypothesis Testing

When comparing populations, such as testing or evaluating the difference of the means from two samples of data (Figure 3.22), a common technique to assess the difference or the significance of the difference is hypothesis testing.

 


Figure 3.22 Distributions of two samples of data

The basic concept of hypothesis testing is to form an assertion and test it with data. When performing hypothesis tests, the common assumption is that there is no difference between two samples. This assumption is used as the default position for building the test or conducting a scientific experiment. Statisticians refer to this as the null hypothesis (H0). The alternative hypothesis (HA) is that there is a difference between two samples. For example, if the task is to identify the effect of drug A compared to drug B on patients, the null hypothesis and alternative hypothesis would be as follows.

H0: Drug A and drug B have the same effect on patients.

HA: Drug A has a greater effect than drug B on patients.

If the task is to identify whether advertising Campaign C is effective on reducing customer churn, the null hypothesis and alternative hypothesis would be as follows.

H0: Campaign C does not reduce customer churn better than the current campaign method.

HA: Campaign C does reduce customer churn better than the current campaign.

It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely to undermine the subsequent steps of the hypothesis testing process. A hypothesis test leads to either rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis.

Table 3.5 includes some examples of null and alternative hypotheses that may be formulated during the analytics lifecycle.

Table 3.5 Example Null Hypotheses and Alternative Hypotheses

 

Once a model is built over the training data, it needs to be evaluated over the testing data to see if the proposed model predicts better than the existing model currently being used. The null hypothesis is that the proposed model does not predict better than the existing model. The alternative hypothesis is that the proposed model indeed predicts better than the existing model. In a sales forecast, for instance, the null model could be that the sales of the next month are the same as the prior month, and the hypothesis test needs to evaluate whether the proposed model provides a better prediction. Take a recommendation engine as another example. The null hypothesis could be that the new algorithm does not produce better recommendations than the current algorithm being deployed. The alternative hypothesis is that the new algorithm produces better recommendations than the old algorithm.

When evaluating a model, sometimes it needs to be determined if a given input variable improves the model. In regression analysis (Chapter 6), for example, this is the same as asking if the regression coefficient for a variable is zero. The null hypothesis is that the coefficient is zero, which means the variable does not have an impact on the outcome. The alternative hypothesis is that the coefficient is nonzero, which means the variable does have an impact on the outcome.
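This coefficient test is reported automatically by R's regression summary. A minimal sketch with hypothetical simulated data (regression itself is covered in Chapter 6):

```r
# hypothetical data: y depends on x plus noise
x <- rnorm(100)
y <- 3 + 2*x + rnorm(100)

fit <- lm(y ~ x)   # fit a simple linear regression
summary(fit)       # the Pr(>|t|) column gives, for each coefficient,
                   # the p-value for the null hypothesis that it is zero
```

A small p-value for a coefficient indicates that the null hypothesis (coefficient equals zero) should be rejected, meaning the variable has an impact on the outcome.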

A common hypothesis test is to compare the means of two populations. Two such hypothesis tests are discussed in Section 3.3.2.

3.3.2 Difference of Means

Hypothesis testing is a common approach to draw inferences on whether or not two populations, denoted pop1 and pop2, are different from each other. This section provides two hypothesis tests to compare the means of the respective populations based on samples randomly drawn from each population. Specifically, the two hypothesis tests in this section consider the following null and alternative hypotheses.

H0: µ1 = µ2

HA: µ1 ≠ µ2

The µ1 and µ2 denote the population means of pop1 and pop2, respectively.

The basic testing approach is to compare the observed sample means, X̄1 and X̄2, corresponding to each population. If the values of X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and X̄2 overlap substantially (Figure 3.23), and the null hypothesis is supported. A large observed difference between the sample means indicates that the null hypothesis should be rejected. Formally, the difference in means can be tested using Student’s t-test or Welch’s t-test.

 


Student’s t-test

Student’s t-test assumes that distributions of the two populations have equal but unknown variances. Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2, respectively. If each population is normally distributed with the same mean (µ1 = µ2) and with the same variance, then T (the t-statistic), given in Equation 3.1, follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).

T = (X̄1 − X̄2) / (Sp · √(1/n1 + 1/n2))     (3.1)

where Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2) is the pooled sample variance, and S1² and S2² denote the sample variances of the two samples.

The shape of the t-distribution is similar to the normal distribution. In fact, as the degrees of freedom approach 30 or more, the t-distribution is nearly identical to the normal distribution. Because the numerator of T is the difference of the sample means, if the observed value of T is far enough from zero that observing such a value would be unlikely, one would reject the null hypothesis that the population means are equal. Thus, for a small probability, say α = 0.05, T* is determined such that P(|T| ≥ T*) = 0.05. After the samples are collected and the observed value of T is calculated according to Equation 3.1, the null hypothesis (µ1 = µ2) is rejected if |T| ≥ T*. In hypothesis testing, in general, the small probability α is known as the significance level of the test. The significance level of the test is the probability of rejecting the null hypothesis when the null hypothesis is actually TRUE. In other words, for α = 0.05, if the means from the two populations are truly equal, then in repeated random sampling the observed magnitude of T would exceed T* only 5% of the time.

In the following R code example, 10 and 20 observations are randomly selected from two normally distributed populations and assigned to the variables x and y, respectively. The two populations have means of 100 and 105, respectively, and a standard deviation equal to 5. Student’s t-test is then conducted to determine if the obtained random samples support the rejection of the null hypothesis.

# generate random observations from the two populations

x <- rnorm(10, mean=100, sd=5) # normal distribution centered at 100

y <- rnorm(20, mean=105, sd=5) # normal distribution centered at 105

t.test(x, y, var.equal=TRUE) # run the Student’s t-test

Two Sample t-test

data: x and y

t = -1.7828, df = 28, p-value = 0.08547

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-6.1611557 0.4271893

sample estimates:

mean of x mean of y

102.2136 105.0806

From the R output, the observed value of T is t = -1.7828. The negative sign is due to the fact that the sample mean of x is less than the sample mean of y. Using the qt() function in R, the critical T value for a two-sided test at the 0.05 significance level with 28 degrees of freedom is 2.0484.

# obtain t value for a two-sided test at a 0.05 significance level

qt(p=0.05/2, df=28, lower.tail=FALSE)

2.048407

Because the magnitude of the observed T statistic is less than the T value corresponding to the 0.05 significance level (|-1.7828| < 2.0484), the null hypothesis is not rejected. Because the alternative hypothesis is that the means are not equal (µ1 ≠ µ2), the possibilities of both µ1 > µ2 and µ1 < µ2 need to be considered. This form of Student’s t-test is known as a two-sided hypothesis test, and it is necessary for the sum of the probabilities under both tails of the t-distribution to equal the significance level. It is customary to evenly divide the significance level between both tails. So, p = 0.05/2 = 0.025 was used in the qt() function to obtain the appropriate t-value.

To simplify the comparison of the t-test results to the significance level, the R output includes a quantity known as the p-value. In the preceding example, the p-value is 0.08547, which is the sum of P(T ≤ -1.7828) and P(T ≥ 1.7828). Figure 3.24 illustrates the areas under the tails of a t-distribution, where -t and t are the observed values of the t-statistic; in the R output, t = 1.7828. The left shaded area corresponds to P(T ≤ -1.7828), and the right shaded area corresponds to P(T ≥ 1.7828).
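The reported p-value can be reproduced directly from the observed t statistic. A minimal sketch, using the t = -1.7828 and df = 28 values from the R output above:

```r
# two-sided p-value: twice the upper-tail probability of |t|
2 * pt(abs(-1.7828), df=28, lower.tail=FALSE)
# approximately 0.08547, matching the t.test() output
```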


Figure 3.24 Area under the tails (shaded) of a student’s t-distribution

In the R output, for a significance level of 0.05, the null hypothesis would not be rejected, because a T value of magnitude 1.7828 or greater would occur with probability higher than 0.05. However, based on the p-value, if the significance level were chosen to be 0.10 instead of 0.05, the null hypothesis would be rejected. In general, the p-value offers the probability of observing such a sample result given that the null hypothesis is TRUE.

A key assumption in using Student’s t-test is that the population variances are equal. In the previous example, the t.test() function call includes var.equal=TRUE to specify that equality of the variances should be assumed. If that assumption is not appropriate, then Welch’s t-test should be used.

Welch’s t-test

When the equal population variance assumption is not justified in performing Student’s t-test for the difference of means, Welch’s t-test [14] can be used, based on T expressed in Equation 3.2.

T = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)     (3.2)

where X̄1, S1², and n1 denote the sample mean, sample variance, and sample size of the sample from pop1, and similarly for pop2.


In Welch’s test, under the remaining assumptions of random samples from two normal populations with the same mean, the distribution of T is approximated by the t-distribution. The following R code performs the Welch’s t-test on the same set of data analyzed in the earlier Student’s t-test example.

t.test(x, y, var.equal=FALSE) # run the Welch’s t-test

Welch Two Sample t-test

data: x and y

t = -1.6596, df = 15.118, p-value = 0.1176

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-6.546629 0.812663

sample estimates:

mean of x mean of y

102.2136 105.0806

In this particular example of using Welch’s t-test, the p-value is 0.1176, which is greater than the p-value of 0.08547 observed in the Student’s t-test example. In this case, the null hypothesis would not be rejected at a 0.10 or 0.05 significance level.

It should be noted that the degrees of freedom calculation is not as straightforward as in the Student’s t-test. In fact, the degrees of freedom calculation often results in a non-integer value, as in this example. The degrees of freedom for Welch’s t-test is defined in Equation 3.3.

df = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]     (3.3)


In both the Student’s and Welch’s t-test examples, the R output provides 95% confidence intervals on the difference of the means. In both examples, the confidence intervals straddle zero. Regardless of the result of the hypothesis test, the confidence interval provides an interval estimate of the difference of the population means, not just a point estimate.
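The non-integer degrees of freedom reported by t.test() can be reproduced from the samples with the Welch-Satterthwaite formula. A sketch, reusing the x and y vectors from the earlier example:

```r
# Welch-Satterthwaite degrees of freedom computed from the samples
v1 <- var(x) / length(x)
v2 <- var(y) / length(y)
(v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
# matches the df reported by t.test(x, y, var.equal=FALSE)
```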

A confidence interval is an interval estimate of a population parameter or characteristic based on sample data. A confidence interval is used to indicate the uncertainty of a point estimate. If x̄ is the estimate of some unknown population mean µ, the confidence interval provides an idea of how close x̄ is to the unknown µ. For example, a 95% confidence interval for a population mean straddles the TRUE, but unknown, mean 95% of the time. Consider Figure 3.25 as an example. Assume the confidence level is 95%. If the task is to estimate the mean of an unknown value in a normal distribution with known standard deviation σ, and the estimate based on n observations is x̄, then the interval x̄ ± 1.96·σ/√n straddles the unknown

 


Figure 3.25 A 95% confidence interval straddling the unknown population mean μ

value of μ  with about a 95% chance. If one takes 100 different samples and computes the 95% confidence interval for the mean, 95 of the 100 confidence intervals will be expected to straddle the population mean μ.
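This repeated-sampling interpretation can be illustrated with a short simulation. A sketch, assuming a known mean of 100 and standard deviation of 5 (the exact count varies from run to run):

```r
# count how many of 100 simulated 95% confidence intervals cover the true mean
mu <- 100; sigma <- 5; n <- 30
covered <- replicate(100, {
  s <- rnorm(n, mean=mu, sd=sigma)
  ci <- mean(s) + c(-1, 1) * 1.96 * sigma / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
sum(covered)   # typically close to 95
```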

Confidence intervals appear again in Section 3.3.6 on ANOVA. Returning to the discussion of hypothesis testing, a key assumption in both the Student’s and Welch’s t-test is that the relevant population attribute is normally distributed. For non-normally distributed data, it is sometimes possible to transform the collected data to approximate a normal distribution. For example, taking the logarithm of a dataset can often transform skewed data to a dataset that is at least symmetric around its mean. However, if such transformations are ineffective, there are tests like the Wilcoxon rank-sum test that can be applied to see if two population distributions are different.
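As a quick illustration of the log transform on hypothetical skewed data:

```r
# lognormal data is strongly right-skewed;
# taking its logarithm yields a symmetric, normally distributed dataset
skewed <- rlnorm(1000, meanlog=0, sdlog=1)
hist(skewed)       # right-skewed
hist(log(skewed))  # approximately symmetric around its mean
```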

3.3.3 Wilcoxon Rank-Sum Test

A t-test represents a parametric test in that it makes assumptions about the population distributions from which the samples are drawn. If the populations cannot be assumed or transformed to follow a normal distribution, a nonparametric test can be used. The Wilcoxon rank-sum test [15] is a nonparametric hypothesis test that checks whether two populations are identically distributed. Assuming the two populations are identically distributed, one would expect that the ordering of any sampled observations would be evenly intermixed among themselves. For example, in ordering the observations, one would not expect to see a large number of observations from one population grouped together, especially at the beginning or the end of ordering.

Let the two populations again be pop1 and pop2, with independently drawn random samples of size n1 and n2, respectively. The total number of observations is then N = n1 + n2. The first step of the Wilcoxon test is to rank the set of observations from the two groups as if they came from one large group. The smallest observation receives a rank of 1, the second smallest observation receives a rank of 2, and so on, with the largest observation being assigned the rank of N. Ties among the observations receive a rank equal to the average of the ranks they span. The test uses ranks instead of numerical outcomes to avoid specific assumptions about the shape of the distribution.

After ranking all the observations, the assigned ranks are summed for at least one population’s sample. If the distribution of pop1 is shifted to the right of the other distribution, the rank-sum corresponding to pop1’s sample should be larger than the rank-sum of pop2’s sample. The Wilcoxon rank-sum test determines the significance of the observed rank-sums. The following R code performs the test on the same dataset used for the previous t-test.

wilcox.test(x, y, conf.int = TRUE)

Wilcoxon rank sum test

data: x and y

W = 55, p-value = 0.04903

alternative hypothesis: true location shift is not equal to 0

95 percent confidence interval:

-6.2596774 -0.1240618

sample estimates:

difference in location

-3.417658

The wilcox.test() function ranks the observations, determines the respective rank-sums corresponding to each population’s sample, and then determines the probability of rank-sums of such magnitude being observed, assuming that the population distributions are identical. In this example, the probability is given by the p-value of 0.04903. Thus, the null hypothesis would be rejected at a 0.05 significance level. The reader is cautioned against concluding that one hypothesis test is clearly better than another based solely on the examples given in this section.

Because the Wilcoxon test does not assume anything about the population distribution, it is generally considered more robust than the t-test. In other words, there are fewer assumptions to violate. However, when it is reasonable to assume that the data is normally distributed, Student’s or Welch’s t-test is an appropriate hypothesis test to consider.

3.3.4 Type I and Type II Errors

A hypothesis test may result in two types of errors, depending on whether the test accepts or rejects the null hypothesis. These two errors are known as type I and type II errors.

A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE. The probability of the type I error is denoted by the Greek letter α.

A type II error is the acceptance of a null hypothesis when the null hypothesis is FALSE. The probability of the type II error is denoted by the Greek letter β.

Table 3.6 lists the four possible states of a hypothesis test, including the two types of errors.

Table 3.6 Type I and Type II Error

 


The significance level, as mentioned in the Student’s t-test discussion, is equivalent to the probability of a type I error. For a significance level such as α = 0.05, if the null hypothesis (µ1 = µ2) is TRUE, there is a 5% chance that the observed T value based on the sample data will be large enough to reject the null hypothesis. By selecting an appropriate significance level, the probability of committing a type I error can be defined before any data is collected or analyzed.

The probability of committing a type II error is somewhat more difficult to determine. If two population means are truly not equal, the probability of committing a type II error depends on how far apart the means truly are. To reduce the probability of a type II error to a reasonable level, it is often necessary to increase the sample size. This topic is addressed in the next section.

3.3.5 Power and Sample Size

The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 − β, where β is the probability of a type II error. Because the power of a test improves as the sample size increases, power is used to determine the necessary sample size. For the difference of means, the power of a hypothesis test depends on the true difference of the population means. In other words, for a fixed significance level, a larger sample size is required to detect a smaller difference in the means. In general, the magnitude of the difference is known as the effect size. As the sample size becomes larger, it is easier to detect a given effect size, δ, as illustrated in Figure 3.26.

 


Figure 3.26 A larger sample size better identifies a fixed effect size

With a large enough sample size, almost any effect size can appear statistically significant. However, a very small effect size may be useless in a practical sense. It is important to consider an appropriate effect size for the problem at hand.
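Base R’s power.t.test() function performs these power and sample-size calculations. As a sketch, to find the per-group sample size needed to detect a 5-unit difference in means (the difference between the populations in the earlier t-test examples, with sd = 5) at 80% power and a 0.05 significance level:

```r
# solve for n, the required sample size per group,
# given effect size (delta), sd, significance level, and power
power.t.test(delta=5, sd=5, sig.level=0.05, power=0.80)
```

Leaving n unspecified tells power.t.test() to solve for it; the same function can instead solve for power or delta when n is given.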

3.3.6 ANOVA

The hypothesis tests presented in the previous sections are good for analyzing means between two populations. But what if there are more than two populations? Consider an example of testing the impact of nutrition and exercise on 60 candidates between age 18 and 50. The candidates are randomly split into six groups, each assigned with a different weight loss strategy, and the goal is to determine which strategy is the most effective.

• Group 1 only eats junk food.

• Group 2 only eats healthy food.

• Group 3 eats junk food and does cardio exercise every other day.

• Group 4 eats healthy food and does cardio exercise every other day.

• Group 5 eats junk food and does both cardio and strength training every other day.

• Group 6 eats healthy food and does both cardio and strength training every other day.

Multiple t-tests could be applied to each pair of weight loss strategies. In this example, the weight loss of Group 1 is compared with the weight loss of Group 2, 3, 4, 5, or 6. Similarly, the weight loss of Group 2 is compared with that of the next 4 groups. Therefore, a total of 15 t-tests would be performed.

However, multiple t-tests may not perform well on several populations for two reasons. First, because the number of t-tests increases as the number of groups increases, analysis using the multiple t-tests becomes cognitively more difficult. Second, by doing a greater number of analyses, the probability of committing at least one type I error somewhere in the analysis greatly increases.

Analysis of Variance (ANOVA) is designed to address these issues. ANOVA is a generalization of the hypothesis testing of the difference of two population means. ANOVA tests if any of the population means differ from the other population means. The null hypothesis of ANOVA is that all the population means are equal. The alternative hypothesis is that at least one pair of the population means is not equal. In other words,

H0: µ1 = µ2 = … = µk

HA: µi ≠ µj for at least one pair of i, j

As seen in Section 3.3.2, “Difference of Means,” each population is assumed to be normally distributed with the same variance.

The first thing to calculate for the ANOVA is the test statistic. Essentially, the goal is to test whether the clusters formed by each population are more tightly grouped than the spread across all the populations.

Let the total number of populations be k. The total number of samples N is randomly split into the k groups. The number of samples in the i-th group is denoted as ni, and the mean of the group is x̄i, where i ∈ [1, k]. The mean of all the samples is denoted as x̄0.

The between-groups mean sum of squares, S_B², is an estimate of the between-groups variance. It measures how the group means differ from the overall mean, as shown in Equation 3.4.

S_B² = [ Σ_{i=1..k} ni·(x̄i − x̄0)² ] / (k − 1)     (3.4)

The within-group mean sum of squares, S_W², is an estimate of the within-group variance, quantifying the spread of values within each group, as shown in Equation 3.5, where si² denotes the sample variance of the i-th group.

S_W² = [ Σ_{i=1..k} (ni − 1)·si² ] / (N − k)     (3.5)

The F-test statistic is then defined as the ratio F = S_B² / S_W², which under the null hypothesis follows an F-distribution with k − 1 and N − k degrees of freedom.


The F-test statistic in ANOVA can be thought of as a measure of how different the means are relative to the variability within each group. The larger the observed F-test statistic, the greater the likelihood that the differences between the means are due to something other than chance alone. The F-test statistic is used to test the hypothesis that the observed effects are not due to chance—that is, if the means are significantly different from one another.

Consider an example in which every customer who visits a retail website gets one of two promotional offers or gets no promotion at all. The goal is to see if making the promotional offers makes a difference. ANOVA could be used, and the null hypothesis is that neither promotion makes a difference. The code that follows randomly generates a total of 500 observations of purchase sizes on three different offer options.

offers <- sample(c("offer1", "offer2", "nopromo"), size=500, replace=T)
# Simulated 500 observations of purchase sizes on the 3 offer options
purchasesize <- ifelse(offers=="offer1", rnorm(500, mean=80, sd=30),
                ifelse(offers=="offer2", rnorm(500, mean=85, sd=30),
                       rnorm(500, mean=40, sd=30)))
# create a data frame of offer option and purchase size
offertest <- data.frame(offer=as.factor(offers),
                        purchase_amt=purchasesize)

The summary of the offertest data frame shows that 170 offer1, 161 offer2, and 169 nopromo (no promotion) offers have been made. It also shows the range of purchase size (purchase_amt) for each of the three offer options.

# display a summary of offertest where offer="offer1"
summary(offertest[offertest$offer=="offer1",])

offer purchase_amt

nopromo: 0 Min. : 4.521

offer1 :170 1st Qu.: 58.158

offer2 : 0 Median : 76.944

       Mean : 81.936

       3rd Qu.:104.959

       Max. :180.507

# display a summary of offertest where offer="offer2"
summary(offertest[offertest$offer=="offer2",])

     offer purchase_amt

nopromo: 0 Min. : 14.04

offer1 : 0 1st Qu.: 69.46

offer2 :161 Median : 90.20

       Mean : 89.09

       3rd Qu.:107.48

       Max. :154.33

# display a summary of offertest where offer="nopromo"
summary(offertest[offertest$offer=="nopromo",])

     offer purchase_amt

nopromo:169 Min. :-27.00

offer1 : 0 1st Qu.: 20.22

offer2 : 0 Median : 42.44

        Mean : 40.97

        3rd Qu.: 58.96

        Max. :164.04

The aov() function performs the ANOVA on purchase size and offer options.

# fit ANOVA test
model <- aov(purchase_amt ~ offers, data=offertest)

The summary() function shows a summary of the model. The degrees of freedom for offers is 2, which corresponds to the k − 1 in the denominator of Equation 3.4. The degrees of freedom for residuals is 497, which corresponds to the N − k in the denominator of Equation 3.5.

summary(model)

           Df Sum Sq Mean Sq F value Pr(>F)

offers 2 225222 112611 130.6 <2e-16 ***

Residuals 497 428470 862

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

The output also includes S_B² (112,611), S_W² (862), the F-test statistic (130.6), and the p-value (< 2e-16). The F-test statistic is much greater than 1, and the p-value is much smaller than the significance level 0.05. Thus, the null hypothesis that the means are equal should be rejected.

However, the result does not show whether offer1 is different from offer2, which requires additional tests. The TukeyHSD() function implements Tukey’s Honest Significant Difference (HSD) on all pair-wise tests for difference of means.

TukeyHSD(model)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = purchase_amt ~ offers, data = offertest)

$offers

          diff lwr upr p adj

offer1-nopromo 40.961437 33.4638483 48.45903 0.0000000

offer2-nopromo 48.120286 40.5189446 55.72163 0.0000000

offer2-offer1 7.158849 -0.4315769 14.74928 0.0692895

The result includes p-values of pair-wise comparisons of the three offer options. The p-values for offer1-nopromo and offer2-nopromo are equal to 0, smaller than the significance level 0.05. This suggests that both offer1 and offer2 are significantly different from nopromo. The p-value of 0.0692895 for offer2 against offer1 is greater than the significance level 0.05, which suggests that offer2 is not significantly different from offer1.

Because only the influence of one factor (offers) was examined, the presented ANOVA is known as one-way ANOVA. If the goal is to analyze two factors, such as offers and day of week, that would be a two-way ANOVA [16]. If the goal is to model more than one outcome variable, then multivariate ANOVA (or MANOVA) could be used.
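A hypothetical sketch of the two-way case: suppose each observation also recorded a day-of-week factor (day_of_week is not part of the offertest data generated earlier; it is simulated here purely for illustration).

```r
# add a simulated second factor and fit a two-way ANOVA
offertest$day_of_week <- factor(sample(c("weekday", "weekend"),
                                       size=500, replace=TRUE))
model2 <- aov(purchase_amt ~ offer + day_of_week, data=offertest)
summary(model2)   # reports an F-test and p-value for each factor
```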

 

Summary

R is a popular programming language and environment for data exploration, analytics, and visualization. As an introduction to R, this chapter covers the R GUI, data I/O, attribute and data types, and descriptive statistics. This chapter also discusses how to use R to perform exploratory data analysis, including the discovery of dirty data, visualization of one or more variables, and customization of visualization for different audiences. Finally, the chapter introduces some basic statistical methods. The first statistical method presented in the chapter is hypothesis testing. The Student’s t-test and Welch’s t-test are included as two example hypothesis tests designed for testing the difference of means. Other statistical methods and tools presented in this chapter include confidence intervals, the Wilcoxon rank-sum test, type I and type II errors, effect size, and ANOVA.

 

Exercises

1. How many levels does fdata contain in the following R code?

                      data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)

                      fdata = factor(data)

2. Two vectors, v1 and v2, are created with the following R code:

                       v1 <- 1:5

                       v2 <- 6:2

What are the results of cbind(v1,v2) and rbind(v1,v2)?

3. What R command(s) would you use to remove null values from a dataset?

4. What R command can be used to install an additional R package?

5. What R function is used to encode a vector as a category?

6. What is a rug plot used for in a density plot?

7. An online retailer wants to study the purchase behaviors of its customers. Figure 3.27 shows the density plot of the purchase sizes (in dollars). What would be your recommendation to enhance the plot to detect more structures that otherwise might be missed?

 



Figure 3.27 Density plot of purchase size

8. How many sections does a box-and-whisker plot divide the data into? What are these sections?

9. What attributes are correlated according to Figure 3.18? How would you describe their relationships?

10. What function can be used to fit a nonlinear line to the data?

11. If a graph of data is skewed and all the data is positive, what mathematical technique may be used to help detect structures that might otherwise be overlooked?

12. What is a type I error? What is a type II error? Is one always more serious than the other? Why?

13. Suppose everyone who visits a retail website gets one promotional offer or no promotion at all. We want to see if making a promotional offer makes a difference. What statistical method would you recommend for this analysis?

14. You are analyzing two normally distributed populations, and your null hypothesis is that the mean of the first population is equal to the mean of the second. Assume the significance level is set at 0.05. If the observed p-value is 4.33e-05, what will be your decision regarding the null hypothesis?

 

Bibliography

[1] The R Project for Statistical Computing, “R Licenses.” [Online]. Available: http://www.r-project.org/Licenses/. [Accessed 10 December 2013].

[2] The R Project for Statistical Computing, “The Comprehensive R Archive Network.” [Online]. Available: http://cran.r-project.org/. [Accessed 10 December 2013].

[3] J. Fox and M. Bouchet-Valat, “The R Commander: A Basic-Statistics GUI for R,” CRAN. [Online]. Available: http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/. [Accessed 11 December 2013].

[4] G. Williams, M. V. Culp, E. Cox, A. Nolan, D. White, D. Medri, and A. Waljee, “Rattle: Graphical User Interface for Data Mining in R,” CRAN. [Online]. Available: http://cran.r-project.org/web/packages/rattle/index.html. [Accessed 12 December 2013].

[5] RStudio, “RStudio IDE.” [Online]. Available: http://www.rstudio.com/ide/. [Accessed 11 December 2013].

[6] R Special Interest Group on Databases (R-SIG-DB), “DBI: R Database Interface,” CRAN. [Online]. Available: http://cran.rproject.org/web/packages/DBI/index.html. [Accessed 13 December 2013].

[7] B. Ripley, “RODBC: ODBC Database Access,” CRAN. [Online]. Available: http://cran.r-project.org/web/packages/RODBC/index.html. [Accessed 13 December 2013].

[8] S. S. Stevens, “On the Theory of Scales of Measurement,” Science, vol. 103, no. 2684, pp. 677–680, 1946.

[9] D. C. Hoaglin, F. Mosteller, and J. W. Tukey, Understanding Robust and Exploratory Data Analysis, New York: Wiley, 1983.

[10] F. J. Anscombe, “Graphs in Statistical Analysis,” The American Statistician, vol. 27, no. 1, pp. 17–21, 1973.

[11] H. Wickham, “ggplot2,” 2013. [Online]. Available: http://ggplot2.org/. [Accessed 8 January 2014].

[12] W. S. Cleveland, Visualizing Data, Lafayette, IN: Hobart Press, 1993.

[13] R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[14] B. L. Welch, “The Generalization of ‘Student’s’ Problem When Several Different Population Variances Are Involved,” Biometrika, vol. 34, no. 1–2, pp. 28–35, 1947.

[15] F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[16] J. J. Faraway, “Practical Regression and Anova Using R,” July 2002. [Online]. Available: http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf. [Accessed 22 January 2014].

 
