Industries Needs: Data Science and Big Data Analytics

Wednesday, February 23, 2022

Data Science and Big Data Analytics

  

Discovering, Analyzing, Visualizing and Presenting Data

 

Review of Basic Data Analytic Methods Using R

Key Concepts

1. Basic features of R

2. Data exploration and analysis with R

3. Statistical methods for evaluation

The previous chapter presented the six phases of the Data Analytics Lifecycle.

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

The first three phases involve various aspects of data exploration. In general, the success of a data analysis project requires a deep understanding of the data. It also requires a toolbox for mining and presenting the data. These activities include the study of the data in terms of basic statistical measures and creation of graphs and plots to visualize and identify relationships and patterns. Several free or commercial tools are available for exploring, conditioning, modeling, and presenting data. Because of its popularity and versatility, the open-source programming language R is used to illustrate many of the presented analytical tasks and models in this book.

This chapter introduces the basic functionality of the R programming language and environment. The first section gives an overview of how to use R to acquire, parse, and filter the data as well as how to obtain some basic descriptive statistics on a dataset. The second section examines using R to perform exploratory data analysis tasks using visualization. The final section focuses on statistical inference, such as hypothesis testing and analysis of variance in R.

 

3.1 Introduction to R

R is a programming language and software framework for statistical analysis and graphics. Available for use under the GNU General Public License [1], R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN) [2]. This section provides an overview of the basic functionality of R. In later chapters, this foundation in R is utilized to demonstrate many of the presented analytical techniques.

Before delving into specific operations and functions of R later in this chapter, it is important to understand the flow of a basic R script to address an analytical problem. The following R code illustrates a typical analytical situation in which a dataset is imported, the contents of the dataset are examined, and some model-building tasks are executed. Although the reader may not yet be familiar with the R syntax, the code can be followed by reading the embedded comments, denoted by #. In the following scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file. This dataset is stored to the R variable sales using the assignment operator <-.

# import a CSV file of the total annual sales for each customer

sales <- read.csv("c:/data/yearly_sales.csv")

# examine the imported dataset

head(sales)

summary(sales)

# plot num_of_orders vs. sales

plot(sales$num_of_orders,sales$sales_total,

main="Number of Orders vs. Sales")

# perform a statistical analysis (fit a linear regression model)

results <- lm(sales$sales_total ~ sales$num_of_orders)

summary(results)

# perform some diagnostics on the fitted model

# plot histogram of the residuals

hist(results$residuals, breaks = 800)

In this example, the data file is imported using the read.csv() function. Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales.

# examine the imported dataset

head(sales)

cust_id sales_total num_of_orders gender

1 100001 800.64 3 F

2 100002 217.53 3 F

3 100003 74.58 2 M

4 100004 498.60 3 M

5 100005 723.11 4 F

6 100006 69.43 2 F

The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains two possible characters, an “F” (female) or “M” (male), the summary() function provides the count of each character’s occurrence.

summary(sales)

cust_id sales_total num_of_orders gender

Min. :100001 Min. : 30.02 Min. : 1.000 F:5035

1st Qu.:102501 1st Qu.: 80.29 1st Qu.: 2.000 M:4965

Median :105001 Median : 151.65 Median : 2.000

Mean :105001 Mean : 249.46 Mean : 2.428

3rd Qu.:107500 3rd Qu.: 295.50 3rd Qu.: 3.000

Max. :110000 Max. :7606.09 Max. :22.000

Plotting a dataset’s contents can provide information about the relationships between the various columns. In this example, the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The resulting plot is shown in Figure 3.1.

# plot num_of_orders vs. sales

plot(sales$num_of_orders,sales$sales_total,

main="Number of Orders vs. Sales")

 


Figure 3.1 Graphically examining the data

Each point corresponds to the number of orders and the total sales for a single customer. The plot indicates that annual sales tend to increase with the number of orders placed. Although the observed relationship between these two variables is not purely linear, the analyst decided to apply linear regression using the lm() function as a first step in the modeling process.

results <- lm(sales$sales_total ~ sales$num_of_orders)

results

Call:

lm(formula = sales$sales_total ~ sales$num_of_orders)

Coefficients:

(Intercept) sales$num_of_orders

-154.1 166.2

The resulting intercept and slope values are –154.1 and 166.2, respectively, for the fitted linear equation. However, results stores considerably more information that can be examined with the summary() function. Details on the contents of results are examined by applying the attributes() function. Because regression analysis is presented in more detail later in the book, the reader should not overly focus on interpreting the following output.

summary(results)

Call:

lm(formula = sales$sales_total ~ sales$num_of_orders)

Residuals:

Min 1Q Median 3Q Max

-666.5 -125.5 -26.7 86.6 4103.4

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -154.128 4.129 -37.33 <2e-16 ***

sales$num_of_orders 166.221 1.462 113.66 <2e-16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

Residual standard error: 210.8 on 9998 degrees of freedom

Multiple R-squared: 0.5637, Adjusted R-squared: 0.5637

F-statistic: 1.292e+04 on 1 and 9998 DF, p-value: < 2.2e-16
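Beyond summary(), the components stored in a fitted model object can be inspected directly. The following is a minimal sketch that assumes R's built-in cars dataset in place of the sales data, only so that the example runs without the CSV file; the variable name fit is illustrative.

```r
# fit a simple linear regression on the built-in cars dataset
# (an assumption for illustration; the chapter uses the sales data)
fit <- lm(dist ~ speed, data = cars)

# list the names of the components stored in the fitted model object
names(fit)

# the class attribute is what summary() and plot() use to choose a method
attributes(fit)$class

# extract the intercept and slope, and the first few residuals
coef(fit)
head(residuals(fit))
```

Accessor functions such as coef() and residuals() are generally preferable to reaching into the object with $, because they work the same way across many model types.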

The summary() function is an example of a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and the type of arguments they receive. Utilized previously, plot() is another example of a generic function; the plot is determined by the passed variables. Generic functions are used throughout this chapter and the book. In the final portion of the example, the following R code uses the generic function hist() to generate a histogram (Figure 3.2) of the residuals stored in results. The function call illustrates that optional parameter values can be passed. In this case, the number of breaks is specified to observe the large residuals.

# perform some diagnostics on the fitted model

# plot histogram of the residuals

hist(results$residuals, breaks = 800)

 


Figure 3.2 Evidence of large residuals
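To make the notion of a generic function more concrete, the following sketch defines a small generic of its own. The name describe() and its methods are hypothetical and not part of R; the sketch uses R's S3 mechanism, in which the method is chosen according to the class of the first argument.

```r
# describe() is a hypothetical generic defined only for this illustration
describe <- function(x, ...) UseMethod("describe")

# method used for numeric (integer or double) vectors
describe.numeric <- function(x, ...) {
  paste("numeric vector with mean", mean(x))
}

# method used for character vectors
describe.character <- function(x, ...) {
  paste("character vector with", length(x), "elements")
}

# fallback method used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}

describe(c(1, 2, 3))     # dispatches to describe.numeric
describe(c("a", "b"))    # dispatches to describe.character
describe(TRUE)           # dispatches to describe.default
```

The same mechanism is what allows summary() to produce one kind of output for a data frame and another for a fitted linear model.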

This simple example illustrates a few of the basic model planning and building tasks that may occur in Phases 3 and 4 of the Data Analytics Lifecycle. Throughout this chapter, it is useful to envision how the presented R functionality will be used in a more comprehensive analysis.

 

3.1.1 R Graphical User Interfaces

R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI. For Windows installations, R comes with RGui.exe, which provides a basic graphical user interface (GUI). However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include R Commander [3], Rattle [4], and RStudio [5]. This section presents a brief overview of RStudio, which was used to build the R examples in this book. Figure 3.3 provides a screenshot of the previous R code example executed in RStudio.

 


Figure 3.3 RStudio GUI

The four highlighted window panes follow.

• Scripts: Serves as an area to write and save R code

• Workspace: Lists the datasets and variables in the R environment

• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots

• Console: Provides a history of the executed R code and the output

Additionally, the console pane can be used to obtain help information on R. Figure 3.4 illustrates that by entering ?lm at the console prompt, the help details of the lm() function are provided on the right. Alternatively, help(lm) could have been entered at the console prompt.

 


Figure 3.4 Accessing help in RStudio

Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be implemented with RStudio by selecting the appropriate variable from the workspace pane.

R allows one to save the objects in the workspace environment, such as variables and user-defined functions, into an .RData file using the save.image() function. An existing .RData file can be loaded using the load() function. Loaded packages are not stored in the file and must be reloaded with library(). Tools such as RStudio prompt the user to save the workspace contents prior to exiting the GUI.
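A short sketch of saving and restoring objects follows. It uses save(), which writes only the named objects, because that keeps the example self-contained; save.image() is a shortcut that writes everything in the workspace in the same file format. The file name and the temporary directory are assumptions chosen for illustration.

```r
# hypothetical file location inside the session's temporary directory
rdata_file <- file.path(tempdir(), "example_workspace.RData")

# create two objects and write them to the file
i <- 1
sport <- "football"
save(i, sport, file = rdata_file)

# remove the objects, then restore them from the file
rm(i, sport)
exists("i")        # FALSE unless another object named i is visible
load(rdata_file)
i                  # restored value
sport              # restored value
```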

The reader is encouraged to install R and a preferred GUI to try out the R examples provided in the book and utilize the help functionality to access more details about the discussed topics.

 

3.1.2 Data Import and Export

In the annual retail sales example, the dataset was imported into R using the read.csv() function as in the following code.

sales <- read.csv("c:/data/yearly_sales.csv")

R uses a forward slash (/) as the separator character in the directory and file paths. This convention makes script files somewhat more portable at the expense of some initial confusion on the part of Windows users, who may be accustomed to using a backslash (\) as a separator. To simplify the import of multiple files with long path names, the setwd() function can be used to set the working directory for the subsequent import and export operations, as shown in the following R code.

setwd("c:/data/")

sales <- read.csv("yearly_sales.csv")

Other import functions include read.table() and read.delim(), which are intended to import other common file types such as TXT. These functions can also be used to import the yearly_sales.csv file, as the following code illustrates.

sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")

sales_delim <- read.delim("yearly_sales.csv", sep=",")

The main difference between these import functions is the default values. For example, the read.delim() function expects the column separator to be a tab (“\t”). In the event that the numerical data in a data file uses a comma for the decimal, R also provides two additional functions, read.csv2() and read.delim2(), to import such data. Table 3.1 includes the expected defaults for headers, column separators, and decimal point notations.

Table 3.1 Import Function Defaults
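The defaults of the import functions can also be read directly from R. The following sketch queries the header, sep, and dec arguments of each function with formals(); the variable names import_functions and defaults are illustrative.

```r
# the import functions discussed in this section
import_functions <- c("read.table", "read.csv", "read.csv2",
                      "read.delim", "read.delim2")

# for each function, look up the default header, separator, and decimal mark
defaults <- t(sapply(import_functions, function(f) {
  args <- formals(get(f))
  c(header = deparse(args$header),
    sep    = deparse(args$sep),
    dec    = deparse(args$dec))
}))

# one row per function; strings are shown quoted, so "\t" denotes a tab
defaults
```

Querying the functions this way reflects the installed version of R, which is useful if defaults change between releases.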

 


The analogous R functions such as write.table(), write.csv(), and write.csv2() enable exporting of R datasets to an external file. For example, the following R code adds an additional column to the sales dataset and exports the modified dataset to an external file.

# add a column for the average sales per order

sales$per_order <- sales$sales_total/sales$num_of_orders

# export data as tab delimited without the row names

write.table(sales,"sales_modified.txt", sep="\t", row.names=FALSE)

Sometimes it is necessary to read data from a database management system (DBMS). R packages such as DBI [6] and RODBC [7] are available for this purpose. These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The following R code demonstrates how to install the RODBC package with the install.packages() function. The library() function loads the package into the R workspace. Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via open database connectivity (ODBC) with user user. The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.

install.packages("RODBC")

library(RODBC)

conn <- odbcConnect("training2", uid="user", pwd="password")

The connector needs to be present to submit a SQL query to an ODBC database by using the sqlQuery() function from the RODBC package. The following R code retrieves specific columns from the housing table in which household income (hinc) is greater than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms

from housing

where hinc > 1000000")

head(housing_data)

serialno state persons rooms

1 3417867 6 2 7

2 3417867 6 2 7

3 4552088 6 5 9

4 4552088 6 5 9

5 8699293 6 5 5

6 8699293 6 5 5

Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available in R to save plots in the desired format.

jpeg(file="c:/data/sales_hist.jpeg") # create a new jpeg file

hist(sales$num_of_orders) # export histogram to jpeg

dev.off() # shut off the graphic device
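The same pattern applies to the other graphic devices. The sketch below writes a histogram to a PDF file; the output path in the temporary directory and the small orders vector (taken from the num_of_orders values in the head(sales) output) are assumptions made so the example runs without the sales file.

```r
# hypothetical output location inside the session's temporary directory
out_file <- file.path(tempdir(), "orders_hist.pdf")

# num_of_orders for the first six customers shown earlier
orders <- c(3, 3, 2, 3, 4, 2)

pdf(file = out_file)                      # open the PDF graphic device
hist(orders, main = "Number of Orders")   # draw the histogram into the file
invisible(dev.off())                      # close the device to finish writing

file.exists(out_file)                     # confirm that the file was created
```

Closing the device with dev.off() matters: until it is called, the file may be incomplete.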

More information on data imports and exports can be found at http://cran.r-project.org/doc/manuals/r-release/R-data.html, such as how to import datasets from statistical software packages including Minitab, SAS, and SPSS.

 

3.1.3 Attribute and Data Types

In the earlier example, the sales variable contained a record for each customer. Several characteristics, such as total annual sales, number of orders, and gender, were provided for each customer. In general, these characteristics or attributes provide the qualitative and quantitative measures for each item or subject of interest. Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR) [8]. Table 3.2 distinguishes these four attribute types and shows the operations they support. Nominal and ordinal attributes are considered categorical attributes, whereas interval and ratio attributes are considered numeric attributes.

Table 3.2 NOIR Attribute Types

 


Data of one attribute type may be converted to another. For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as {Infant, Adolescent, Adult, Senior}. Understanding the attribute types in a given dataset is important to ensure that the appropriate descriptive statistics and analytic methods are applied and properly interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes are not very meaningful or appropriate. Proper handling of categorical variables will be addressed in subsequent chapters. Also, it is useful to consider these attribute types during the following discussion on R data types.

Numeric, Character, and Logical Data Types

The following R code creates a numeric, a character, and a logical variable.

i <- 1 # create a numeric variable

sport <- "football" # create a character variable

flag <- TRUE # create a logical variable

R provides several functions, such as class() and typeof(), to examine the characteristics of a given variable. The class() function represents the abstract class of an object. The typeof() function determines the way an object is stored in memory. Although i appears to be an integer, i is internally stored using double precision. To improve the readability of the code segments in this section, the inline R comments are used to explain the code or to provide the returned values.

class(i) # returns “numeric”

typeof(i) # returns “double”

class(sport) # returns “character”

typeof(sport) # returns “character”

class(flag) # returns “logical”

typeof(flag) # returns “logical”

Additional R functions exist that can test the variables and coerce a variable into a specific type. The following R code illustrates how to test if i is an integer using the is.integer() function and to coerce i into a new integer variable, j, using the as.integer() function. Similar functions can be applied for double, character, and logical types.

is.integer(i) # returns FALSE

j <- as.integer(i) # coerces contents of i into an integer

is.integer(j) # returns TRUE
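The analogous test and coercion functions for the other types follow the same is.*() and as.*() naming pattern. The following sketch shows a few of them; the variable names are illustrative.

```r
i <- 1                        # numeric variable, stored as a double

s <- as.character(i)          # coerce the number to a character string
is.character(s)               # test for character type

d <- as.double("2.5")         # coerce a character string to a double
is.double(d)                  # test for double type

l <- as.logical("TRUE")       # coerce a character string to a logical
is.logical(l)                 # test for logical type

# coercion to integer truncates toward zero rather than rounding
as.integer(3.9)

# values that cannot be converted become NA (R also issues a warning)
suppressWarnings(as.integer("abc"))
```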

The application of the length() function reveals that the created variables each have a length of 1. One might have expected the returned length of sport to have been 8 for each of the characters in the string “football”. However, these three variables are actually one-element vectors.

length(i) # returns 1

length(flag) # returns 1

length(sport) # returns 1 (not 8 for “football”)

Vectors

Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors. A vector can only consist of values in the same class. The tests for vectors can be conducted using the is.vector() function.

is.vector(i) # returns TRUE

is.vector(flag) # returns TRUE

is.vector(sport) # returns TRUE

R provides functionality that enables the easy creation and manipulation of vectors. The following R code illustrates how a vector can be created using the combine function, c() or the colon operator, :, to build a vector from the sequence of integers from 1 to 5. Furthermore, the code shows how the values of an existing vector can be easily modified or accessed. The code, related to the z vector, indicates how logical comparisons can be built to extract certain elements of a given vector.

u <- c("red", "yellow", "blue") # create a vector "red" "yellow" "blue"

u # returns “red” “yellow” “blue”

u[1] # returns “red” (1st element in u)

v <- 1:5 # create a vector 1 2 3 4 5

v # returns 1 2 3 4 5

sum(v) # returns 15

w <- v * 2 # create a vector 2 4 6 8 10

w # returns 2 4 6 8 10

w[3] # returns 6 (the 3rd element of w)

z <- v + w # sum the two vectors element by element

z # returns 3 6 9 12 15

z > 8 # returns FALSE FALSE TRUE TRUE TRUE

z[z > 8] # returns 9 12 15

z[z > 8 | z < 5] # returns 3 9 12 15 (“|” denotes “or”)

Sometimes it is necessary to initialize a vector of a specific length and then populate the content of the vector later. The vector() function, by default, creates a logical vector. A vector of a different type can be specified by using the mode parameter. The vector c, an integer vector of length 0, may be useful when the number of elements is not initially known and the new elements will later be added to the end of the vector as the values become available.

a <- vector(length=3) # create a logical vector of length 3

a # returns FALSE FALSE FALSE

b <- vector(mode="numeric", 3) # create a numeric vector of length 3

typeof(b) # returns “double”

b[2] <- 3.1 # assign 3.1 to the 2nd element

b # returns 0.0 3.1 0.0

c <- vector(mode="integer", 0) # create an integer vector of length 0

c # returns integer(0)

length(c) # returns 0
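As a sketch of how such an empty vector can be populated later, the following code appends values one at a time with c(). The name ids and the appended values are illustrative.

```r
# start with an integer vector of length 0
ids <- vector(mode = "integer", length = 0)

# append a new element to the end as each value becomes available
for (k in 1:3) {
  ids <- c(ids, k * 10L)   # the L suffix keeps the values as integers
}

ids           # the three appended values
length(ids)   # number of elements after appending
typeof(ids)   # the vector is still stored as integer
```

Appending in a loop copies the vector each time, so when the final length is known in advance, preallocating with vector(length=n) is usually the better choice.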

Although vectors may appear to be analogous to arrays of one dimension, they are technically dimensionless, as seen in the following R code. The concept of arrays and matrices is addressed in the following discussion.

length(b) # returns 3

dim(b) # returns NULL (an undefined value)

Arrays and Matrices

The array() function can be used to restructure a vector as an array. For example, the following R code builds a three-dimensional array to hold the quarterly sales for three regions over a two-year period and then assigns the sales amount of $158,000 to the second region for the first quarter of the first year.

# the dimensions are 3 regions, 4 quarters, and 2 years

quarterly_sales <- array(0, dim=c(3,4,2))

quarterly_sales[2,1,1] <- 158000

quarterly_sales

, , 1

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 158000 0 0 0

[3,] 0 0 0 0

, , 2

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 0 0 0 0

[3,] 0 0 0 0

A two-dimensional array is known as a matrix. The following code initializes a matrix to hold the quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns, respectively, for the sales_matrix.

sales_matrix <- matrix(0, nrow = 3, ncol = 4)

sales_matrix

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 0 0 0 0

[3,] 0 0 0 0

R provides the standard matrix operations such as addition, subtraction, and multiplication, as well as the transpose function t() and the inverse matrix function matrix.inverse() included in the matrixcalc package. The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to obtain the identity matrix.

library(matrixcalc)

M <- matrix(c(1,3,3,5,0,4,3,3,3),nrow = 3,ncol = 3) # build a 3x3 matrix

M %*% matrix.inverse(M) # multiply M by inverse(M)

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 1 0

[3,] 0 0 1
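If the matrixcalc package is not installed, the inverse can also be obtained with the base R function solve(). The following sketch repeats the calculation for the same matrix M; because of floating-point arithmetic the product is compared with the identity matrix using all.equal() rather than exact equality.

```r
# the same 3x3 matrix as in the example above
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)

# solve() with a single matrix argument returns its inverse
M_inv <- solve(M)

# multiply M by its inverse
product <- M %*% M_inv

# display the product rounded to suppress tiny floating-point errors
round(product, 10)

# compare with the 3x3 identity matrix within numerical tolerance
isTRUE(all.equal(product, diag(3)))
```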

 

Data Frames

Similar to the concept of matrices, data frames provide a structure for storing and accessing several variables of possibly different data types. In fact, as the is.data.frame() function indicates, a data frame was created by the read.csv() function at the beginning of the chapter.

#import a CSV file of the total annual sales for each customer

sales <- read.csv("c:/data/yearly_sales.csv")

is.data.frame(sales) # returns TRUE

As seen earlier, the variables stored in the data frame can be easily accessed using the $ notation. The following R code illustrates that in this example, each variable is a vector with the exception of gender, which was imported as a factor. This was the read.csv() default in versions of R before 4.0.0; in R 4.0.0 and later, character columns are imported as character vectors unless stringsAsFactors=TRUE is passed to read.csv(). Discussed in detail later in this section, a factor denotes a categorical variable, typically with a few finite levels such as “F” and “M” in the case of gender.

length(sales$num_of_orders) # returns 10000 (number of customers)

is.vector(sales$cust_id) # returns TRUE

is.vector(sales$sales_total) # returns TRUE

is.vector(sales$num_of_orders) # returns TRUE

is.vector(sales$gender) # returns FALSE

is.factor(sales$gender) # returns TRUE

Because of their flexibility to handle many data types, data frames are the preferred input format for many of the modeling functions available in R. The following use of the str() function provides the structure of the sales data frame. This function identifies the integer and numeric (double) data types, the factor variables and levels, as well as the first few values for each variable.

str(sales) # display structure of the data frame object

‘data.frame’: 10000 obs. of 4 variables:

$ cust_id : int 100001 100002 100003 100004 100005 100006 …

$ sales_total : num 800.6 217.5 74.6 498.6 723.1 …

$ num_of_orders: int 3 3 2 3 4 2 2 2 2 2 …

$ gender : Factor w/ 2 levels “F”,“M”: 1 1 2 2 1 1 2 2 1 2

In the simplest sense, data frames are lists of variables of the same length. A subset of the data frame can be retrieved through subsetting operators. R’s subsetting operators are powerful in that they allow one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.

# extract the fourth column of the sales data frame

sales[,4]

# extract the gender column of the sales data frame

sales$gender

# retrieve the first two rows of the data frame

sales[1:2,]

# retrieve the first, third, and fourth columns

sales[,c(1,3,4)]

# retrieve both the cust_id and the sales_total columns

sales[,c("cust_id", "sales_total")]

# retrieve all the records whose gender is female

sales[sales$gender=="F",]

The following R code shows that the class of the sales variable is a data frame. However, the type of the sales variable is a list. A list is a collection of objects that can be of various types, including other lists.

class(sales)

“data.frame”

typeof(sales)

“list”

 

Lists

Lists can contain any type of objects, including other lists. Using the vector v and the matrix M created in earlier examples, the following R code creates assortment, a list of different object types.

# build an assorted list of a string, a numeric, a list, a vector,

# and a matrix

housing <- list("own", "rent")

assortment <- list("football", 7.5, housing, v, M)

assortment

[[1]]

[1] “football”

[[2]]

[1] 7.5

[[3]]

[[3]][[1]]

[1] “own”

[[3]][[2]]

[1] “rent”

[[4]]

[1] 1 2 3 4 5

[[5]]

[,1] [,2] [,3]

[1,] 1 5 3

[2,] 3 0 3

[3,] 3 4 3

In displaying the contents of assortment, the use of the double brackets, [[]], is of particular importance. As the following R code illustrates, the use of the single set of brackets only accesses an item in the list, not its content.

# examine the fifth object, M, in the list

class(assortment[5]) # returns “list”

length(assortment[5]) # returns 1

class(assortment[[5]]) # returns “matrix” (“matrix” “array” in R 4.0.0 and later)

length(assortment[[5]]) # returns 9 (for the 3x3 matrix)

 

As presented earlier in the data frame discussion, the str() function offers details about the structure of a list.

str(assortment)

List of 5

$ : chr “football”

$ : num 7.5

$ :List of 2

..$ : chr “own”

..$ : chr “rent”

$ : int [1:5] 1 2 3 4 5

$ : num [1:3, 1:3] 1 3 3 5 0 4 3 3 3

Factors

Factors were briefly introduced during the discussion of the gender variable in the data frame sales. In this case, gender could assume one of two levels: F or M. Factors can be ordered or not ordered. In the case of gender, the levels are not ordered.

class(sales$gender) # returns “factor”

is.ordered(sales$gender) # returns FALSE

Included with the ggplot2 package, the diamonds data frame contains three ordered factors. Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium, and Ideal. Thus, sales$gender contains nominal data, and diamonds$cut contains ordinal data.

head(sales$gender) # display first six values and the levels

F F M M F F

Levels: F M

library(ggplot2)

data(diamonds) # load the data frame into the R workspace

str(diamonds)

‘data.frame’: 53940 obs. of 10 variables:

$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 …

$ cut : Ord.factor w/ 5 levels “Fair”<“Good”<..: 5 4 2 4 2 3 …

$ color : Ord.factor w/ 7 levels “D”<“E”<“F”<“G”<..: 2 2 2 6 7 7 …

$ clarity: Ord.factor w/ 8 levels “I1”<“SI2”<“SI1”<..: 2 3 5 4 2 …

$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 …

$ table : num 55 61 65 58 58 57 57 55 61 61 …

$ price : int 326 326 327 334 335 336 336 337 337 338 …

$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 …

$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 …

$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 …

head(diamonds$cut) # display first six values and the levels

Ideal Premium Good Premium Good Very Good

Levels: Fair < Good < Very Good < Premium < Ideal

Suppose it is decided to categorize sales$sales_total into three groups (small, medium, and big) according to the sales amount with the following code. These groupings are the basis for the new ordinal factor, spender, with levels {small, medium, big}.

# build an empty character vector of the same length as sales

sales_group <- vector(mode="character",

length=length(sales$sales_total))

# group the customers according to the sales amount

sales_group[sales$sales_total<100] <- "small"

sales_group[sales$sales_total>=100 & sales$sales_total<500] <- "medium"

sales_group[sales$sales_total>=500] <- "big"

# create and add the ordered factor to the sales data frame

spender <- factor(sales_group,levels=c("small", "medium", "big"),

ordered = TRUE)

sales <- cbind(sales,spender)

str(sales$spender)

Ord.factor w/ 3 levels “small”<“medium”<..: 3 2 1 2 3 1 1 1 2 1 …

head(sales$spender)

big medium small medium big small

Levels: small < medium < big

The cbind() function is used to combine variables column-wise. The rbind() function is used to combine datasets row-wise. The use of factors is important in several R statistical modeling functions, such as analysis of variance, aov(), presented later in this chapter, and the use of contingency tables, discussed next.
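The same grouping can be sketched more compactly with the base R function cut(), which bins a numeric vector and can return an ordered factor directly. This is an alternative to the approach above, not the method used in the chapter. The sales_total values below are the first six from the head(sales) output, used so the example runs without the sales file.

```r
# sales_total for the first six customers shown earlier
sales_total <- c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43)

# bin the values into [-Inf,100), [100,500), and [500,Inf)
# right = FALSE makes each interval closed on the left, so 100 is "medium"
spender <- cut(sales_total,
               breaks = c(-Inf, 100, 500, Inf),
               labels = c("small", "medium", "big"),
               right = FALSE,
               ordered_result = TRUE)

spender               # the ordered factor with its levels
is.ordered(spender)   # confirm that the factor is ordered
```

For these six customers the result matches the head(sales$spender) output shown above.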

Contingency Tables

In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset. Such a table is commonly referred to as a contingency table and is the basis for performing a statistical test on the independence of the factors used to build the table. The following R code builds a contingency table based on the sales$gender and sales$spender factors.

# build a contingency table based on the gender and spender factors

sales_table <- table(sales$gender,sales$spender)

sales_table

small medium big

F 1726 2746 563

M 1656 2723 586

class(sales_table) # returns “table”

typeof(sales_table) # returns “integer”

dim(sales_table) # returns 2 3

# performs a chi-squared test

summary(sales_table)

Number of cases in table: 10000

Number of factors: 2

Test for independence of all factors:

Chisq = 1.516, df = 2, p-value = 0.4686

Based on the observed counts in the table, the summary() function performs a chi-squared test on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed independence of the two factors is not rejected. Hypothesis testing and p-values are covered in more detail later in this chapter. Next, applying descriptive statistics in R is examined.
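The test can also be run explicitly with the chisq.test() function. The following sketch rebuilds the contingency table from the counts printed above, so it does not need the sales data; the names counts and test_result are illustrative.

```r
# observed counts copied from the sales_table output above
counts <- matrix(c(1726, 2746, 563,
                   1656, 2723, 586),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender  = c("F", "M"),
                                 spender = c("small", "medium", "big")))

# Pearson's chi-squared test of independence between gender and spender
test_result <- chisq.test(counts)

test_result$statistic   # chi-squared statistic
test_result$parameter   # degrees of freedom
test_result$p.value     # p-value

# counts expected under independence, for comparison with the observed counts
round(test_result$expected, 1)
```

The statistic, degrees of freedom, and p-value should agree with the summary(sales_table) output, since both apply the same Pearson test to the same counts.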

 

3.1.4 Descriptive Statistics

It has already been shown that the summary() function provides several descriptive statistics, such as the mean and median, about a variable such as the sales data frame. The results now include the counts for the three levels of the spender variable based on the earlier examples involving factors.

summary(sales)

cust_id sales_total num_of_orders gender spender

Min. :100001 Min. : 30.02 Min. : 1.000 F:5035 small :3382

1st Qu.:102501 1st Qu.: 80.29 1st Qu.: 2.000 M:4965 medium:5469

Median :105001 Median : 151.65 Median : 2.000 big :1149

Mean :105001 Mean : 249.46 Mean : 2.428

3rd Qu.:107500 3rd Qu.: 295.50 3rd Qu.: 3.000

Max. :110000 Max. :7606.09 Max. :22.000

The following code provides some common R functions that compute descriptive statistics. The comments give each returned value and, in parentheses, describe the function.

# to simplify the function calls, assign

x <- sales$sales_total

y <- sales$num_of_orders

cor(x,y) # returns 0.7508015 (correlation)

cov(x,y) # returns 345.2111 (covariance)

IQR(x) # returns 215.21 (interquartile range)

mean(x) # returns 249.4557 (mean)

median(x) # returns 151.65 (median)

range(x) # returns 30.02 7606.09 (min max)

sd(x) # returns 319.0508 (std. dev.)

var(x) # returns 101793.4 (variance)

The IQR() function provides the difference between the third and the first quartiles. The other functions are fairly self-explanatory by their names. The reader is encouraged to review the available help files for acceptable inputs and possible options.

The function apply() is useful when the same function is to be applied to several variables in a data frame. For example, the following R code calculates the standard deviation for the first three variables in sales. In the code, setting MARGIN=2 specifies that the sd() function is applied over the columns. Other functions, such as lapply() and sapply(), apply a function to a list or vector. Readers can refer to the R help files to learn how to use these functions.

apply(sales[,c(1:3)], MARGIN=2, FUN=sd)

cust_id sales_total num_of_orders

2886.895680 319.050782 1.441119
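As a sketch of the related functions, the code below applies sapply() and lapply() to a small data frame. The data frame sales_sample is hypothetical; its values are the first six rows of the sales_total and num_of_orders columns shown earlier.

```r
# a small stand-in for the sales data frame
sales_sample <- data.frame(
  sales_total   = c(800.64, 217.53, 74.58, 498.60, 723.11, 69.43),
  num_of_orders = c(3, 3, 2, 3, 4, 2)
)

# sapply() applies sd() to each column and simplifies to a named vector
col_sd <- sapply(sales_sample, sd)
col_sd

# lapply() applies range() to each column and always returns a list
col_range <- lapply(sales_sample, range)
col_range
```

Unlike apply(), these functions treat the data frame as a list of columns and do not first convert it to a matrix, which matters when the columns have different types.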

Additional descriptive statistics can be applied with user-defined functions. The following R code defines a function, my_range(), to compute the difference between the maximum and minimum values returned by the range() function. In general, user-defined functions are useful for any task or operation that needs to be frequently repeated. More information on user-defined functions is available by entering help(“function”) in the console.

# build a function to provide the difference between

# the maximum and the minimum values

my_range <- function(v) {range(v)[2] - range(v)[1]}

my_range(x)

7576.07
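A slightly more defensive variant of this function is sketched below. The name my_range2() and the na.rm argument are additions for illustration; the argument is passed through to range() so that missing values can be ignored when requested.

```r
# difference between the maximum and minimum values,
# optionally ignoring missing values
my_range2 <- function(v, na.rm = FALSE) {
  diff(range(v, na.rm = na.rm))
}

# the minimum and maximum sales_total values reported by range(x) above
my_range2(c(30.02, 7606.09))

# with a missing value, the default result is NA
my_range2(c(1, NA, 5))

# setting na.rm = TRUE drops the missing value first
my_range2(c(1, NA, 5), na.rm = TRUE)
```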

 
