
Wednesday, February 23, 2022

Data Science and Big Data Analytics

  

Discovering, Analyzing, Visualizing and Presenting Data

 

Review of Basic Data Analytic Methods Using R

Key Concepts

1. Basic features of R

2. Data exploration and analysis with R

3. Statistical methods for evaluation

The previous chapter presented the six phases of the Data Analytics Lifecycle.

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

The first three phases involve various aspects of data exploration. In general, the success of a data analysis project requires a deep understanding of the data. It also requires a toolbox for mining and presenting the data. These activities include the study of the data in terms of basic statistical measures and creation of graphs and plots to visualize and identify relationships and patterns. Several free or commercial tools are available for exploring, conditioning, modeling, and presenting data. Because of its popularity and versatility, the open-source programming language R is used to illustrate many of the presented analytical tasks and models in this book.

This chapter introduces the basic functionality of the R programming language and environment. The first section gives an overview of how to use R to acquire, parse, and filter the data as well as how to obtain some basic descriptive statistics on a dataset. The second section examines using R to perform exploratory data analysis tasks using visualization. The final section focuses on statistical inference, such as hypothesis testing and analysis of variance in R.

 

3.1 Introduction to R

R is a programming language and software framework for statistical analysis and graphics. Available for use under the GNU General Public License [1], R software and installation instructions can be obtained via the Comprehensive R Archive and Network [2]. This section provides an overview of the basic functionality of R. In later chapters, this foundation in R is utilized to demonstrate many of the presented analytical techniques.

Before delving into specific operations and functions of R later in this chapter, it is important to understand the flow of a basic R script to address an analytical problem. The following R code illustrates a typical analytical situation in which a dataset is imported, the contents of the dataset are examined, and some model building tasks are executed. Although the reader may not yet be familiar with the R syntax, the code can be followed by reading the embedded comments, denoted by #. In the following scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file. This dataset is stored in the R variable sales using the assignment operator <-.

# import a CSV file of the total annual sales for each customer

sales <- read.csv("c:/data/yearly_sales.csv")

# examine the imported dataset

head(sales)

summary(sales)

# plot num_of_orders vs. sales

plot(sales$num_of_orders,sales$sales_total,
     main="Number of Orders vs. Sales")

# perform a statistical analysis (fit a linear regression model)

results <- lm(sales$sales_total ~ sales$num_of_orders)

summary(results)

# perform some diagnostics on the fitted model

# plot histogram of the residuals

hist(results$residuals, breaks = 800)

In this example, the data file is imported using the read.csv() function. Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales.

# examine the imported dataset

head(sales)

cust_id sales_total num_of_orders gender

1 100001 800.64 3 F

2 100002 217.53 3 F

3 100003 74.58 2 M

4 100004 498.60 3 M

5 100005 723.11 4 F

6 100006 69.43 2 F

The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains two possible characters, an “F” (female) or “M” (male), the summary() function provides the count of each character’s occurrence.

summary(sales)

cust_id sales_total num_of_orders gender

Min. :100001 Min. : 30.02 Min. : 1.000 F:5035

1st Qu.:102501 1st Qu.: 80.29 1st Qu.: 2.000 M:4965

Median :105001 Median : 151.65 Median : 2.000

Mean :105001 Mean : 249.46 Mean : 2.428

3rd Qu.:107500 3rd Qu.: 295.50 3rd Qu.: 3.000

Max. :110000 Max. :7606.09 Max. :22.000

Plotting a dataset’s contents can provide information about the relationships between the various columns. In this example, the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The resulting plot is shown in Figure 3.1.

# plot num_of_orders vs. sales

plot(sales$num_of_orders,sales$sales_total,
     main="Number of Orders vs. Sales")

 


Figure 3.1 Graphically examining the data

Each point corresponds to the number of orders and the total sales for each customer. The plot indicates that the annual sales are proportional to the number of orders placed. Although the observed relationship between these two variables is not purely linear, the analyst decided to apply linear regression using the lm() function as a first step in the modeling process.

results <- lm(sales$sales_total ~ sales$num_of_orders)

results

Call:

lm(formula = sales$sales_total ~ sales$num_of_orders)

Coefficients:

(Intercept) sales$num_of_orders

-154.1 166.2

The resulting intercept and slope values are –154.1 and 166.2, respectively, for the fitted linear equation. However, results stores considerably more information that can be examined with the summary() function. Details on the contents of results are examined by applying the attributes() function. Because regression analysis is presented in more detail later in the book, the reader should not overly focus on interpreting the following output.

summary(results)

Call:

lm(formula = sales$sales_total ~ sales$num_of_orders)

Residuals:

Min 1Q Median 3Q Max

-666.5 -125.5 -26.7 86.6 4103.4

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -154.128 4.129 -37.33 <2e-16 ***

sales$num_of_orders 166.221 1.462 113.66 <2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 210.8 on 9998 degrees of freedom

Multiple R-squared: 0.5637, Adjusted R-squared: 0.5637

F-statistic: 1.292e+04 on 1 and 9998 DF, p-value: < 2.2e-16

The summary() function is an example of a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and the type of arguments they receive. Utilized previously, plot() is another example of a generic function; the plot is determined by the passed variables. Generic functions are used throughout this chapter and the book. In the final portion of the example, the following R code uses the generic function hist() to generate a histogram (Figure 3.2) of the residuals stored in results. The function call illustrates that optional parameter values can be passed. In this case, the number of breaks is specified to observe the large residuals.

# perform some diagnostics on the fitted model

# plot histogram of the residuals

hist(results$residuals, breaks = 800)

 


Figure 3.2 Evidence of large residuals

This simple example illustrates a few of the basic model planning and building tasks that may occur in Phases 3 and 4 of the Data Analytics Lifecycle. Throughout this chapter, it is useful to envision how the presented R functionality will be used in a more comprehensive analysis.
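As a brief aside on the generic functions just described, the following sketch reuses the sales and results objects from this example and calls summary() on arguments of different classes; the exact values depend on the data, but the form of the output changes with the class of the argument.

summary(sales$sales_total) # numeric vector: min, quartiles, mean, and max
summary(sales$gender)      # factor: the count of each level ("F" and "M")
summary(results)           # lm object: the full regression summary shown earlier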

 

3.1.1 R Graphical User Interfaces

R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI. For Windows installations, R comes with RGui.exe, which provides a basic graphical user interface (GUI). However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include R Commander [3], Rattle [4], and RStudio [5]. This section presents a brief overview of RStudio, which was used to build the R examples in this book. Figure 3.3 provides a screenshot of the previous R code example executed in RStudio.

 


Figure 3.3 RStudio GUI

The four highlighted window panes follow.

• Scripts: Serves as an area to write and save R code

• Workspace: Lists the datasets and variables in the R environment

• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots

• Console: Provides a history of the executed R code and the output

Additionally, the console pane can be used to obtain help information on R. Figure 3.4 illustrates that by entering ?lm at the console prompt, the help details of the lm() function are provided on the right. Alternatively, help(lm) could have been entered at the console prompt.

 


Figure 3.4 Accessing help in RStudio

Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be implemented with RStudio by selecting the appropriate variable from the workspace pane.

R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata file using the save.image() function. An existing .Rdata file can be reloaded using the load() function. Tools such as RStudio prompt the user about whether to save the workspace contents prior to exiting the GUI.
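A minimal sketch of this save-and-restore workflow follows; the file name is illustrative.

save.image("my_workspace.RData") # save all workspace objects to a file
rm(list=ls())                    # remove all objects from the current workspace
load("my_workspace.RData")       # restore the saved objects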

The reader is encouraged to install R and a preferred GUI to try out the R examples provided in the book and utilize the help functionality to access more details about the discussed topics.

 

3.1.2 Data Import and Export

In the annual retail sales example, the dataset was imported into R using the read.csv() function as in the following code.

sales <- read.csv("c:/data/yearly_sales.csv")

R uses a forward slash (/) as the separator character in the directory and file paths. This convention makes script files somewhat more portable at the expense of some initial confusion on the part of Windows users, who may be accustomed to using a backslash (\) as a separator. To simplify the import of multiple files with long path names, the setwd() function can be used to set the working directory for the subsequent import and export operations, as shown in the following R code.

setwd("c:/data/")

sales <- read.csv("yearly_sales.csv")

Other import functions include read.table() and read.delim(), which are intended to import other common file types such as TXT. These functions can also be used to import the yearly_sales.csv file, as the following code illustrates.

sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")

sales_delim <- read.delim("yearly_sales.csv", sep=",")

The main difference between these import functions is their default values. For example, the read.delim() function expects the column separator to be a tab ("\t"). In the event that the numerical data in a data file uses a comma for the decimal, R also provides two additional functions—read.csv2() and read.delim2()—to import such data. Table 3.1 includes the expected defaults for headers, column separators, and decimal point notations.

Table 3.1 Import Function Defaults
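As a brief, hedged illustration of these two functions, the following sketch assumes a hypothetical file, yearly_sales_eu.csv, that uses a semicolon as the column separator and a comma as the decimal point.

sales_eu <- read.csv2("yearly_sales_eu.csv")              # expects sep=";" and dec=","
sales_eu2 <- read.delim2("yearly_sales_eu.csv", sep=";")  # default sep is "\t", overridden here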

 


The analogous R functions such as write.table(), write.csv(), and write.csv2() enable exporting of R datasets to an external file. For example, the following R code adds an additional column to the sales dataset and exports the modified dataset to an external file.

# add a column for the average sales per order

sales$per_order <- sales$sales_total/sales$num_of_orders

# export data as tab delimited without the row names

write.table(sales, "sales_modified.txt", sep="\t", row.names=FALSE)

Sometimes it is necessary to read data from a database management system (DBMS). R packages such as DBI [6] and RODBC [7] are available for this purpose. These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The following R code demonstrates how to install the RODBC package with the install.packages() function. The library() function loads the package into the R workspace. Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via open database connectivity (ODBC) with user user. The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.

install.packages("RODBC")

library(RODBC)

conn <- odbcConnect("training2", uid="user", pwd="password")

Once the connector is established, a SQL query can be submitted to the ODBC database using the sqlQuery() function from the RODBC package. The following R code retrieves specific columns from the housing table in which household income (hinc) is greater than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                from housing
                                where hinc > 1000000")

head(housing_data)

serialno state persons rooms

1 3417867 6 2 7

2 3417867 6 2 7

3 4552088 6 5 9

4 4552088 6 5 9

5 8699293 6 5 5

6 8699293 6 5 5

Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available in R to save plots in the desired format.

jpeg(file="c:/data/sales_hist.jpeg") # create a new jpeg file

hist(sales$num_of_orders) # export histogram to jpeg

dev.off() # shut off the graphic device

More information on data imports and exports can be found at http://cran.r-project.org/doc/manuals/r-release/R-data.html, such as how to import datasets from statistical software packages including Minitab, SAS, and SPSS.

 

3.1.3 Attribute and Data Types

In the earlier example, the sales variable contained a record for each customer. Several characteristics, such as total annual sales, number of orders, and gender, were provided for each customer. In general, these characteristics or attributes provide the qualitative and quantitative measures for each item or subject of interest. Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR) [8]. Table 3.2 distinguishes these four attribute types and shows the operations they support. Nominal and ordinal attributes are considered categorical attributes, whereas interval and ratio attributes are considered numeric attributes.

Table 3.2 NOIR Attribute Types

 


Data of one attribute type may be converted to another. For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as {Infant, Adolescent, Adult, Senior}. Understanding the attribute types in a given dataset is important to ensure that the appropriate descriptive statistics and analytic methods are applied and properly interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes are not very meaningful or appropriate. Proper handling of categorical variables will be addressed in subsequent chapters. Also, it is useful to consider these attribute types during the following discussion on R data types.
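Picking up the Age example above, a ratio attribute can be converted into an ordered factor with the cut() function; the ages, breakpoints, and labels in the following sketch are purely illustrative.

age <- c(1, 15, 35, 70)  # hypothetical ages
age_group <- cut(age, breaks=c(0, 2, 17, 64, Inf),
                 labels=c("Infant", "Adolescent", "Adult", "Senior"),
                 ordered_result=TRUE)
age_group  # returns Infant Adolescent Adult Senior
           # Levels: Infant < Adolescent < Adult < Senior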

Numeric, Character, and Logical Data Types

Like other programming languages, R supports numeric, character, and logical (Boolean) values. Examples of such variables are given in the following code.

i <- 1 # create a numeric variable

sport <- "football" # create a character variable

flag <- TRUE # create a logical variable

R provides several functions, such as class() and typeof(), to examine the characteristics of a given variable. The class() function represents the abstract class of an object. The typeof() function determines the way an object is stored in memory. Although i appears to be an integer, i is internally stored using double precision. To improve the readability of the code segments in this section, the inline R comments are used to explain the code or to provide the returned values.

class(i) # returns "numeric"

typeof(i) # returns "double"

class(sport) # returns "character"

typeof(sport) # returns "character"

class(flag) # returns "logical"

typeof(flag) # returns "logical"

Additional R functions exist that can test the variables and coerce a variable into a specific type. The following R code illustrates how to test if i is an integer using the is.integer() function and to coerce i into a new integer variable, j, using the as.integer() function. Similar functions can be applied for double, character, and logical types.

is.integer(i) # returns FALSE

j <- as.integer(i) # coerces contents of i into an integer

is.integer(j) # returns TRUE
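A brief sketch of the analogous coercions for character, numeric, and logical types follows.

k <- as.character(i) # coerces the numeric 1 into the character "1"
is.character(k)      # returns TRUE
as.numeric("3.14")   # returns the double 3.14
as.logical(0)        # returns FALSE (any nonzero number coerces to TRUE)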

The application of the length() function reveals that the created variables each have a length of 1. One might have expected the returned length of sport to have been 8 for each of the characters in the string “football”. However, these three variables are actually one-element vectors.

length(i) # returns 1

length(flag) # returns 1

length(sport) # returns 1 (not 8 for “football”)
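To count the characters within the string itself, the nchar() function can be used, as this one-line sketch shows.

nchar(sport) # returns 8, the number of characters in "football"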

Vectors

Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors. A vector can only consist of values of the same class. The tests for vectors can be conducted using the is.vector() function.

is.vector(i) # returns TRUE

is.vector(flag) # returns TRUE

is.vector(sport) # returns TRUE

R provides functionality that enables the easy creation and manipulation of vectors. The following R code illustrates how a vector can be created using the combine function, c(), or the colon operator, :, which builds a vector from a sequence of integers such as 1 to 5. Furthermore, the code shows how the values of an existing vector can be easily modified or accessed. The code related to the z vector indicates how logical comparisons can be built to extract certain elements of a given vector.

u <- c("red", "yellow", "blue") # create a vector "red" "yellow" "blue"

u # returns "red" "yellow" "blue"

u[1] # returns "red" (1st element in u)

v <- 1:5 # create a vector 1 2 3 4 5

v # returns 1 2 3 4 5

sum(v) # returns 15

w <- v * 2 # create a vector 2 4 6 8 10

w # returns 2 4 6 8 10

w[3] # returns 6 (the 3rd element of w)

z <- v + w # sums the two vectors element by element

z # returns 3 6 9 12 15

z[z > 8] # returns 9 12 15

z[z > 8 | z < 5] # returns 3 9 12 15 ("|" denotes "or")

Sometimes it is necessary to initialize a vector of a specific length and then populate the content of the vector later. The vector() function, by default, creates a logical vector. A vector of a different type can be specified by using the mode parameter. The vector c, an integer vector of length 0, may be useful when the number of elements is not initially known and the new elements will later be added to the end of the vector as the values become available.

a <- vector(length=3) # create a logical vector of length 3

a # returns FALSE FALSE FALSE

b <- vector(mode="numeric", 3) # create a numeric vector of length 3

typeof(b) # returns “double”

b[2] <- 3.1 # assign 3.1 to the 2nd element

b # returns 0.0 3.1 0.0

c <- vector(mode="integer", 0) # create an integer vector of length 0

c # returns integer(0)

length(c) # returns 0
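Continuing the example, the following minimal sketch appends values to the empty integer vector c as they become available.

c <- c(c, 1L) # c now contains 1
c <- c(c, 5L) # c now contains 1 5
length(c)     # returns 2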

Although vectors may appear to be analogous to arrays of one dimension, they are technically dimensionless, as seen in the following R code. The concept of arrays and matrices is addressed in the following discussion.

length(b) # returns 3

dim(b) # returns NULL (an undefined value)

Arrays and Matrices

The array() function can be used to restructure a vector as an array. For example, the following R code builds a three-dimensional array to hold the quarterly sales for three regions over a two-year period and then assigns the sales amount of $158,000 to the second region for the first quarter of the first year.

# the dimensions are 3 regions, 4 quarters, and 2 years

quarterly_sales <- array(0, dim=c(3,4,2))

quarterly_sales[2,1,1] <- 158000

quarterly_sales

, , 1

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 158000 0 0 0

[3,] 0 0 0 0

, , 2

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 0 0 0 0

[3,] 0 0 0 0

A two-dimensional array is known as a matrix. The following code initializes a matrix to hold the quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns, respectively, for the sales_matrix.

sales_matrix <- matrix(0, nrow = 3, ncol = 4)

sales_matrix

[,1] [,2] [,3] [,4]

[1,] 0 0 0 0

[2,] 0 0 0 0

[3,] 0 0 0 0

R provides the standard matrix operations such as addition, subtraction, and multiplication, as well as the transpose function t() and the inverse matrix function matrix.inverse() included in the matrixcalc package. The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to obtain the identity matrix.

library(matrixcalc)

M <- matrix(c(1,3,3,5,0,4,3,3,3),nrow = 3,ncol = 3) # build a 3x3 matrix

M %*% matrix.inverse(M) # multiply M by inverse(M)

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 1 0

[3,] 0 0 1
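The other standard operations mentioned above can be sketched briefly with the same matrix M; note that * operates element by element, whereas %*% performs matrix multiplication.

t(M)  # transpose of M
M + M # element-wise addition
M - M # element-wise subtraction (a matrix of zeros)
M * M # element-wise multiplication (not matrix multiplication)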

 

Data Frames

Similar to the concept of matrices, data frames provide a structure for storing and accessing several variables of possibly different data types. In fact, as the is.data.frame() function indicates, a data frame was created by the read.csv() function at the beginning of the chapter.

#import a CSV file of the total annual sales for each customer

sales <- read.csv("c:/data/yearly_sales.csv")

is.data.frame(sales) # returns TRUE

As seen earlier, the variables stored in the data frame can be easily accessed using the $ notation. The following R code illustrates that in this example, each variable is a vector with the exception of gender, which was, by a read.csv() default, imported as a factor. Discussed in detail later in this section, a factor denotes a categorical variable, typically with a few finite levels such as “F” and “M” in the case of gender.

length(sales$num_of_orders) # returns 10000 (number of customers)

is.vector(sales$cust_id) # returns TRUE

is.vector(sales$sales_total) # returns TRUE

is.vector(sales$num_of_orders) # returns TRUE

is.vector(sales$gender) # returns FALSE

is.factor(sales$gender) # returns TRUE

Because of their flexibility to handle many data types, data frames are the preferred input format for many of the modeling functions available in R. The following use of the str() function provides the structure of the sales data frame. This function identifies the integer and numeric (double) data types, the factor variables and levels, as well as the first few values for each variable.

str(sales) # display structure of the data frame object

‘data.frame’: 10000 obs. of 4 variables:

$ cust_id : int 100001 100002 100003 100004 100005 100006 …

$ sales_total : num 800.6 217.5 74.6 498.6 723.1 …

$ num_of_orders: int 3 3 2 3 4 2 2 2 2 2 …

$ gender : Factor w/ 2 levels “F”,“M”: 1 1 2 2 1 1 2 2 1 2

In the simplest sense, data frames are lists of variables of the same length. A subset of the data frame can be retrieved through subsetting operators. R’s subsetting operators are powerful in that they allow one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.

# extract the fourth column of the sales data frame

sales[,4]

# extract the gender column of the sales data frame

sales$gender

# retrieve the first two rows of the data frame

sales[1:2,]

# retrieve the first, third, and fourth columns

sales[,c(1,3,4)]

# retrieve both the cust_id and the sales_total columns

sales[,c("cust_id", "sales_total")]

# retrieve all the records whose gender is female

sales[sales$gender=="F",]

The following R code shows that the class of the sales variable is a data frame. However, the type of the sales variable is a list. A list is a collection of objects that can be of various types, including other lists.

class(sales)

“data.frame”

typeof(sales)

“list”

 

Lists

Lists can contain any type of objects, including other lists. Using the vector v and the matrix M created in earlier examples, the following R code creates assortment, a list of different object types.

# build an assorted list of a string, a numeric, a list, a vector,

# and a matrix

housing <- list("own", "rent")

assortment <- list("football", 7.5, housing, v, M)

assortment

[[1]]

[1] “football”

[[2]]

[1] 7.5

[[3]]

[[3]][[1]]

[1] “own”

[[3]][[2]]

[1] “rent”

[[4]]

[1] 1 2 3 4 5

[[5]]

[,1] [,2] [,3]

[1,] 1 5 3

[2,] 3 0 3

[3,] 3 4 3

In displaying the contents of assortment, the use of the double brackets, [[]], is of particular importance. As the following R code illustrates, the use of the single set of brackets only accesses an item in the list, not its content.

# examine the fifth object, M, in the list

class(assortment[5]) # returns “list”

length(assortment[5]) # returns 1

class(assortment[[5]]) # returns “matrix”

length(assortment[[5]]) # returns 9 (for the 3x3 matrix)

 

As presented earlier in the data frame discussion, the str() function offers details about the structure of a list.

str(assortment)

List of 5

$ : chr “football”

$ : num 7.5

$ :List of 2

..$ : chr “own”

..$ : chr “rent”

$ : int [1:5] 1 2 3 4 5

$ : num [1:3, 1:3] 1 3 3 5 0 4 3 3 3

Factors

Factors were briefly introduced during the discussion of the gender variable in the data frame sales. In this case, gender could assume one of two levels: F or M. Factors can be ordered or not ordered. In the case of gender, the levels are not ordered.

class(sales$gender) # returns “factor”

is.ordered(sales$gender) # returns FALSE

Included with the ggplot2 package, the diamonds data frame contains three ordered factors. Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium, and Ideal. Thus, sales$gender contains nominal data, and diamonds$cut contains ordinal data.

head(sales$gender) # display first six values and the levels

F F M M F F

Levels: F M

library(ggplot2)

data(diamonds) # load the data frame into the R workspace

str(diamonds)

‘data.frame’: 53940 obs. of 10 variables:

$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 …

$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 ...

$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 ...

$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 ...

$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 …

$ table : num 55 61 65 58 58 57 57 55 61 61 …

$ price : int 326 326 327 334 335 336 336 337 337 338 …

$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 …

$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 …

$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 …

head(diamonds$cut) # display first six values and the levels

Ideal Premium Good Premium Good Very Good

Levels: Fair < Good < Very Good < Premium < Ideal

Suppose it is decided to categorize sales$sales_total into three groups—small, medium, and big—according to the sales amount, using the following code. These groupings are the basis for the new ordinal factor, spender, with levels {small, medium, big}.

# build an empty character vector of the same length as sales

sales_group <- vector(mode="character",
                      length=length(sales$sales_total))

# group the customers according to the sales amount

sales_group[sales$sales_total<100] <- "small"

sales_group[sales$sales_total>=100 & sales$sales_total<500] <- "medium"

sales_group[sales$sales_total>=500] <- "big"

# create and add the ordered factor to the sales data frame

spender <- factor(sales_group, levels=c("small", "medium", "big"),
                  ordered = TRUE)

sales <- cbind(sales,spender)

str(sales$spender)

Ord.factor w/ 3 levels "small"<"medium"<"big": 3 2 1 2 3 1 1 1 2 1 ...

head(sales$spender)

big medium small medium big small

Levels: small < medium < big

The cbind() function is used to combine variables column-wise. The rbind() function is used to combine datasets row-wise. The use of factors is important in several R statistical modeling functions, such as analysis of variance, aov(), presented later in this chapter, and the use of contingency tables, discussed next.
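As a minimal sketch of rbind(), the following code splits the sales data frame by gender and stacks the two pieces back together; the intermediate variable names are illustrative.

sales_f <- sales[sales$gender=="F",] # female customers
sales_m <- sales[sales$gender=="M",] # male customers
sales_all <- rbind(sales_f, sales_m) # combine the two data frames row-wise
nrow(sales_all)                      # returns 10000, the original number of records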

Contingency Tables

In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset. Such a table is commonly referred to as a contingency table and is the basis for performing a statistical test on the independence of the factors used to build the table. The following R code builds a contingency table based on the sales$gender and sales$spender factors.

# build a contingency table based on the gender and spender factors

sales_table <- table(sales$gender,sales$spender)

sales_table

small medium big

F 1726 2746 563

M 1656 2723 586

class(sales_table) # returns “table”

typeof(sales_table) # returns “integer”

dim(sales_table) # returns 2 3

# performs a chi-squared test

summary(sales_table)

Number of cases in table: 10000

Number of factors: 2

Test for independence of all factors:

Chisq = 1.516, df = 2, p-value = 0.4686

Based on the observed counts in the table, the summary() function performs a chi-squared test on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed independence of the two factors is not rejected. Hypothesis testing and p-values are covered in more detail later in this chapter. Next, applying descriptive statistics in R is examined.
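Before moving on, note that base R also provides the chisq.test() function, which can be applied directly to the contingency table and reports an equivalent test of independence.

chisq.test(sales_table) # reports the chi-squared statistic, df, and p-value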

 

3.1.4 Descriptive Statistics

It has already been shown that the summary() function provides several descriptive statistics, such as the mean and median, for each variable in a data frame such as sales. The results now also include the counts for the three levels of the spender variable created in the earlier examples involving factors.

summary(sales)

cust_id sales_total num_of_orders gender spender

Min. :100001 Min. : 30.02 Min. : 1.000 F:5035 small :3382

1st Qu.:102501 1st Qu.: 80.29 1st Qu.: 2.000 M:4965 medium:5469

Median :105001 Median : 151.65 Median : 2.000 big :1149

Mean :105001 Mean : 249.46 Mean : 2.428

3rd Qu.:107500 3rd Qu.: 295.50 3rd Qu.: 3.000

Max. :110000 Max. :7606.09 Max. :22.000

The following code provides some common R functions that compute descriptive statistics. The inline comments give the values returned for the sales data, with the name of each statistic in parentheses.

# to simplify the function calls, assign shorter names to the variables

x <- sales$sales_total

y <- sales$num_of_orders

cor(x,y) # returns 0.7508015 (correlation)

cov(x,y) # returns 345.2111 (covariance)

IQR(x) # returns 215.21 (interquartile range)

mean(x) # returns 249.4557 (mean)

median(x) # returns 151.65 (median)

range(x) # returns 30.02 7606.09 (min max)

sd(x) # returns 319.0508 (std. dev.)

var(x) # returns 101793.4 (variance)

The IQR() function provides the difference between the third and the first quartiles. The other functions are fairly self-explanatory by their names. The reader is encouraged to review the available help files for acceptable inputs and possible options.
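As a quick sketch, the interquartile range can also be obtained from the quantile() function, which returns the requested quantiles of a variable.

quantile(x, probs=c(0.25, 0.75))       # returns the 1st and 3rd quartiles of sales_total
diff(quantile(x, probs=c(0.25, 0.75))) # returns 215.21, matching IQR(x)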

The function apply() is useful when the same function is to be applied to several variables in a data frame. For example, the following R code calculates the standard deviation for the first three variables in sales. In the code, setting MARGIN=2 specifies that the sd() function is applied over the columns. Other functions, such as lapply() and sapply(), apply a function to a list or vector. Readers can refer to the R help files to learn how to use these functions.

apply(sales[,c(1:3)], MARGIN=2, FUN=sd)

cust_id sales_total num_of_orders

2886.895680 319.050782 1.441119
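For comparison, the following brief sketch uses sapply(), which applies a function over the elements of a list or vector and simplifies the result where possible.

sapply(sales[,c(1:3)], FUN=sd) # same standard deviations as apply() above
sapply(v, function(x) x^2)     # returns 1 4 9 16 25 for the vector v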

Additional descriptive statistics can be applied with user-defined functions. The following R code defines a function, my_range(), to compute the difference between the maximum and minimum values returned by the range() function. In general, user-defined functions are useful for any task or operation that needs to be frequently repeated. More information on user-defined functions is available by entering help("function") in the console.

# build a function to provide the difference between

# the maximum and the minimum values

my_range <- function(v) {range(v)[2] - range(v)[1]}

my_range(x)

7576.07

 

Tuesday, February 22, 2022

Data Science and Big Data Analytics

 

Discovering, Analyzing, Visualizing and Presenting Data

 

Data Analytics Lifecycle

2.4 Phase 3: Model Planning

In Phase 3, the data science team identifies candidate models to apply to the data for clustering, classifying, or finding relationships in the data depending on the goal of the project, as shown in Figure 2.5. It is during this phase that the team refers to the hypotheses developed in Phase 1, when the team first became acquainted with the data and began developing an understanding of the business problem and domain area. These hypotheses help the team frame the analytics to execute in Phase 4 and select the right methods to achieve its objectives.

 


Figure 2.5 Model planning phase

Some of the activities to consider in this phase include the following:

• Assess the structure of the datasets. The structure of the datasets is one factor that dictates the tools and analytical techniques for the next phase. Depending on whether the team plans to analyze textual data or transactional data, for example, different tools and approaches are required.

• Determine if the situation warrants a single model or a series of techniques as part of a larger analytic workflow. A few example models include association rules (Chapter 5, “Advanced Analytical Theory and Methods: Association Rules”) and logistic regression (Chapter 6, “Advanced Analytical Theory and Methods: Regression”). Other tools, such as Alpine Miner, enable users to set up a series of steps and analyses and can serve as a front-end user interface (UI) for manipulating Big Data sources in PostgreSQL.

In addition to the considerations just listed, it is useful to research and understand how other analysts generally approach a specific kind of problem. Given the kind of data and resources that are available, evaluate whether similar, existing approaches will work or if the team will need to create something new. Many times teams can get ideas from analogous problems that other people have solved in different industry verticals or domain areas. Table 2.2 summarizes the results of one such exercise: after researching churn models in multiple industry verticals, the team compiled the domain areas and the types of models previously used for this kind of classification problem. Performing this sort of diligence gives the team ideas of how others have solved similar problems and presents the team with a list of candidate models to try as part of the model planning phase.

Table 2.2 Research on Model Planning in Industry Verticals

 


2.4.1 Data Exploration and Variable Selection

Although some data exploration takes place in the data preparation phase, those activities focus mainly on data hygiene and on assessing the quality of the data itself. In Phase 3, the objective of the data exploration is to understand the relationships among the variables to inform selection of the variables and methods and to understand the problem domain. As with earlier phases of the Data Analytics Lifecycle, it is important to spend time and focus attention on this preparatory work to make the subsequent phases of model selection and execution easier and more efficient. A common way to conduct this step involves using tools to perform data visualizations. Approaching the data exploration in this way aids the team in previewing the data and assessing relationships between variables at a high level.

In many cases, stakeholders and subject matter experts have instincts and hunches about what the data science team should be considering and analyzing. Likely, this group had some hypothesis that led to the genesis of the project. Often, stakeholders have a good grasp of the problem and domain, although they may not be aware of the subtleties within the data or the model needed to accept or reject a hypothesis. Other times, stakeholders may be correct, but for the wrong reasons (for instance, they may be correct about a correlation that exists but infer an incorrect reason for the correlation). Meanwhile, data scientists have to approach problems with an unbiased mind-set and be ready to question all assumptions.

As the team begins to question the incoming assumptions and test initial ideas of the project sponsors and stakeholders, it needs to consider the inputs and data that will be needed, and then it must examine whether these inputs are actually correlated with the outcomes that the team plans to predict or analyze. Some methods and types of models will handle correlated variables better than others. Depending on what the team is attempting to solve, it may need to consider an alternate method, reduce the number of data inputs, or transform the inputs to allow the team to use the best method for a given business problem. Some of these techniques will be explored further in Chapter 3 and Chapter 6.

The key to this approach is to aim for capturing the most essential predictors and variables rather than considering every possible variable that people think may influence the outcome. Approaching the problem in this manner requires iterations and testing to identify the most essential variables for the intended analyses. The team should plan to test a range of variables to include in the model and then focus on the most important and influential variables.

If the team plans to run regression analyses, identify the candidate predictors and outcome variables of the model. Plan to create variables that determine outcomes but demonstrate a strong relationship to the outcome rather than to the other input variables. This includes remaining vigilant for problems such as serial correlation, multicollinearity, and other typical data modeling challenges that interfere with the validity of these models. Sometimes these issues can be avoided simply by looking at ways to reframe a given problem. In addition, sometimes determining correlation is all that is needed (“black box prediction”), and in other cases, the objective of the project is to understand the causal relationship better. In the latter case, the team wants the model to have explanatory power and needs to forecast or stress test the model under a variety of situations and with different datasets.

 

2.4.2 Model Selection

In the model selection subphase, the team’s main goal is to choose an analytical technique, or a short list of candidate techniques, based on the end goal of the project. For the context of this book, a model is discussed in general terms. In this case, a model simply refers to an abstraction from reality. One observes events happening in a real-world situation or with live data and attempts to construct models that emulate this behavior with a set of rules and conditions. In the case of machine learning and data mining, these rules and conditions are grouped into several general sets of techniques, such as classification, association rules, and clustering. When reviewing this list of types of potential models, the team can winnow down the list to several viable models to try to address a given problem. More details on matching the right models to common types of business problems are provided in Chapter 3 and Chapter 4, “Advanced Analytical Theory and Methods: Clustering.”

An additional consideration in this area for dealing with Big Data involves determining if the team will be using techniques that are best suited for structured data, unstructured data, or a hybrid approach. For instance, the team can leverage MapReduce to analyze unstructured data, as highlighted in Chapter 10. Lastly, the team should take care to identify and document the modeling assumptions it is making as it chooses and constructs preliminary models.

Typically, teams create the initial models using a statistical software package such as R, SAS, or Matlab. Although these tools are designed for data mining and machine learning algorithms, they may have limitations when applying the models to very large datasets, as is common with Big Data. As such, the team may consider redesigning these algorithms to run in the database itself during the pilot phase mentioned in Phase 6.

The team can move to the model building phase once it has a good idea about the type of model to try and the team has gained enough knowledge to refine the analytics plan. Advancing from this phase requires a general methodology for the analytical model, a solid understanding of the variables and techniques to use, and a description or diagram of the analytic workflow.

 

2.4.3 Common Tools for the Model Planning Phase

Many tools are available to assist in this phase. Here are several of the more common ones:

• R [14] has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. In addition, it has the ability to interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open source connection. These two factors make R well suited to performing statistical tests and analytics on Big Data. As of this writing, R contains nearly 5,000 packages for data analysis and graphical representation. New packages are posted frequently, and many companies are providing value-add services for R (such as training, instruction, and best practices), as well as packaging it in ways to make it easier to use and more robust. This phenomenon is similar to what happened with Linux in the early 1990s, when companies emerged to package Linux and make it easier for organizations to consume and deploy. Use R with file extracts for offline analysis and optimal performance, and use RODBC connections for dynamic queries and faster development.

• SQL Analysis Services [15] can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.

• SAS/ACCESS [16] provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata) and data warehouse appliances (such as Greenplum or Aster), files, and enterprise applications (such as SAP and Salesforce.com).

 

2.5 Phase 4: Model Building

In Phase 4, the data science team needs to develop datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it (“training data”), while holding aside some of the data (“hold-out data” or “test data”) for testing the model. (These topics are addressed in more detail in Chapter 3.) During this process, it is critical to ensure that the training and test datasets are sufficiently robust for the model and analytical techniques. A simple way to think of these datasets is to view the training dataset for conducting the initial experiments and the test sets for validating an approach once the initial experiments and models have been run.

In the model building phase, shown in Figure 2.6, an analytical model is developed and fit on the training data and evaluated (scored) against the test data. The phases of model planning and model building can overlap quite a bit, and in practice one can iterate back and forth between the two phases for a while before settling on a final model.

 


Figure 2.6 Model building phase

Although the modeling techniques and logic required to develop models can be highly complex, the actual duration of this phase can be short compared to the time spent preparing the data and defining the approaches. In general, plan to spend more time preparing and learning the data (Phases 1–2) and crafting a presentation of the findings (Phase 5). Phases 3 and 4 tend to move more quickly, although they are more complex from a conceptual standpoint.

As part of this phase, the data science team needs to execute the models defined in Phase 3.

During this phase, users run models from analytical software packages, such as R or SAS, on file extracts and small datasets for testing purposes. On a small scale, assess the validity of the model and its results. For instance, determine if the model accounts for most of the data and has robust predictive power. At this point, refine the models to optimize the results, such as by modifying variable inputs or reducing correlated variables where appropriate. In Phase 3, the team may have had some knowledge of correlated variables or problematic data attributes, which will be confirmed or denied once the models are actually executed. When immersed in the details of constructing models and transforming data, many small decisions are often made about the data and the approach for the modeling. These details can be easily forgotten once the project is completed. Therefore, it is vital to record the results and logic of the model during this phase. In addition, one must take care to record any operating assumptions that were made in the modeling process regarding the data or the context.

Creating robust models that are suitable to a specific situation requires thoughtful consideration to ensure the models being developed ultimately meet the objectives outlined in Phase 1. Questions to consider include these:

• Does the model appear valid and accurate on the test data?

• Does the model output/behavior make sense to the domain experts? That is, does it appear as if the model is giving answers that make sense in this context?

• Do the parameter values of the fitted model make sense in the context of the domain?

• Is the model sufficiently accurate to meet the goal?

• Does the model avoid intolerable mistakes? Depending on context, false positives may be more serious or less serious than false negatives, for instance. (False positives and false negatives are discussed further in Chapter 3 and Chapter 7, “Advanced Analytical Theory and Methods: Classification.”)

• Are more data or more inputs needed? Do any of the inputs need to be transformed or eliminated?

• Will the kind of model chosen support the runtime requirements?

• Is a different form of the model required to address the business problem? If so, go back to the model planning phase and revise the modeling approach.

Once the data science team can evaluate either if the model is sufficiently robust to solve the problem or if the team has failed, it can move to the next phase in the Data Analytics Lifecycle.

 

2.5.1 Common Tools for the Model Building Phase

There are many tools available to assist in this phase, focused primarily on statistical analysis or data mining software. Common tools in this space include, but are not limited to, the following:

• Commercial Tools:

◦ SAS Enterprise Miner [17] allows users to run predictive and descriptive models based on large volumes of data from across the enterprise. It interoperates with other large data stores, has many partnerships, and is built for enterprise-level computing and analytics.

◦ SPSS Modeler [18] (provided by IBM and now called IBM SPSS Modeler) offers methods to explore and analyze data through a GUI.

◦ Matlab [19] provides a high-level language for performing a variety of data analytics, algorithms, and data exploration.

◦ Alpine Miner [11] provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.

◦ STATISTICA [20] and Mathematica [21] are also popular and well-regarded data mining and analytics tools.

• Free or Open Source tools:

◦ R and PL/R [14]: R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database. This technique provides higher performance and is more scalable than running R in memory.

◦ Octave [22], a free software programming language for computational modeling, has some of the functionality of Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.

◦ WEKA [23] is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.

◦ Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, pandas, and related data visualization using matplotlib.

◦ SQL in-database implementations, such as MADlib [24], provide an alternative to in-memory desktop analytical tools. MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.

 

2.6 Phase 5: Communicate Results

After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure. In Phase 5, shown in Figure 2.7, the team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats, assumptions, and any limitations of the results. Because the presentation is often circulated within an organization, it is critical to articulate the results properly and position the findings in a way that is appropriate for the audience.

 



Figure 2.7 Communicate results phase

As part of Phase 5, the team needs to determine if it succeeded or failed in its objectives. Many times people do not want to admit to failing, but in this instance failure should not be considered as a true failure, but rather as a failure of the data to accept or reject a given hypothesis adequately. This concept can be counterintuitive for those who have been told their whole careers not to fail. However, the key is to remember that the team must be rigorous enough with the data to determine whether it will prove or disprove the hypotheses outlined in Phase 1 (discovery). Sometimes teams have only done a superficial analysis, which is not robust enough to accept or reject a hypothesis. Other times, teams perform very robust analysis and are searching for ways to show results, even when results may not be there. It is important to strike a balance between these two extremes when it comes to analyzing data and being pragmatic in terms of showing real-world results.

When conducting this assessment, determine if the results are statistically significant and valid. If they are, identify the aspects of the results that stand out and may provide salient findings when it comes time to communicate them. If the results are not valid, think about adjustments that can be made to refine and iterate on the model to make it valid. During this step, assess the results and identify which data points may have been surprising and which were in line with the hypotheses that were developed in Phase 1. Comparing the actual results to the ideas formulated early on produces additional ideas and insights that would have been missed if the team had not taken time to formulate initial hypotheses early in the process.

By this time, the team should have determined which model or models address the analytical challenge in the most appropriate way. In addition, the team should have ideas of some of the findings as a result of the project. The best practice in this phase is to record all the findings and then select the three most significant ones that can be shared with the stakeholders. In addition, the team needs to reflect on the implications of these findings and measure the business value. Depending on what emerged as a result of the model, the team may need to spend time quantifying the business impact of the results to help prepare for the presentation and demonstrate the value of the findings. Doug Hubbard’s work [6] offers insights on how to assess intangibles in business and quantify the value of seemingly unmeasurable things.

Now that the team has run the model, completed a thorough discovery phase, and learned a great deal about the datasets, reflect on the project and consider what obstacles were in the project and what can be improved in the future. Make recommendations for future work or improvements to existing processes, and consider what each of the team members and stakeholders needs to fulfill her responsibilities. For instance, sponsors must champion the project. Stakeholders must understand how the model affects their processes. (For example, if the team has created a model to predict customer churn, the Marketing team must understand how to use the churn model predictions in planning their interventions.) Production engineers need to operationalize the work that has been done. In addition, this is the phase to underscore the business benefits of the work and begin making the case to implement the logic into a live production environment.

As a result of this phase, the team will have documented the key findings and major insights derived from the analysis. The deliverable of this phase will be the most visible portion of the process to the outside stakeholders and sponsors, so take care to clearly articulate the results, methodology, and business value of the findings. More details will be provided about data visualization tools and references in Chapter 12, “The Endgame, or Putting It All Together.”

 

2.7 Phase 6: Operationalize

In the final phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users. In Phase 4, the team scored the model in the analytics sandbox. Phase 6, shown in Figure 2.8, represents the first time that most analytics teams approach deploying the new analytical methods or models in a production environment. Rather than deploying these models immediately on a wide-scale basis, the risk can be managed more effectively and the team can learn by undertaking a small scope, pilot deployment before a wide-scale rollout. This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment. During the pilot project, the team may need to consider executing the algorithm in the database rather than with in-memory tools such as R because the run time is significantly faster and more efficient than running in-memory, especially on larger datasets.

 


Figure 2.8 Model operationalize phase

While scoping the effort involved in conducting a pilot project, consider running the model in a production environment for a discrete set of products or a single line of business, which tests the model in a live setting. This allows the team to learn from the deployment and make any needed adjustments before launching the model across the enterprise. Be aware that this phase can bring in a new set of team members—usually the engineers responsible for the production environment who have a new set of issues and concerns beyond those of the core project team. This technical group needs to ensure that running the model fits smoothly into the production environment and that the model can be integrated into related business processes.

Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of model accuracy and, if accuracy degrades, finding ways to retrain the model. If feasible, design alerts for when the model is operating “out-of-bounds.” This includes situations when the inputs are beyond the range that the model was trained on, which may cause the outputs of the model to be inaccurate or invalid. If this begins to happen regularly, the model needs to be retrained on new data.

Often, analytical projects yield new insights about a business, a problem, or an idea that people may have taken at face value or thought was impossible to explore. Four main deliverables can be created to meet the needs of most stakeholders. This approach for developing the four deliverables is discussed in greater detail in Chapter 12.

Figure 2.9 portrays the key outputs for each of the main stakeholders of an analytics project and what they usually expect at the conclusion of a project.

• Business User typically tries to determine the benefits and implications of the findings to the business.

• Project Sponsor typically asks questions related to the business impact of the project, the risks and return on investment (ROI), and the way the project can be evangelized within the organization (and beyond).

• Project Manager needs to determine if the project was completed on time and within budget and how well the goals were met.

• Business Intelligence Analyst needs to know if the reports and dashboards he manages will be impacted and need to change.

• Data Engineer and Database Administrator (DBA) typically need to share their code from the analytics project and create a technical document on how to implement it.

• Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.

 



Figure 2.9 Key outputs from a successful analytics project

Although these seven roles represent many interests within a project, these interests usually overlap, and most of them can be met with four main deliverables.

• Presentation for project sponsors: This contains high-level takeaways for executive level stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.

• Presentation for analysts, which describes business process changes and reporting changes. Fellow data scientists will want the details and are comfortable with technical graphs (such as Receiver Operating Characteristic [ROC] curves, density plots, and histograms shown in Chapter 3 and Chapter 7).

• Code for technical people.

• Technical specifications of implementing the code.

As a general rule, the more executive the audience, the more succinct the presentation needs to be. Most executive sponsors attend many briefings in the course of a day or a week. Ensure that the presentation gets to the point quickly and frames the results in terms of value to the sponsor’s organization. For instance, if the team is working with a bank to analyze cases of credit card fraud, highlight the frequency of fraud, the number of cases in the past month or year, and the cost or revenue impact to the bank (or, conversely, how much more revenue the bank could gain if it addresses the fraud problem). This demonstrates the business impact better than deep dives into the methodology. The presentation needs to include supporting information about analytical methodology and data sources, but generally only as supporting detail or to ensure the audience has confidence in the approach that was taken to analyze the data.

When presenting to other audiences with more quantitative backgrounds, focus more time on the methodology and findings. In these instances, the team can be more expansive in describing the outcomes, methodology, and analytical experiment with a peer group. This audience will be more interested in the techniques, especially if the team developed a new way of processing or analyzing data that can be reused in the future or applied to similar problems. In addition, use imagery or data visualization when possible. Although it may take more time to develop imagery, people tend to remember mental pictures that demonstrate a point better than long lists of bullets [25]. Data visualization and presentations are discussed further in Chapter 12.

 

2.8 Case Study: Global Innovation Network and Analysis (GINA)

EMC’s Global Innovation Network and Analytics (GINA) team is a group of senior technologists located in centers of excellence (COEs) around the world. This team’s charter is to engage employees across global COEs to drive innovation, research, and university partnerships. In 2012, a newly hired director wanted to improve these activities and provide a mechanism to track and analyze the related information. In addition, this team wanted to create more robust mechanisms for capturing the results of its informal conversations with other thought leaders within EMC, in academia, or in other organizations, which could later be mined for insights.

The GINA team thought its approach would provide a means to share ideas globally and increase knowledge sharing among GINA members who may be separated geographically. It planned to create a data repository containing both structured and unstructured data to accomplish three main goals.

• Store formal and informal data.

• Track research from global technologists.

• Mine the data for patterns and insights to improve the team’s operations and strategy.

The GINA case study provides an example of how a team applied the Data Analytics Lifecycle to analyze innovation data at EMC. Innovation is typically a difficult concept to measure, and this team wanted to look for ways to use advanced analytical methods to identify key innovators within the company.

 

2.8.1 Phase 1: Discovery

In the GINA project’s discovery phase, the team began identifying data sources. Although GINA was a group of technologists skilled in many different aspects of engineering, it had some data and ideas about what it wanted to explore but lacked a formal team that could perform these analytics. After consulting with various experts including Tom Davenport, a noted expert in analytics at Babson College, and Peter Gloor, an expert in collective intelligence and creator of CoIN (Collaborative Innovation Networks) at MIT, the team decided to crowdsource the work by seeking volunteers within EMC.

Here is a list of how the various roles on the working team were fulfilled.

• Business User, Project Sponsor, Project Manager: Vice President from Office of the CTO

• Business Intelligence Analyst: Representatives from IT

• Data Engineer and Database Administrator (DBA): Representatives from IT

• Data Scientist: Distinguished Engineer, who also developed the social graphs shown in the GINA case study

The project sponsor’s approach was to leverage social media and blogging [26] to accelerate the collection of innovation and research data worldwide and to motivate teams of “volunteer” data scientists at worldwide locations. Given that he lacked a formal team, he needed to be resourceful about finding people who were both capable and willing to volunteer their time to work on interesting problems. Data scientists tend to be passionate about data, and the project sponsor was able to tap into this passion of highly talented people to accomplish challenging work in a creative way.

The data for the project fell into two main categories. The first category represented five years of idea submissions from EMC’s internal innovation contests, known as the Innovation Roadmap (formerly called the Innovation Showcase). The Innovation Roadmap is a formal, organic innovation process whereby employees from around the globe submit ideas that are then vetted and judged. The best ideas are selected for further incubation. As a result, the data is a mix of structured data, such as idea counts, submission dates, inventor names, and unstructured content, such as the textual descriptions of the ideas themselves.

The second category of data encompassed minutes and notes representing innovation and research activity from around the world. This also represented a mix of structured and unstructured data. The structured data included attributes such as dates, names, and geographic locations. The unstructured documents contained the “who, what, when, and where” information that represents rich data about knowledge growth and transfer within the company. This type of information is often stored in business silos that have little to no visibility across disparate research teams.

The 10 main initial hypotheses (IHs) that the GINA team developed were as follows:

• IH1: Innovation activity in different geographic regions can be mapped to corporate strategic directions.

• IH2: The length of time it takes to deliver ideas decreases when global knowledge transfer occurs as part of the idea delivery process.

• IH3: Innovators who participate in global knowledge transfer deliver ideas more quickly than those who do not.

• IH4: An idea submission can be analyzed and evaluated for the likelihood of receiving funding.

• IH5: Knowledge discovery and growth for a particular topic can be measured and compared across geographic regions.

• IH6: Knowledge transfer activity can identify research-specific boundary spanners in disparate regions.

• IH7: Strategic corporate themes can be mapped to geographic regions.

• IH8: Frequent knowledge expansion and transfer events reduce the time it takes to generate a corporate asset from an idea.

• IH9: Lineage maps can reveal when knowledge expansion and transfer did not result (or has not yet resulted) in a corporate asset.

• IH10: Emerging research topics can be classified and mapped to specific ideators, innovators, boundary spanners, and assets.

The GINA IHs can be grouped into two categories:

• Descriptive analytics of what is currently happening to spark further creativity, collaboration, and asset generation

• Predictive analytics to advise executive management of where it should be investing in the future

 

2.8.2 Phase 2: Data Preparation

The team partnered with its IT department to set up a new analytics sandbox to store and experiment on the data. During the data exploration exercise, the data scientists and data engineers began to notice that certain data needed conditioning and normalization. In addition, the team realized that several missing datasets were critical to testing some of the analytic hypotheses.

As the team explored the data, it quickly realized that if it did not have data of sufficient quality or could not get good quality data, it would not be able to perform the subsequent steps in the lifecycle process. As a result, it was important to determine what level of data quality and cleanliness was sufficient for the project being undertaken. In the case of GINA, the team discovered that many of the names of the researchers and people interacting with the universities were misspelled or had leading and trailing spaces in the datastore. Seemingly small problems such as these in the data had to be addressed in this phase to enable better analysis and data aggregation in subsequent phases.
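The kind of conditioning described above takes only a few lines of R. The sketch below is not the GINA team's actual code; it assumes a hypothetical data frame named researchers with a name column.

# a minimal sketch of name conditioning; 'researchers' is a hypothetical data frame
researchers$name <- trimws(researchers$name)              # drop leading/trailing spaces
researchers$name <- gsub("\\s+", " ", researchers$name)   # collapse repeated internal spaces
researchers$name <- tools::toTitleCase(tolower(researchers$name))   # normalize letter case
# list the most frequent names to spot likely misspellings for manual review
head(sort(table(researchers$name), decreasing = TRUE), 20)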

 

2.8.3 Phase 3: Model Planning

In the GINA project, for much of the dataset, it seemed feasible to use social network analysis techniques to look at the networks of innovators within EMC. In other cases, it was difficult to come up with appropriate ways to test hypotheses due to the lack of data. In one case (IH9), the team made a decision to initiate a longitudinal study to begin tracking data points over time regarding people developing new intellectual property. This data collection would enable the team to test the following two ideas in the future:

• IH8: Frequent knowledge expansion and transfer events reduce the amount of time it takes to generate a corporate asset from an idea.

• IH9: Lineage maps can reveal when knowledge expansion and transfer did not result (or has not yet resulted) in a corporate asset.

For the longitudinal study being proposed, the team needed to establish goal criteria for the study. Specifically, it needed to determine the end goal of a successful idea that had traversed the entire journey. The parameters related to the scope of the study included the following considerations:

• Identify the right milestones to achieve this goal.

• Trace how people move ideas from each milestone toward the goal.

• Once this is done, trace ideas that die, and trace others that reach the goal. Compare the journeys of ideas that make it and those that do not.

• Compare the times and the outcomes using a few different methods (depending on how the data is collected and assembled). These could be as simple as t-tests (a minimal sketch follows this list) or perhaps involve different types of classification algorithms.
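For instance, a two-sample t-test could compare delivery times for ideas that did and did not involve global knowledge transfer. In the sketch below, the ideas data frame and its columns are hypothetical placeholders for the data the longitudinal study would collect.

# a minimal sketch; the 'ideas' data frame and its columns are hypothetical
with_transfer    <- ideas$days_to_asset[ideas$global_transfer == TRUE]
without_transfer <- ideas$days_to_asset[ideas$global_transfer == FALSE]
# one-sided two-sample t-test: do ideas with global knowledge transfer
# become corporate assets in fewer days?
t.test(with_transfer, without_transfer, alternative = "less")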

 

2.8.4 Phase 4: Model Building

In Phase 4, the GINA team employed several analytical methods. This included work by the data scientist using Natural Language Processing (NLP) techniques on the textual descriptions of the Innovation Roadmap ideas. In addition, he conducted social network analysis using R and RStudio, and then he developed social graphs and visualizations of the network of communications related to innovation using R’s ggplot2 package. Examples of this work are shown in Figures 2.10 and 2.11.

 



Figure 2.10 Social graph [27] visualization of idea submitters and finalists

 



Figure 2.11 Social graph visualization of top innovation influencers

Figure 2.10 shows social graphs that portray the relationships between idea submitters within GINA. Each color represents an innovator from a different country. The large dots with red circles around them represent hubs. A hub represents a person with high connectivity and a high “betweenness” score. The cluster in Figure 2.11 contains geographic variety, which is critical to prove the hypothesis about geographic boundary spanners. One person in this graph has an unusually high score when compared to the rest of the nodes in the graph. The data scientist identified this person and ran a query against his name within the analytic sandbox. These actions yielded the following information about this research scientist (from the social graph), which illustrated how influential he was within his business unit and across many other areas of the company worldwide:

• In 2011, he attended the ACM SIGMOD conference, which is a top-tier conference on large-scale data management problems and databases.

• He visited employees in France who are part of the business unit for EMC’s content management teams within Documentum (now part of the Information Intelligence Group, or IIG).

• He presented his thoughts on the SIGMOD conference at a virtual brownbag session attended by three employees in Russia, one employee in Cairo, one employee in Ireland, one employee in India, three employees in the United States, and one employee in Israel.

• In 2012, he attended the SDM 2012 conference in California.

• On the same trip, he visited innovators and researchers at EMC’s federated companies, Pivotal and VMware.

• Later on that trip he stood before an internal council of technology leaders and introduced two of his researchers to dozens of corporate innovators and researchers.

This finding suggests that at least part of the initial hypothesis is correct; the data can identify innovators who span different geographies and business units. The team used Tableau software for data visualization and exploration and used the Pivotal Greenplum database as the main data repository and analytics engine.
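Although the team's actual code is not reproduced here, betweenness scores of the kind described for Figure 2.10 can be computed in R with the igraph package. The edge list below is a hypothetical stand-in for the innovation communication data.

# a minimal sketch using igraph; the edge list is hypothetical
library(igraph)
edges <- data.frame(from = c("A", "A", "B", "C", "C", "D"),
                    to   = c("B", "C", "C", "D", "E", "E"))
g <- graph_from_data_frame(edges, directed = FALSE)
# hubs combine high connectivity (degree) with a high betweenness score
data.frame(degree = degree(g), betweenness = betweenness(g))
# scale node size by betweenness so hubs stand out in the plot
plot(g, vertex.size = 10 + 5 * betweenness(g))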

 

2.8.5 Phase 5: Communicate Results

In Phase 5, the team found several ways to cull results of the analysis and identify the most impactful and relevant findings. This project was considered successful in identifying boundary spanners and hidden innovators. As a result, the CTO office launched longitudinal studies to begin data collection efforts and track innovation results over longer periods of time. The GINA project promoted knowledge sharing related to innovation and researchers spanning multiple areas within the company and outside of it. GINA also enabled EMC to cultivate additional intellectual property that led to additional research topics and provided opportunities to forge relationships with universities for joint academic research in the fields of Data Science and Big Data. In addition, the project was accomplished with a limited budget, leveraging a volunteer force of highly skilled and distinguished engineers and data scientists.

One of the key findings from the project is that there was a disproportionately high density of innovators in Cork, Ireland. Each year, EMC hosts an innovation contest, open to employees to submit innovation ideas that would drive new value for the company. When looking at the data in 2011, 15% of the finalists and 15% of the winners were from Ireland. These are unusually high numbers, given the relative size of the Cork COE compared to other larger centers in other parts of the world. After further research, it was learned that the COE in Cork, Ireland had received focused training in innovation from an external consultant, which was proving effective. The Cork COE came up with more innovation ideas, and better ones, than it had in the past, and it was making larger contributions to innovation at EMC. It would have been difficult, if not impossible, to identify this cluster of innovators through traditional methods or even anecdotal, word-of-mouth feedback. Applying social network analysis enabled the team to find a pocket of people within EMC who were making disproportionately strong contributions. These findings were shared internally through presentations and conferences and promoted through social media and blogs.
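The breakdown behind this finding amounts to a simple aggregation. As a minimal sketch, assuming a hypothetical finalists data frame with a location column, the share of finalists by location could be computed as follows.

# a minimal sketch; 'finalists' is a hypothetical data frame of contest finalists
loc_counts <- table(finalists$location)
round(100 * prop.table(loc_counts), 1)   # percentage of finalists per location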

 

2.8.6 Phase 6: Operationalize

Running analytics against a sandbox filled with notes, minutes, and presentations from innovation activities yielded great insights into EMC’s innovation culture. Key findings from the project include these:

• The CTO office and GINA need more data in the future, including a marketing initiative to convince people to inform the global community on their innovation/research activities.

• Some of the data is sensitive, and the team needs to consider security and privacy related to the data, such as who can run the models and see the results.

• In addition to running models, a parallel initiative needs to be created to improve basic Business Intelligence activities, such as dashboards, reporting, and queries on research activities worldwide.

• A mechanism is needed to continually reevaluate the model after deployment. Assessing the benefits is one of the main goals of this stage, as is defining a process to retrain the model as needed.

In addition to the actions and findings listed, the team demonstrated how analytics can drive new insights in projects that are traditionally difficult to measure and quantify. This project informed investment decisions in university research projects by the CTO office and identified hidden, high-value innovators. In addition, the CTO office developed tools to help submitters improve ideas using topic modeling as part of new recommender systems to help idea submitters find similar ideas and refine their proposals for new intellectual property.
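As an illustration of how topic modeling can surface similar idea submissions, the following minimal sketch uses the tm and topicmodels packages on a few hypothetical idea descriptions; it is not the actual recommender built by the CTO office.

# a minimal sketch of topic modeling idea descriptions; the text is hypothetical
library(tm)
library(topicmodels)
ideas_text <- c("distributed storage tiering for long-term archives",
                "predictive caching for virtual desktop workloads",
                "machine learning to optimize archive storage tiering")
corpus <- VCorpus(VectorSource(ideas_text))
dtm <- DocumentTermMatrix(corpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))
lda <- LDA(dtm, k = 2, control = list(seed = 123))   # fit a two-topic model
topics(lda)   # ideas assigned to the same topic are candidates to recommend together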

Table 2.3 outlines an analytics plan for the GINA case study example. Although this project shows only three findings, there were many more. For instance, perhaps the biggest overarching result from this project is that it demonstrated, in a concrete way, that analytics can drive new insights in projects that deal with topics that may seem difficult to measure, such as innovation.

Table 2.3 Analytic Plan from the EMC GINA Project

 


Innovation is an idea that every company wants to promote, but it can be difficult to measure innovation or identify ways to increase innovation. This project explored this issue from the standpoint of evaluating informal social networks to identify boundary spanners and influential people within innovation subnetworks. In essence, this project took a seemingly nebulous problem and applied advanced analytical methods to tease out answers using an objective, fact-based approach.

Another outcome from the project was the identified need to supplement analytics with a separate datastore for Business Intelligence reporting, accessible for searching innovation and research initiatives. Aside from supporting decision making, this will provide a mechanism to keep team members in disparate locations informed about discussions and research happening worldwide. Finally, the project highlighted the value that can be gleaned through data and subsequent analysis, which in turn identified the need for formal marketing programs to convince people to submit ideas to, and inform, the global community about their innovation and research activities. The knowledge sharing was critical. Without it, GINA would not have been able to perform the analysis and identify the hidden innovators within the company.

 

Summary

This chapter described the Data Analytics Lifecycle, which is an approach to managing and executing analytical projects. This approach describes the process in six phases.

1. Discovery

2. Data preparation

3. Model planning

4. Model building

5. Communicate results

6. Operationalize

Through these steps, data science teams can identify problems and perform rigorous investigation of the datasets needed for in-depth analysis. As stated in the chapter, although much is written about the analytical methods, the bulk of the time on these kinds of projects is spent in preparation, namely in Phases 1 and 2 (discovery and data preparation). In addition, this chapter discussed the seven roles needed for a data science team. It is critical that organizations recognize that Data Science is a team effort, and a balance of skills is needed to be successful in tackling Big Data projects and other complex projects involving data analytics.

 

Exercises

1. In which phase would the team expect to invest most of the project time? Why? Where would the team expect to spend the least time?

2. What are the benefits of doing a pilot program before a full-scale rollout of a new analytical methodology? Discuss this in the context of the mini case study.

3. What kinds of tools would be used in the following phases, and for which kinds of use scenarios?

1. Phase 2: Data preparation

2. Phase 4: Model building

 

Bibliography

1. [1] T. H. Davenport and D. J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, October 2012.

2. [2] J. Manyika, M. Chiu, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, “Big Data: The Next Frontier for Innovation, Competition, and Productivity,” McKinsey Global Institute, 2011.

3. [3] “Scientific Method” [Online]. Available: http://en.wikipedia.org/wiki/Scientific_method.

4. [4] “CRISP-DM” [Online]. Available: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

5. [5] T. H. Davenport, J. G. Harris, and R. Morison, Analytics at Work: Smarter Decisions, Better Results, 2010, Harvard Business Review Press.

6. [6] D. W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, 2010, Hoboken, NJ: John Wiley & Sons.

7. [7] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein and C. Welton, MAD Skills: New Analysis Practices for Big Data, Watertown, MA 2009.

8. [8] “List of APIs” [Online]. Available: http://www.programmableweb.com/apis.

9. [9] B. Shneiderman [Online]. Available: http://www.ifp.illinois.edu/nabhcs/abstracts/shneiderman.html.

10. [10] “Hadoop” [Online]. Available: http://hadoop.apache.org.

11. [11] “Alpine Miner” [Online]. Available: http://alpinenow.com.

12. [12] “OpenRefine” [Online]. Available: http://openrefine.org.

13. [13] “Data Wrangler” [Online]. Available: http://vis.stanford.edu/wrangler/.

14. [14] “CRAN” [Online]. Available: http://cran.us.r-project.org.

15. [15] “SQL” [Online]. Available: http://en.wikipedia.org/wiki/SQL.

16. [16] “SAS/ACCESS” [Online]. Available: http://www.sas.com/en_us/software/data-management/access.htm.

17. [17] “SAS Enterprise Miner” [Online]. Available: http://www.sas.com/en_us/software/analytics/enterprise-miner.html.

18. [18] “SPSS Modeler” [Online]. Available: http://www-03.ibm.com/software/products/en/category/business-analytics.

19. [19] “Matlab” [Online]. Available: http://www.mathworks.com/products/matlab/.

20. [20] “Statistica” [Online]. Available: https://www.statsoft.com.

21. [21] “Mathematica” [Online]. Available: http://www.wolfram.com/mathematica/.

22. [22] “Octave” [Online]. Available: https://www.gnu.org/software/octave/.

23. [23] “WEKA” [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.

24. [24] “MADlib” [Online]. Available: http://madlib.net.

25. [25] K. L. Higbee, Your Memory—How It Works and How to Improve It, New York: Marlowe & Company, 1996.

26. [26] S. Todd, “Data Science and Big Data Curriculum” [Online]. Available: http://stevetodd.typepad.com/my_weblog/data-science-and-big-data-curriculum/.

27. [27] T. H. Davenport and D. J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, October 2012.

 
