Industries Needs: Fundamentals of Data Science for Future Data Scientists

1. Data, Data Types, and Big Data

Data have been frequently discussed along with two other concepts: information and knowledge. However, the concept of data seems less ambiguous than that of information or knowledge. Data are considered to be symbols, or raw facts which have not yet been processed (Ackoff, 1989; Coronel, Morris, & Rob, 2012, p. 5). In contrast, information has a number of different definitions, such as information as a thing, something informative, a process, or equivalent to knowledge (Buckland, 1991; Losee, 1997; Saracevic, 1999; Madden, 2000). Knowledge also does not have an agreed-upon definition. Davenport and Prusak (1998, p. 5) viewed knowledge as “a fluid mix of framed experience, values, contextual information, and expert insight that provides a framework for evaluating and incorporating new experiences and information.”

The three concepts of data, information, and knowledge are related. They are products of human intellectual activity at different cognitive levels. Data usually need to be processed and organized to become informative for human beings, while knowledge is obtained by making use of data and information through experiences. There is a fine line between data and information: when we talk about digital information or digital data, they are sometimes used interchangeably.

To facilitate our discussion, we follow the idea that data are symbols or raw facts that are stored on electronic media. They can be produced or collected manually by humans or automatically by instruments. The following are examples of data in different forms and formats:

• Scientific data. Scientific data are those collected by scientists or scientific instruments during observations or experiments. Examples of scientific data include astronomical data collected by telescopes, patients’ vitals collected by heartbeat monitors, and laboratory data collected by biologists or physicists;

• Transaction data. Transaction data are those collected by companies to record their business activities, such as sales records, product specifics, and customer profiles;

• Data of public interest. These are data that reflect environmental status, human activities, and social change, such as geographical data and weather forecasts;

• Web data. Web data are documents, webpages, images, and videos published or released on the Internet. These data also record human activities and knowledge. They can be raw facts for certain tasks or decision-making processes;

• Social media data. Social media data are those generated by Internet or mobile users, such as tweets or postings on social media sites, or comments on others’ postings.

Data can also be classified along different facets, depending on the perspective taken. For example, scientific data can be numeric, textual, or graphical. Or, we can categorize data as structured or unstructured based on whether they have been organized. Data type is an important concept in this regard. It has long been a basic concept in Computer Science: a computer programmer knows that one needs to determine the data type of a particular variable before applying appropriate operations to it. Computer languages can usually handle different types of data. For example, Python, one of the most popular computer programming languages for data analysis, contains simple data types including numerical data types, string, and bytes, and collection data types such as tuple, list, set, and dictionary. Data might need to be changed to a different type in order to be processed properly.
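The Python data types named above, and the need to convert between them, can be illustrated with a minimal sketch. All variable names and values here are hypothetical and chosen only for illustration.

```python
# Simple (scalar) data types. All values are made up for illustration.
count = 120                    # int: a whole number
share = 0.25                   # float: a fractional number
title = "data scientist"       # str: text
raw = title.encode("utf-8")    # bytes: the same text as raw bytes

# Collection data types.
location = ("Austin", "TX")                       # tuple: ordered, fixed
skills = ["Python", "SQL", "R"]                   # list: ordered, changeable
unique_skills = set(skills + ["SQL"])             # set: duplicates dropped
posting = {"title": title, "salary": "85000"}     # dict: key -> value

# Type conversion: the salary was stored as text, so it must be turned
# into a number before arithmetic on it is meaningful.
salary = int(posting["salary"])
monthly = salary / 12
```

The last two lines show the point made in the text: the same value can be held as a string or as a number, and only the numeric form supports division.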

Most of the data we discuss in this chapter are digital data in electronic form. Data are generated every day and at every moment. One of the characteristics of digital data is that they can be copied easily and precisely, or transferred from one medium to another with great speed and accuracy. However, that is not always true, especially when we start to deal with big data. Big data refer to digital data with three Vs: volume, variety, and velocity (Laney, 2011). Volume refers to the rapid rate at which data grow, variety refers to the many data types in which data exist, and velocity refers to the speed of data delivery and processing. Dealing with big data requires theories, methods, technologies, and tools, which has led to the emergence of Data Science, a new discipline that focuses on data processing and is discussed in the remaining sections.

Data are important resources for organizations and individuals who use them to make decisions. When big data becomes publicly available, no one can ignore the huge potential impact of these data on business, education, and personal lives. Therefore, Data Science as a new discipline has drawn great attention from governments, industry, and educational institutions.

2. Data Science and Data Scientists

Dr. William S. Cleveland formally defined Data Science, in the sense in which the term is used today, in his 2001 article “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” (Cleveland, 2001). Since then, the concept of Data Science has been further developed and has been increasingly linked to data analytics and big data. Companies have realized the value of Data Science and its usefulness in discovering knowledge and helping with decision-making. The need for Data Science workers, or data scientists, has increased tremendously in recent years (Leopold, 2017).

2.1 Defining Data Science: Different Perspectives

Providing a precise definition for a discipline is important, especially for students who are eager to acquire important knowledge and skills for future job seeking. This section summarizes the definitions of Data Science and analyzes the features of this discipline. We also provide our own definition based on our understanding of this important field.

The following are prominent Data Science definitions we found in the literature:

• Data science as an expansion of the technical areas of the field of statistics (Cleveland, 2001).

• “Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data.” (Provost & Fawcett, 2013)

• “Data science is the study of the generalizable extraction of knowledge from data.” (Dhar, 2013)

• “An extension of information systems research.” (Agarwal & Dhar, 2014)

• “This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and interdisciplinary applications.” (Donoho, 2015)

• “Data science is now widely accepted as the fourth mode of scientific discovery, on par with theory, physical experimentation and computational analysis. Techniques based on Big Data are showing promise not only in scientific research, but also in education, health, policy, and business.” (Michigan Institute for Data Science, 2017)

• “Data science enables the creation of data products.” (Loukides, 2011, p. 1)

• “Data science is the study of where information comes from, what it represents, and how it can be turned into a valuable resource in the creation of business and IT strategies.” (Banafa, 2014)

• “Data Science is not only a synthetic concept to unify statistics, data analysis and their related methods but also comprises its results. It includes three phases, design for data, collection of data, and analysis on data.” (Hayashi, 1998, p. 41)

• “Data science is a combination of statistics, computer science, and information design” (Shum et al., 2013)

• “Data science is a new trans-disciplinary field that builds on and synthesizes a number of relevant disciplines and bodies of knowledge, including statistics, informatics, computing, communication, management, and sociology, to study data following data thinking” (Cao, 2017)

The above definitions reflect current understanding of this discipline. Together they help to describe the characteristics of Data Science, which can be summarized into the following aspects:

• The center of Data Science is data. Big data, in particular, has become the subject that is investigated;

• The purpose of Data Science is to obtain information or knowledge from data. The information will help to make better decisions, and the knowledge may help an organization, a state, a country, or the whole of humanity to better understand development and change in nature and society;

• Data science is a multi-disciplinary field which has applied theories and technologies from a number of disciplines and areas such as mathematics, statistics, computer science, information systems, and Information Science. It is expected that Data Science will in turn bring change to these disciplines.

We define Data Science as:

Data Science is an interdisciplinary field that explores scientific methodology and computational technology about data, including data management, access, analysis, and evaluation for the benefit of human beings.

In our definition, data is at the center of Data Science. Any activities around data should be considered within the scope of this field. In other words, Data Science not only deals with data mining or data analysis, but also explores technologies and approaches for other important activities including data management, data access, and evaluation. We avoid listing related disciplines in the definition because we believe Data Science has an impact on, and promises new opportunities for, every scientific and industry field.

As an emerging discipline, Data Science is still in its evolving stage. More research and investigation are needed to understand Data Science problems, challenges, theories, and methodologies. Interesting questions to explore include: What are its other major concepts in addition to data? What problems should Data Science attempt to tackle? And do we have enough theories and frameworks to guide future practice in Data Science? As specified by Provost and Fawcett (2013), in order for Data Science to flourish as a separate field, we should think beyond the specific algorithms, techniques, and tools that we are currently using and work on developing the core principles and concepts that underlie related techniques.

2.2 Most Related Disciplines and Fields for Data Science

Data science has been considered an interdisciplinary field since its emergence. It was originally developed within the statistics and mathematics community (Cao, 2017). Even though we believe Data Science is relevant to every discipline, at its current stage it is more closely related to some disciplines than to others, which is evident from the definitions in the literature provided in the previous section. These disciplines include Mathematics (especially Probability and Statistics), Information Science, Computer Science, Knowledge Management, Management Information Systems, and Decision Science. The Data Science programs at the doctoral and Master’s levels (see Table 3 and Table 4) confirm the above list, as most of the Data Science programs were established in colleges, schools, or departments of Information Science, Computer Science, Statistics, and Business Management.

Many studies have explored the connections between Data Science and Mathematics, Computer Science, and Management Information Systems (Cleveland, 2001; Shum et al., 2013; Agarwal & Dhar, 2014). Mathematics, including statistics, provides numerical reasoning and methodologies for data analysis and processing. Some mathematical concepts and theorems, such as those in Calculus, Linear Algebra, and Statistics, need to be mastered by potential data science workers. Computer Science provides programming languages, database systems, and algorithms that a data science worker needs to manipulate numerical and textual data. Management Information Systems, on the other hand, has explored data mining and data analytics for years.

However, we found that the connection of Data Science with Information Science has been largely neglected, which may negatively impact the development of Data Science as a discipline. The public tends to equate Information Science with Library Science, which has focused more on classic information resource management, library management, and reference services in libraries. Even though Information Science has historically been closely related to Library Science, its landscape has changed dramatically since the advent of the Internet. Information Science can provide useful theories and guidance for the development of Data Science. For example, one of the important concepts to explore in Data Science may be the data life cycle. There are many ways to describe and depict the data life cycle (Chang, 2012). For data management, the data life cycle can include creating data, processing data, analyzing data, presenting data, giving access to data, and re-using data (Boston University Libraries, n.d.). Or it can be a series of processes to create, store, use, share, archive, and destroy data (Spirion, n.d.). Explanations of the stages included in the life cycle can differ, but some processes, such as archiving, storing, and accessing, have already been well explored for information in Information Science.

Data Science is also closely related to Knowledge Management. Hawamdeh (2003) describes the field of Knowledge Management and its relationship with other disciplines for the purpose of educating knowledge professionals (p. 168). He considered Knowledge Management a multidisciplinary subject. Knowledge management professionals need to acquire information technology and Information Science skills, which serve as a foundation for higher-level knowledge work such as knowledge management and sharing. We expand on his framework by including an additional step, shown in Figure 1, called data processing and analysis, between the original Step 2 (information acquisition and content management) and Step 3 (information and knowledge sharing). In this consideration, Information Science, Data Science, and Knowledge Management are closely related and overlapping disciplines or areas. Typically, Data Science centers on activities in step 3, but should also include activities and tasks in steps 2, 4, and 5.

2.3 Data Scientists: the Professions of Doing Data Science

The increasing interest in Data Science calls for more Data Science workers. Several reports indicate that data scientists, whose role has frequently been called the “sexiest job of the 21st century” (Davenport & Patil, 2012), are in high demand. McKinsey and Company (2011) pointed out that the U.S. is facing a shortage of 140,000 to 190,000 people with the analytical and managerial skills necessary to handle business processes related to critical decision making and big data. Additionally, companies in other sectors have started recruiting data scientists as a crucial component of their business strategy. For example, Greylock Partners, a leading venture capital firm in Silicon Valley that was an early investor in such well-known companies as Facebook and LinkedIn, is now showing more concern about expanding its team-based portfolio and recruiting more talented people in the Data Science and Analytics fields (Davenport & Patil, 2012).

A data scientist is usually described as “a hybrid of data hacker, analyst, communicator, and trusted adviser” (Davenport & Patil, 2012, p. 73). Data scientists’ jobs “involve data collecting, cleaning, formatting, building statistical and mathematical models, and communicate professionally with their partners and clients” (Carter & Sholler, 2016). According to Stodder (2015), data scientists begin by analyzing business problems, and then provide strategic and meaningful insights about probable actions and future plans. This strategy involves identifying business drivers, establishing a creative team characterized by excellent technical and communication skills, and using the latest analytics and visualization techniques (Stodder, 2015). Provost and Fawcett (2013) specified that successful data scientists “must be able to view business problems from a data perspective.”

The literature also describes knowledge and skills that are necessary for data scientists. For example, Dhar (2013) believed that “a data scientist requires an integrated skill set spanning mathematics, machine learning, artificial intelligence, statistics, databases, and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions.” He considered four basic types of courses that contribute to the knowledge and skills for a qualified data scientist: Machine Learning, Computer Science (data structure, algorithms, and systems), correlation and causation, and problem formulation.

Data scientist can be a general term for people working in Data Science. Different titles can be found in job postings and the literature for related positions. It is important to realize that data scientists may vary in their background, experiences, skills, and analytical workflows. Based on semi-structured interviews with 35 data analysts, Kandel, Paepcke, Hellerstein, and Heer (2012) indicated that analysts fall into three categories (hacker, scripter, and application user), each with different levels of technical skills and tasks that can be performed. They, however, generally share five common tasks that support daily insights: discovery, wrangling, profiling, modeling, and reporting.

In order to verify what we have learned from the literature and to understand the increase in the demand for data scientists (Leopold, 2017), we conducted a small-scale job analysis. The analysis is reported in the next section.

3. Data Science and Data Analytics Jobs: An Analysis

One of the goals of this chapter is to explore how higher education could produce qualified Data Science workers, or data scientists. We therefore conducted a preliminary job analysis to understand current job markets for Data Science workers. This section will present the purposes of the analysis, the research questions, the process of data collection and analysis, and the results.

3.1 Purposes of Analysis and Research Questions

This analysis aimed to understand the current job market for data scientists. Such understanding would provide us with guidelines for designing Data Science courses and programs at undergraduate and graduate levels. Specifically, we would like to find answers to the following questions:

(1) What characterizes the employers who hire data scientists?

(2) What job titles do employers use when looking for qualified Data Science workers?

(3) What are the general qualifications, knowledge, tools, or skills required by employers?

As mentioned earlier, our analysis also served as a validation of knowledge, skills, and competences for qualified data scientists addressed in the literature.

3.2 Data Collection

We recruited students from a graduate-level Information Science class at the University of North Texas to collect the data. The class, INFO 5717, is a course in which students learn to build web-based database systems. The students were assigned to five teams with three members in each team. They were asked to collect, process, and upload at least 60 job postings pertaining to data analysts, information management specialists, data scientists, and business intelligence analysts to a MySQL database.

Prior to data collection, we provided a list of metadata elements to guide the process. Students were required to collect information for at least the following elements for each job posting:

• Title of the job

• Start date and end date for the position

• Name of the employer

• Salary

• URL of the job posting at the employer’s website

• Position requirements

• Position responsibilities

Students were instructed to collect the above data on the Internet using methods they considered appropriate. The data were collected from job aggregator websites such as LinkedIn.com, indeed.com, dice.com, and glassdoor.com in November 2016. Additionally, some of the teams collected data pertaining to the employers, such as the URL of the official employer website, contact email, and the mailing address. The data were initially saved as text files.

We also used this project as an opportunity to train students in Data Science-related skills. Therefore we asked students to perform a series of tasks to process the collected data: each student team was required to design a job database using MySQL, containing one or two database tables to store the job posting and the employer information. Using HTML and PHP scripts, students created simple web entry forms to insert the job and employer information into their databases. As part of their course requirements, students developed web pages to manipulate the data or send queries to retrieve information from it. They also developed authentication and session-control mechanisms to secure and use their database. In total the teams collected 364 job postings.
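The two-table design described above can be sketched as follows. This is only an illustration under stated assumptions: the students worked with MySQL and PHP, whereas this sketch uses Python's built-in sqlite3 module as a stand-in so that it runs without a database server. The table names, column names, the helper add_posting, and the sample rows are all hypothetical; the columns loosely follow the metadata elements listed in Section 3.2.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL databases
# the student teams built. Schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employer (
    employer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    url         TEXT
);
CREATE TABLE job_posting (
    job_id           INTEGER PRIMARY KEY,
    title            TEXT NOT NULL,
    start_date       TEXT,
    end_date         TEXT,
    salary           TEXT,
    posting_url      TEXT,
    requirements     TEXT,
    responsibilities TEXT,
    employer_id      INTEGER REFERENCES employer(employer_id)
);
""")


def add_posting(conn, employer_name, title, salary=None, requirements=None):
    """Insert a posting, creating the employer row the first time it is seen."""
    row = conn.execute(
        "SELECT employer_id FROM employer WHERE name = ?", (employer_name,)
    ).fetchone()
    if row is None:
        employer_id = conn.execute(
            "INSERT INTO employer (name) VALUES (?)", (employer_name,)
        ).lastrowid
    else:
        employer_id = row[0]
    conn.execute(
        "INSERT INTO job_posting (title, salary, requirements, employer_id) "
        "VALUES (?, ?, ?, ?)",
        (title, salary, requirements, employer_id),
    )


# Made-up sample data.
add_posting(conn, "Example Corp", "Data Analyst", requirements="SQL, Excel")
add_posting(conn, "Example Corp", "Data Scientist", requirements="Python, R")

# Retrieve all titles posted by one employer via a join on the two tables.
titles = [
    r[0]
    for r in conn.execute(
        "SELECT p.title FROM job_posting AS p "
        "JOIN employer AS e ON e.employer_id = p.employer_id "
        "WHERE e.name = ? ORDER BY p.title",
        ("Example Corp",),
    )
]
employer_count = conn.execute("SELECT COUNT(*) FROM employer").fetchone()[0]
```

Parameterized queries (the ? placeholders) are used so that values typed into an entry form are never spliced directly into the SQL text.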

3.3 Data Cleanup and Integration

After students’ classwork, we exported the job and employer data from the five databases to Excel spreadsheets for analysis. We found that even though we gave students metadata specifications, the database tables from each team were not completely aligned: some had more elements, or the same elements were given different names. Employer information was more problematic because we did not specify exactly what information should be collected about employers, except that we needed to have the name and URL of the employer for each job posting. Furthermore, students might have collected the same job posting (duplication) as the teams worked independently of each other. Some of the collected records had missing values, such as job location and salary, because the information was not stated clearly in the original announcements.

We conducted manual data cleanup to fix the above data issues. A few missing salary values were filled in using popular online salary estimation tools such as the job-search engine indeed.com (Doherty, 2010). The manual data cleanup was possible because we had only 364 job postings. After removing the duplicates and consolidating the data elements, we obtained 298 usable observations for analysis.
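The cleanup in this study was done by hand. For a larger collection, the two main steps (consolidating element names and removing duplicate postings) could be scripted. The following is a minimal sketch under stated assumptions: the alias table, the field names, the choice of (title, employer, url) as the duplicate key, and the sample records are all hypothetical and are not taken from the students' databases.

```python
# Hypothetical mapping from team-specific element names to one shared name.
FIELD_ALIASES = {
    "job_title": "title",
    "position": "title",
    "company": "employer",
    "employer_name": "employer",
    "link": "url",
}


def normalize(record):
    """Rename aliased fields to shared names and trim stray whitespace."""
    clean = {}
    for key, value in record.items():
        key = key.strip().lower()
        name = FIELD_ALIASES.get(key, key)
        clean[name] = value.strip() if isinstance(value, str) else value
    return clean


def deduplicate(records):
    """Keep the first record for each (title, employer, url) combination."""
    seen = set()
    unique = []
    for rec in map(normalize, records):
        key = tuple((rec.get(f) or "").lower() for f in ("title", "employer", "url"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up records as two teams might have stored the same posting.
records = [
    {"job_title": "Data Analyst ", "company": "Example Corp",
     "link": "https://example.com/jobs/1"},
    {"Title": "data analyst", "Employer": "Example Corp",
     "URL": "https://example.com/jobs/1"},
    {"Title": "Data Engineer", "Employer": "Example Corp",
     "URL": "https://example.com/jobs/2"},
]
unique = deduplicate(records)
```

Matching on an exact key catches only literal duplicates. Postings that were reworded or re-listed under a different URL would still need the kind of manual review described above.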

3.4 Tools for Data Analysis

For this small dataset, we mainly used Microsoft Excel to sort the different elements, to generate frequency counts and percentages, and to produce diagrams. Additionally, SAS Enterprise Miner (SASEM) 14.1 was utilized to perform cluster analysis (SAS Institute Inc, 2014) on qualifications and required experience as presented in the job postings. We also wrote a simple program using the R language to extract the most frequent keywords and produced a visual representation for the results called a word cloud (Heimerl, Lohmann, Lange, & Ertl, 2014). Manual data transformation and adjustment were performed in order to use these tools for the purpose of answering the research questions proposed in Section 3.1.

3.5 Results and Discussion

In this section, we present the results of our analysis and answer our research questions:

3.5.1 Characteristics of the Employers

From the 298 valid job postings, 236 company names were identified by the students. Among them we found some well-known companies such as Apple, Blue Cross and Blue Shield, Capital One, Bank of America, Ericsson, Google, IBM, Infosys, Nike, Samsung, Sprint, Verizon, Walmart, Wells Fargo, and Xerox. These employers covered different industrial fields such as Information Technologies, Healthcare, Insurance, Finance, Retail, Biotechnology, Business services, Manufacturing, and Media sectors.

With respect to the geographical concentration of the job postings, Texas led the United States with more than 24% of the job postings, followed by Illinois and California with 17% and 15%, respectively (Figure 2). The collected data also showed that both graduates and undergraduates were possible candidates for Data Science jobs. The preferred levels of experience ranged from two to more than 10 years. These results portray an image of the job market for Data Science.

3.5.2 Job Titles

A frequency analysis of the 298 job postings showed that the top job titles used by the employers were: data scientist, data analyst, business intelligence analyst or intelligence analyst, information management specialist/analyst, and data engineer. For data scientists and data analysts, the job title could specify different levels, such as lead or principal, senior, junior, or intern. Some job titles, such as data analytics scientist and Data Science quantitative analyst, were a mixture of data analyst, data science, or data scientist. Other titles included big data architect, data integrity specialist, data informatics scientist, marketing analyst, MIS specialist, and project delivery specialist and consultant on data analytics.
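The frequency counts and percentages in this study were produced in Microsoft Excel. The same kind of table can be sketched in a few lines of Python; the function name frequency_table and the sample titles below are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows, most frequent first."""
    # Lower-case and trim so that "Data Scientist " and "data scientist"
    # are counted as the same title.
    counts = Counter(v.strip().lower() for v in values)
    total = sum(counts.values())
    return [(v, n, round(100 * n / total, 1)) for v, n in counts.most_common()]


# Made-up job titles for illustration.
titles = [
    "Data Scientist", "data analyst", "Data Scientist ",
    "Data Engineer", "Data Analyst", "data scientist",
]
table = frequency_table(titles)
```

Titles that differ only in level, such as "senior data analyst", would still be counted separately here; grouping them requires a decision about which variants count as the same title.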

3.5.3 Word Cloud and Clusters on Qualifications and Responsibilities

We used the R programming language to visually summarize the most frequent terms occurring in two fields: qualifications and responsibilities. The word cloud in Figure 3 shows that terms like “experience”, “business”, “analyst”, “analysis”, “management”, and “skills” stood out for their frequent occurrences. This is consistent with the many field researchers, career practitioners, and job posters who have confirmed that effective data scientists and analysts need to have a substantial level of managerial and business experience in addition to sophisticated skills in programming, development, design, and statistics (Davenport & Patil, 2012; Kandel, Paepcke, Hellerstein, & Heer, 2012).
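A word cloud is a visual rendering of term counts, so the counting step behind it can be sketched on its own. The study used R for this; the sketch below uses Python instead, and the stop-word list, the function name top_terms, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# A very small, hypothetical stop-word list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "or"}


def top_terms(texts, n=5):
    """Count words across all texts, ignoring case and stop words."""
    words = []
    for text in texts:
        words.extend(
            w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS
        )
    return Counter(words).most_common(n)


# Made-up qualification statements.
qualifications = [
    "Experience with SQL and data analysis",
    "Strong business and management skills",
    "Experience in business analysis",
]
top = top_terms(qualifications, n=3)
```

In a word cloud each returned term would be drawn at a size proportional to its count.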

We also conducted cluster analysis on these two fields. The purpose of cluster analysis was to identify concepts that are implied in the dataset. The process starts by exploring the textual data content, using the results to group the data into meaningful clusters, and reporting the essential concepts found in these clusters (SAS Institute Inc, 2014). Using the SASEM cluster analysis function, which is based on the mutual information weighting method, we obtained five clusters as presented in Table 2. According to the terms in each cluster, we could name Cluster 1 as project management; Cluster 2 as machine learning and algorithmic skills; Cluster 3 as statistical models and business analytics; Cluster 4 as database management & systems support, and Cluster 5 as communication skills. Among them, Cluster 2 is the largest cluster with a relative frequency of 47% among all the terms in these two fields.

The Text Filter Node functionality in SAS enabled us to invoke a filter viewer to explore interactive relationships among terms (SAS Institute Inc, 2014). To preview the concepts that are highly correlated with the term data, a concepts link map was generated, as depicted in Figure 4. The five clusters in Table 2 become the six nodes (two nodes on Analysis/Analyze) with links to other outer nodes.
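The idea behind a concept link map, namely finding terms that tend to appear together with a chosen term, can be roughly approximated by counting co-occurrences. The sketch below is not the method SAS uses; it is a simplified stand-in, and the function name linked_terms, the min_docs threshold, and the sample texts are hypothetical.

```python
import re
from collections import Counter


def linked_terms(texts, anchor="data", min_docs=2):
    """For each term, count the texts in which it appears with the anchor term."""
    links = Counter()
    for text in texts:
        # The character class keeps "+" and "#" so that names such as
        # "c++" survive tokenization.
        terms = set(re.findall(r"[a-z+#]+", text.lower()))
        if anchor in terms:
            links.update(terms - {anchor})
    return {term: n for term, n in links.items() if n >= min_docs}


# Made-up responsibility statements.
responsibilities = [
    "Analyze data using Python and SQL",
    "Manage data pipelines in Python",
    "Communicate results to clients",
]
links = linked_terms(responsibilities)
```

Raw co-occurrence counts favor terms that are frequent everywhere, which is why dedicated text mining tools weight the association rather than simply counting it.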

The word cloud and cluster analysis showed a broader view of the terms from the job market perspective. Also, further exploration of the concepts link map revealed that data scientists need to have substantial skills in Hadoop, Hive, Pig, Python, Machine Learning, C++ and Java programming, R, and SQL, in addition to being proficient with statistical and modeling packages such as SPSS and SASEM.

3.6 Summary of the Job Posting Analysis

This small-scale study covers many of the activities a data analyst may perform according to Donoho (2015). We realized that a Data Science worker needs to know how to organize and clean up data, how to analyze them, and how to create data visualizations using different tools. It would be very beneficial if he or she also possessed basic research skills, such as how to collect data and how to evaluate them. Data analysis also involves team-based work by data scientists who use effective tools to build statistical and predictive models that produce actionable business insights within their organizations, a practice that should also be built into the Data Science curriculum.

Our job analysis above provides insight into this new Data Science discipline and its career opportunities. These insights can guide educators to develop appropriate Data Science programs and courses that meet the short-term and long-term needs of the digital economy. Next we present our thoughts on Data Science education based on this job analysis and an overview of existing Data Science programs.

Monday, February 7, 2022
