Industries Needs: DATA SCIENCE TECHNOLOGY STACK

PROCESS SUPERSTEP

7.0 OBJECTIVES

The objective of this chapter is to learn the Time-Person-Object-Location-Event (T-P-O-L-E) design principle and the concepts that are used to create and define relationships among these data.

7.1 INTRODUCTION

The Process superstep converts the assessed results of the retrieved versions of the data sources into a highly structured data vault. These data vaults form the basic data structure for the rest of the data science steps. The Process superstep is the amalgamation procedure that pipes your data sources into five primary classifications of data.

7.2 DATA VAULT

Data vault modelling is a technique for managing the long-term storage of data from multiple operational systems. It stores historical data in the database.

7.2.1 Hubs:

A data vault hub stores business keys. These keys do not change over time. A hub also contains a surrogate key for each hub entry and metadata about the business key.

7.2.2 Links:

Data vault links are the join relationships between business keys.

7.2.3 Satellites:

Data vault satellites store the chronological, descriptive characteristics for a specific section of business data. Using hubs and links, we get the model structure but no chronological characteristics. Satellites consist of characteristics and metadata that link them to their specific hub.

7.2.4 Reference Satellites:

Reference satellites are referenced from satellites and can be used by other satellites to prevent redundant storage of reference characteristics.
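The building blocks above can be pictured as a handful of small record types. The following is a minimal sketch, not a reference implementation: the class names, the field names, the UUID-based surrogate keys, and the sample customer and order keys are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


def new_key() -> str:
    """Generate a surrogate key (a random UUID here, purely for illustration)."""
    return str(uuid4())


@dataclass(frozen=True)
class Hub:
    business_key: str          # the stable business key, e.g. a customer number
    record_source: str         # metadata: where the key was first seen
    surrogate_key: str = field(default_factory=new_key)


@dataclass(frozen=True)
class Link:
    hub_keys: tuple            # surrogate keys of the hubs this link joins
    record_source: str
    surrogate_key: str = field(default_factory=new_key)


@dataclass(frozen=True)
class Satellite:
    parent_key: str            # surrogate key of the hub being described
    attributes: dict           # descriptive characteristics valid from load_date
    load_date: datetime        # gives the satellite its chronological dimension


# Hypothetical example: one customer, one order, and the customer's changing city.
customer = Hub(business_key="CUST-001", record_source="crm")
order = Hub(business_key="ORD-9001", record_source="sales")
placed = Link(hub_keys=(customer.surrogate_key, order.surrogate_key), record_source="sales")

history = [
    Satellite(customer.surrogate_key, {"city": "Mumbai"}, datetime(2020, 1, 1, tzinfo=timezone.utc)),
    Satellite(customer.surrogate_key, {"city": "Pune"}, datetime(2021, 6, 1, tzinfo=timezone.utc)),
]

# The hub never changes; the latest satellite row holds the current description.
current = max(history, key=lambda s: s.load_date)
print(customer.business_key, "->", current.attributes["city"])
```

Keeping the history in the satellite rather than on the hub is what lets the hub key stay fixed while the description changes over time.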

7.3 TIME-PERSON-OBJECT-LOCATION-EVENT DATA VAULT

We will use the Time-Person-Object-Location-Event (T-P-O-L-E) design principle.

All five sections are linked with each other, resulting in sixteen links.

7.4 TIME SECTION

The Time section contains the data structures that store all time-related information.

For example, the time at which an event occurred.

7.4.1 Time Hub:

This hub acts as a connector between time zones. Following are the fields of the time hub.

7.4.2 Time Links:

The following time links can be stored as separate links.

• Time-Person Link

• This link connects date-time values from the time hub to the person hub.

• Dates such as birthdays, anniversaries, book access dates, etc.

• Time-Object Link

• This link connects date-time values from the time hub to the object hub.

• Dates such as when you buy or sell a car, house, or book, etc.

• Time-Location Link

• This link connects date-time values from the time hub to the location hub.

• Dates such as when you moved or accessed a book from a post code, etc.

• Time-Event Link

• This link connects date-time values from the time hub to the event hub.

• Dates such as when you changed vehicles, etc.

7.4.3 Time Satellites:

Following are the fields of time satellites.

The time satellite can be used to move from one time zone to another very easily. This feature will be used during the Transform superstep.
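As a rough illustration of that time-zone move, the sketch below stores one event time in UTC and derives local representations from it. It uses only the standard library with fixed UTC offsets, which is an assumption made to keep the example self-contained. The zone labels and the sample timestamp are hypothetical, and real zones with daylight saving would need a time-zone database such as the one behind pytz.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones (assumed for illustration; they ignore daylight saving).
UTC = timezone.utc
IST = timezone(timedelta(hours=5, minutes=30), name="IST")
EST = timezone(timedelta(hours=-5), name="EST")

# The hub keeps a single UTC value for the event.
event_utc = datetime(2020, 1, 2, 3, 4, 5, tzinfo=UTC)

# Satellite-style rows: the same instant expressed in each local zone.
time_satellite = {
    zone.tzname(None): event_utc.astimezone(zone).isoformat()
    for zone in (UTC, IST, EST)
}

for zone_name, local_time in time_satellite.items():
    print(zone_name, local_time)
```

Storing the instant once and deriving the local views keeps every zone consistent with the same underlying event time.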

7.5 PERSON SECTION

The Person section contains the data structures that store all data related to a person.

7.5.1 Person Hub:

Following are the fields of the person hub.

7.5.2 Person Links:

Person links connect the person hub to the other hubs.

The following person links can be stored as separate links.

• Person-Time Link

• This link contains the relationship between the person hub and the time hub.

• Person-Object Link

• This link contains the relationship between the person hub and the object hub.

• Person-Location Link

• This link contains the relationship between the person hub and the location hub.

• Person-Event Link

• This link contains the relationship between the person hub and the event hub.

7.5.3 Person Satellites:

Person satellites are part of the vault. They hold information such as the birthdate, anniversary, or validity dates of an ID for the respective person.

7.6 OBJECT SECTION

The Object section contains the data structures that store all data related to an object.

7.6.1 Object Hub:

The object hub represents a real-world object with a few attributes. Following are the fields of the object hub.

7.6.2 Object Links:

Object links connect the object hub to the other hubs.

The following object links can be stored as separate links.

• Object-Time Link

• This link contains the relationship between the object hub and the time hub.

• Object-Person Link

• This link contains the relationship between the object hub and the person hub.

• Object-Location Link

• This link contains the relationship between the object hub and the location hub.

• Object-Event Link

• This link contains the relationship between the object hub and the event hub.

7.6.3 Object Satellites:

Object satellites are part of the vault. They hold information such as the ID, UUID, type, key, etc. for the respective object.

7.7 LOCATION SECTION

The Location section contains the data structures that store all data related to a location.

7.7.1 Location Hub:

The location hub consists of a series of fields that support a GPS location. The location hub consists of the following fields:

7.7.2 Location Links:

Location links connect the location hub to the other hubs.

The following location links can be stored as separate links.

• Location-Time Link

• This link contains the relationship between the location hub and the time hub.

• Location-Person Link

• This link contains the relationship between the location hub and the person hub.

• Location-Object Link

• This link contains the relationship between the location hub and the object hub.

• Location-Event Link

• This link contains the relationship between the location hub and the event hub.

7.7.3 Location Satellites:

Location satellites are the part of the vault that contains the locations of entities.

7.8 EVENT SECTION

The Event section contains the data structures that store all data of entities related to events that have occurred.

7.8.1 Event Hub:

The event hub contains various fields that store real-world events.

7.8.2 Event Links:

Event links connect the event hub to the other hubs.

The following event links can be stored as separate links.

• Event-Time Link

• This link contains the relationship between the event hub and the time hub.

• Event-Person Link

• This link contains the relationship between the event hub and the person hub.

• Event-Object Link

• This link contains the relationship between the event hub and the object hub.

• Event-Location Link

• This link contains the relationship between the event hub and the location hub.

7.8.3 Event Satellites:

Event satellites are the part of the vault that contains information about the events that occur in the system.

7.9 ENGINEERING A PRACTICAL PROCESS SUPERSTEP

Time:

Time is the most important characteristic of data, as it records when an event occurred. ISO 8601:2004 defines an international standard for the interchange formats of dates and times.

The following entities are part of the ISO 8601:2004 standard:

Year, month, day, hour, minute, second, and fraction of a second

The date/time is recorded from the largest unit (year) to the smallest (fraction of a second). These values must have a pre-approved, fixed number of digits, padded with leading zeros.

Year

The standard uses four digits to represent the year. The values range from 0000 to 9999.

AD/BC requires conversion.

from datetime import datetime
from pytz import timezone

# Build a fixed sample timestamp and mark it as UTC
now_date = datetime(2020, 1, 2, 3, 4, 5, 6)
now_utc = now_date.replace(tzinfo=timezone('UTC'))
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Year:', now_utc.strftime("%Y"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Year: 2020

Month

The standard uses two digits to represent the month. The values range from 01 to 12.

The rule for a valid month is that 12 January 2020 becomes 2020-01-12.

The above program can be updated to extract the month value.

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Month:', now_utc.strftime("%m"))
print('Month Name:', now_utc.strftime("%B"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Month: 01
Month Name: January

Day

The standard uses two digits to represent the day. The values range from 01 to 31.

The rule for a valid day is that 22 January 2020 becomes 2020-01-22 or +2020-01-22.

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Day:', now_utc.strftime("%d"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Day: 02

Hour:

The standard uses two digits to represent the hour. The values range from 00 to 24.

The valid format is hhmmss or hh:mm:ss. The shortened format hhmm or hh:mm is also accepted.

The use of 00:00:00 indicates the beginning of the calendar day. The use of 24:00:00 only indicates the end of the calendar day.

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Hour:', now_utc.strftime("%H"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Hour: 03

Minute:

The standard uses two digits to represent the minute. The values range from 00 to 59.

The valid format is hhmmss or hh:mm:ss.

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Minute:', now_utc.strftime("%M"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Minute: 04

Second:

The standard uses two digits to represent the second. The values range from 00 to 59.

The valid format is hhmmss or hh:mm:ss.

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Second:', now_utc.strftime("%S"))

Output:

Date: 2020-01-02 03:04:05 (UTC) (+0000)
Second: 05

The fraction of a second is only defined as a format: hhmmss,sss or hh:mm:ss,sss or hhmmss.sss or hh:mm:ss.sss.

The current commonly used formats are the following:

• hh:mm:ss.s: Tenth of a second

• hh:mm:ss.ss: Hundredth of a second

• hh:mm:ss.sss: Thousandth of a second

print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
# %f prints the microseconds as six zero-padded digits
print('Millionth of Second:', now_utc.strftime("%f"))
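The fractional formats listed above can also be produced with the standard library alone. The sketch below is an illustration only: it uses the sample date from this section with an assumed value of 123456 microseconds, and it obtains the tenth and hundredth forms by truncating the digits, not by rounding.

```python
from datetime import datetime, timezone

stamp = datetime(2020, 1, 2, 3, 4, 5, 123456, tzinfo=timezone.utc)

base = stamp.strftime("%H:%M:%S")        # hh:mm:ss
micro = stamp.strftime("%f")             # six digits: 123456

tenth = f"{base}.{micro[:1]}"            # hh:mm:ss.s
hundredth = f"{base}.{micro[:2]}"        # hh:mm:ss.ss
thousandth = f"{base}.{micro[:3]}"       # hh:mm:ss.sss

# isoformat can emit the thousandth form directly for a full timestamp.
iso_millis = stamp.isoformat(timespec="milliseconds")

print(tenth, hundredth, thousandth, iso_millis)
```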

Coordinated Universal Time (UTC)

A sample program to display the current time.

from datetime import datetime
from pytz import timezone

# Get the current time as a timezone-aware UTC value
now_utc = datetime.now(timezone('UTC'))
# Get the time in Mumbai, India
now_india = now_utc.astimezone(timezone('Asia/Kolkata'))
print('India Date Time:', now_india.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))

Output:

7.9.1 Event:

This structure records any specific event or action that is discovered in the data sources. An event is any action that occurs within the data sources. Events are recorded using three main data entities: Event Type, Event Group, and Event Code. The details of each event are recorded as a set of details against the event code. There are two main types of events.

7.9.2 Explicit Event:

This type of event is stated in the data source clearly and with full details. There is clear data to show that the specific action was performed. Following are examples of explicit events:

• A security card with number 1234 was used to open door A.

• You are reading Chapter 9 of Practical Data Science.

• I bought ten cans of beef curry.

Explicit events are the events that the source systems supply, as these have direct data that proves that the specific action was performed.

7.9.3 Implicit Event:

This type of event is formulated from characteristics of the data in the source systems plus a series of insights on the data relationships.

The following are examples of implicit events:

• A security card with number 8884.1 was used to open door X.

• A security card with number 8884.1 was issued to Mr. Vermeulen.

• Room 302 is fitted with a security reader marked door X.

These three events would imply that Mr. Vermeulen entered room 302 as an event. Not true! The data proves only that the card was used at door X, not who was holding it.
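The way such an implicit event is assembled from explicit facts can be sketched in a few lines. This is an illustration under assumptions: the record layouts, the function name, and the `proven` flag are hypothetical, and the sample values are simply the three facts from the example above.

```python
# Explicit facts as they might arrive from three source systems (hypothetical layouts).
card_swipes = [{"card": "8884.1", "door": "X"}]
card_owners = {"8884.1": "Mr. Vermeulen"}
door_rooms = {"X": "Room 302"}


def derive_implicit_events(swipes, owners, rooms):
    """Combine explicit facts into candidate implicit events."""
    events = []
    for swipe in swipes:
        owner = owners.get(swipe["card"])
        room = rooms.get(swipe["door"])
        if owner and room:
            events.append({
                "event_type": "implicit",
                "description": f"{owner} may have entered {room}",
                # Only the card use is proven; the person is an inference.
                "proven": False,
            })
    return events


implicit = derive_implicit_events(card_swipes, card_owners, door_rooms)
print(implicit[0]["description"])
```

Marking the derived record as unproven keeps the inference separate from the explicit events that the source systems actually supplied.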

7.10 5-WHYS TECHNIQUE

Data science is at its core about curiosity and inquisitiveness. This core is rooted in the 5 Whys. The 5 Whys is a technique used in the analysis phase of data science.

7.10.1 Benefits of the 5 Whys:

The 5 Whys assist the data scientist to identify the root cause of a problem and determine the relationship between different root causes of the same problem. It is one of the simplest investigative tools—easy to complete without intense statistical analysis.

7.10.2 When Are the 5 Whys Most Useful?:

The 5 Whys are most useful for finding solutions to problems that involve human factors or interactions that generate multi-layered data problems. In day-to-day business life, they can be used in real-world businesses to find the root causes of issues.

7.10.3 How to Complete the 5 Whys?:

1. Write down the specific problem. This will help you to formalize the problem and describe it completely. It also helps the data science team to focus on the same problem.

2. Ask why the problem occurred and write the answer below the problem.

3. If the answer you provided doesn’t identify the root cause of the problem that you wrote down first, ask why again, and write down that answer.

4. Loop back to the preceding step until you and your customer agree that the problem’s root cause is identified. Again, this may require fewer or more than five whys.

Example:

Problem Statement: Customers are unhappy because they are being shipped products that don’t meet their specifications.

1. Why are customers being shipped bad products?

• Because manufacturing built the products to a specification that is different from what the customer and the salesperson agreed to.

2. Why did manufacturing build the products to a different specification than that of sales?

• Because the salesperson accelerates work on the shop floor by calling the head of manufacturing directly to begin work. An error occurred when the specifications were being communicated or written down.

3. Why does the salesperson call the head of manufacturing directly to start work instead of following the procedure established by the company?

• Because the “start work” form requires the sales director’s approval before work can begin and slows the manufacturing process (or stops it when the director is out of the office).

4. Why does the form contain an approval for the sales director?

• Because the sales director must be continually updated on sales for discussions with the CEO, as my retailer customer was a top ten key account.

In this case, only four whys were required to determine that a non-value-added signature authority helped to cause a process breakdown in the quality assurance for a key account! The rest was just criminal.

The external buyer at the wholesaler knew this process was regularly bypassed and started buying the bad tins to act as an unofficial backfill for the failing quality-assurance process in manufacturing, to make up the shortfalls in sales demand. The wholesaler simply relabelled the product and did not change how it was manufactured. The reason? Big savings lead to big bonuses. A key client’s orders had to be filled. Sales are important!

7.11 FISHBONE DIAGRAMS

The fishbone diagram, or Ishikawa diagram, is a useful tool for finding where each data item fits into the data vault. This is a cause-and-effect diagram that helps managers to track down the reasons for imperfections, variations, defects, or failures. The diagram looks just like a fish’s skeleton, with the problem at its head and the causes of the problem feeding into the spine. Once all the causes that underlie the problem have been identified, managers can start looking for solutions to ensure that the problem doesn’t become a recurring one.

It can also be used in product development. Having a problem-solving product will ensure that your new development will be popular, provided people care about the problem you’re trying to solve. The fishbone diagram strives to pinpoint everything that’s wrong with current market offerings so that you can develop an innovation that doesn’t have these problems.

Finally, the fishbone diagram is also a great way to look for and prevent quality problems before they ever arise. Use it to troubleshoot before there is trouble, and you can overcome all or most of your teething troubles when introducing something new.

7.12 MONTE CARLO SIMULATION

The Monte Carlo simulation technique performs analysis by building models of possible results, substituting a range of values (a probability distribution) for each parameter that has inherent uncertainty. It then calculates results over and over, each time using a different set of random values from the probability functions. Depending on the number of uncertainties and the ranges specified for them, a Monte Carlo simulation can involve thousands or tens of thousands of recalculations before it is complete. Monte Carlo simulation produces distributions of possible outcome values. As a data scientist, this gives you an indication of how your model will react under real-life situations. It also gives the data scientist a tool to check complex systems, wherein the input parameters are high-volume or complex.
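The repeated recalculation described above can be sketched with the standard library. The cost model below is entirely hypothetical: the three inputs, their distributions, and the 10,000 trials are assumptions chosen to show the mechanics, not values from any real project.

```python
import random
import statistics


def simulate_total_cost(rng: random.Random) -> float:
    """One trial: draw each uncertain input from its assumed distribution."""
    labour = rng.triangular(80, 150, 100)             # low, high, most likely
    materials = rng.gauss(200, 20)                    # mean, standard deviation
    delay_penalty = 50 if rng.random() < 0.1 else 0   # 10% chance of a penalty
    return labour + materials + delay_penalty


rng = random.Random(42)                        # fixed seed so runs are repeatable
trials = 10_000
outcomes = sorted(simulate_total_cost(rng) for _ in range(trials))

mean_cost = statistics.mean(outcomes)
p05 = outcomes[int(0.05 * trials)]             # 5th percentile
p95 = outcomes[int(0.95 * trials)]             # 95th percentile

print(f"mean={mean_cost:.1f}  5%={p05:.1f}  95%={p95:.1f}")
```

The sorted list of outcomes is the distribution of possible results; the percentiles give a range that the total is likely to fall in, which a single-point estimate cannot provide.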

7.13 CAUSAL LOOP DIAGRAMS

A causal loop diagram (CLD) is a causal diagram that aids in visualizing how a number of variables in a system are interrelated and drive cause-and-effect processes. The diagram consists of a set of nodes and edges. Nodes represent the variables, and edges are the links that represent a connection or a relation between the two variables.

Example: The challenge is to keep the “Number of Employees Available to Work and Productivity” as high as possible.
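A causal loop diagram maps naturally onto a directed graph. The sketch below is a hedged illustration: the variable names and the edges between them are hypothetical and loosely inspired by the employee example, and `find_loop` is a helper written for this sketch that walks the edges to find one feedback loop.

```python
# Directed edges of a small causal loop diagram (hypothetical variables).
# Each edge reads "a change in the source drives a change in the target".
edges = {
    "Workload": ["Stress"],
    "Stress": ["Sick Leave"],
    "Sick Leave": ["Employees Available"],
    "Employees Available": ["Productivity", "Workload"],
    "Productivity": [],
}


def find_loop(graph, start):
    """Return one feedback loop that passes through `start`, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for target in graph.get(node, []):
            if target == start:
                return path + [start]
            if target not in path:
                stack.append((target, path + [target]))
    return None


loop = find_loop(edges, "Workload")
print(" -> ".join(loop))
```

A variable such as "Productivity" in this sketch has no outgoing edges, so no loop passes through it and `find_loop` returns `None` for it.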

7.14 PARETO CHART

A Pareto chart is a bar graph. It is also called a Pareto diagram or Pareto analysis. The lengths of the bars represent frequency or cost (time or money), and the bars are arranged with the longest on the left and the shortest on the right. In this way, the chart visually depicts which situations are more significant.

When to use Pareto Chart:

• When analysing data about the frequency of problems or causes in a process.

• When there are many problems or causes and you want to focus on the most significant.

• When analysing broad causes by looking at their specific components.

• When communicating with others about your data.

The following diagram shows how many customer complaints were received in each of five categories.
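The ordering behind a Pareto chart can be reproduced with a short script. The sketch below is illustrative only: the five category names and their counts are hypothetical, not the figures from the original diagram, and the cumulative percentage column is a common addition to Pareto charts that is assumed here.

```python
# Hypothetical complaint counts for five categories (not data from the source).
complaints = {
    "Documents": 40,
    "Product quality": 25,
    "Packaging": 15,
    "Delivery": 12,
    "Other": 8,
}

# Sort so the largest bar comes first, as on a Pareto chart.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)
total = sum(complaints.values())

pareto_rows = []
running = 0
for category, count in ordered:
    running += count
    pareto_rows.append((category, count, round(100 * running / total, 1)))

for category, count, cumulative in pareto_rows:
    # A text bar stands in for the chart; the last column is the cumulative %.
    print(f"{category:<16}{'#' * (count // 2):<22}{count:>4}{cumulative:>7}%")
```

Reading down the cumulative column shows how few categories account for most of the complaints, which is the point of focusing on the most significant causes.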

7.15 CORRELATION ANALYSIS

The most common analysis I perform at this step is the correlation analysis of all the data in the data vault. Feature development is performed between data items, to find relationships between data values.

import pandas as pd

# Eight sample observations of three data items
a = [[1, 2, 4], [5, 4.1, 9], [8, 3, 13], [4, 3, 19],
     [5, 6, 12], [5, 6, 11], [5, 6, 4.1], [4, 3, 6]]
df = pd.DataFrame(data=a)

# Pairwise correlation between the three columns
cr = df.corr()
print(cr)
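By default, each entry that pandas `corr()` reports is a Pearson correlation coefficient. The sketch below recomputes such a coefficient by hand in plain Python, as an illustration of the formula. The column names and values are hypothetical and were chosen so that the relationships are perfectly linear.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


units_sold = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]        # rises with units_sold
stock_left = [10, 8, 6, 4, 2]     # falls as units_sold rises

print(pearson(units_sold, revenue))      # close to +1: the values move together
print(pearson(units_sold, stock_left))   # close to -1: the values move in opposite directions
```

Values near zero would indicate no linear relationship between the two data items.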

7.16 FORECASTING

Forecasting is the ability to project a possible future by looking at historical data. The data vault enables these types of investigations, owing to the complete history it collects as it processes the source systems' data. You will perform many forecasting projects during your career as a data scientist and supply answers to such questions as the following:

• What should we buy?

• What should we sell?

• Where will our next business come from?

People want to know what you calculate to determine what is about to happen.
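One very simple way to project history forward is to fit a straight-line trend and extend it. The sketch below does this with ordinary least squares in plain Python. It is an illustration under assumptions: the monthly sales figures are hypothetical, the function names are invented for this example, and a linear trend is only one of many possible forecasting models.

```python
def fit_trend(values):
    """Least-squares straight line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def forecast(values, steps):
    """Extend the fitted trend `steps` periods beyond the history."""
    slope, intercept = fit_trend(values)
    n = len(values)
    return [intercept + slope * (n + i) for i in range(steps)]


# Hypothetical monthly sales history (units).
monthly_sales = [100, 110, 120, 130, 140, 150]
next_quarter = forecast(monthly_sales, 3)
print([round(v, 1) for v in next_quarter])
```

Because the sample history grows by the same amount every month, the projection simply continues that growth; real data would scatter around the fitted line.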

7.17 DATA SCIENCE

Data science works best when approved techniques and algorithms are followed.

After performing experiments on the data, the results must be verified and supported by evidence.

Data science that works follows these steps:

Step 1: Begin with a question.

Step 2: Design a model, select a prototype for the data, and start a virtual simulation. Some statistics and mathematical solutions can be added to start a data science model. All questions must relate to the customer's business, in such a way that the answers provide insight into the business.

Step 3: Formulate a hypothesis based on the collected observations. Using the model, process the observations and prove whether the hypothesis is true or false.

Step 4: Compare the result with real-world observations and provide these results to the real-life business.

Step 5: Communicate the progress and intermediate results to customers and subject experts, and involve them in the whole process to ensure that they are part of the journey of discovery.

7.18 UNIT END QUESTIONS

1. Explain the process superstep.

2. Explain the concept of a data vault.

3. What are the different typical reference satellites? Explain.

4. Explain the TPOLE design principle.

5. Explain the Time section of TPOLE.

6. Explain the Person section of TPOLE.

7. Explain the Object section of TPOLE.

8. Explain the Location section of TPOLE.

9. Explain the Event section of TPOLE.

10. Explain the different date and time formats. What is leap year? Explain.

11. What is an event? Explain explicit and implicit events.

12. How do you complete the 5 Whys?

13. What is a fishbone diagram? Explain with example.

14. Explain the significance of Monte Carlo Simulation and Causal Loop Diagram.

15. What are pareto charts? What information can be obtained from pareto charts?

16. Explain the use of correlation and forecasting in data science.

17. State and explain the five steps of data science.

Saturday, February 12, 2022
