PROCESS SUPERSTEP
7.0 OBJECTIVES
The objective of this chapter is to learn the Time-Person-Object-Location-Event (T-P-O-L-E) design principle and the various concepts that are used to create and define relationships among this data.
7.1 INTRODUCTION
The Process superstep takes the assessed results of the retrieved versions of the data sources and loads them into a highly structured data vault. This data vault forms the basic data structure for the rest of the data science steps. The Process superstep is the amalgamation procedure that pipes your data sources into five primary classifications of data.
7.2 DATA VAULT
Data vault modelling is a technique for managing long-term storage of data from multiple operational systems. It stores historical data in the database.
7.2.1 Hubs:
A data vault hub is used to store business keys. These keys do not change over time. A hub also contains a surrogate key for each hub entry and metadata information for the business key.
7.2.2 Links:
Data vault links are join relationships between business keys.
7.2.3 Satellites:
Data vault satellites store the chronological and descriptive characteristics for a specific section of business data. Hubs and links give us the model structure but no chronological characteristics. Satellites consist of characteristics and metadata linking them to their specific hub.
7.2.4 Reference Satellites:
Reference satellites are referenced from satellites and can be used by other satellites to prevent redundant storage of reference characteristics.
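To make these building blocks concrete, the sketch below models a hub, a link, and a satellite as plain Python dataclasses. The class names, field names, and sample values (`Hub`, `Link`, `Satellite`, `record_source`, and so on) are illustrative assumptions, not a prescribed data vault schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


def new_key() -> str:
    """Generate a surrogate key (assumption: a random UUID string)."""
    return str(uuid4())


@dataclass(frozen=True)
class Hub:
    business_key: str          # stable key from the business, e.g. a customer number
    record_source: str         # metadata: where the key was first seen
    surrogate_key: str = field(default_factory=new_key)


@dataclass(frozen=True)
class Link:
    hub_keys: tuple            # surrogate keys of the hubs being related
    record_source: str
    surrogate_key: str = field(default_factory=new_key)


@dataclass(frozen=True)
class Satellite:
    parent_key: str            # surrogate key of the hub being described
    load_time: datetime        # when this version of the attributes was loaded
    attributes: dict           # descriptive characteristics valid from load_time


# Hypothetical sample data
person = Hub(business_key="CUST-001", record_source="crm")
car = Hub(business_key="VIN-123", record_source="fleet")
owns = Link(hub_keys=(person.surrogate_key, car.surrogate_key), record_source="fleet")

# Two satellite rows keep the history of a changing attribute.
history = [
    Satellite(person.surrogate_key, datetime(2020, 1, 1, tzinfo=timezone.utc), {"city": "London"}),
    Satellite(person.surrogate_key, datetime(2021, 6, 1, tzinfo=timezone.utc), {"city": "Mumbai"}),
]
latest = max(history, key=lambda s: s.load_time)
print(latest.attributes["city"])  # Mumbai
```

The hub holds only the key, the link holds only the relationship, and every change over time lands in a new satellite row, so nothing is overwritten.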
7.3 TIME-PERSON-OBJECT-LOCATION-EVENT DATA VAULT
We will use the Time-Person-Object-Location-Event (T-P-O-L-E) design principle.
All five sections are linked with each other, resulting in sixteen links.
7.4 TIME SECTION
The time section contains the data structure to store all time-related information, for example, the time at which an event occurred.
7.4.1 Time Hub:
The time hub acts as a connector between time zones. Following are the fields of the time hub.
7.4.2 Time Links:
Following are the time links that can be stored as separate links.
• Time-Person Link
  • This link connects date-time values from the time hub to the person hub.
  • Dates such as birthdays, anniversaries, book access dates, etc.
• Time-Object Link
  • This link connects date-time values from the time hub to the object hub.
  • Dates such as when you buy or sell a car, house, or book, etc.
• Time-Location Link
  • This link connects date-time values from the time hub to the location hub.
  • Dates such as when you moved or accessed a book from a post code, etc.
• Time-Event Link
  • This link connects date-time values from the time hub to the event hub.
  • Dates such as when you changed vehicles, etc.
7.4.3 Time Satellites:
Following are the fields of time satellites.
A time satellite can be used to move from one time zone to another very easily. This feature will be used during the Transform superstep.
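As a rough illustration of that idea, the sketch below keeps one UTC value as the hub key and derives the local representations from it. It uses only fixed UTC offsets from the standard library, so daylight saving time is ignored; the zone labels and the `time_satellites` helper are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets; real zones with daylight saving need pytz or zoneinfo.
ZONES = {
    "UTC": timezone.utc,
    "India (UTC+05:30)": timezone(timedelta(hours=5, minutes=30)),
    "New York winter (UTC-05:00)": timezone(timedelta(hours=-5)),
}


def time_satellites(utc_key: datetime) -> dict:
    """Return one local date-time string per zone for a single UTC hub key."""
    return {
        name: utc_key.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S (%z)")
        for name, tz in ZONES.items()
    }


hub_key = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
for zone, local_time in time_satellites(hub_key).items():
    print(zone, "->", local_time)
```

Because every local value is derived from the same UTC key, moving between zones never requires touching the hub itself.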
7.5 PERSON SECTION
The person section contains the data structure to store all data related to a person.
7.5.1 Person Hub:
Following are the fields of the person hub.
7.5.2 Person Links:
Person links connect the person hub to other hubs.
Following are the person links that can be stored as separate links.
• Person-Time Link
  • This link contains the relationship between the person hub and the time hub.
• Person-Object Link
  • This link contains the relationship between the person hub and the object hub.
• Person-Location Link
  • This link contains the relationship between the person hub and the location hub.
• Person-Event Link
  • This link contains the relationship between the person hub and the event hub.
7.5.3 Person Satellites:
Person satellites are part of the vault. They hold information such as the birth date, anniversary, or validity dates of an ID for the respective person.
7.6 OBJECT SECTION
The object section contains the data structure to store all data related to an object.
7.6.1 Object Hub:
The object hub represents a real-world object with a few attributes. Following are the fields of the object hub.
7.6.2 Object Links:
Object links connect the object hub to other hubs.
Following are the object links that can be stored as separate links.
• Object-Time Link
  • This link contains the relationship between the object hub and the time hub.
• Object-Person Link
  • This link contains the relationship between the object hub and the person hub.
• Object-Location Link
  • This link contains the relationship between the object hub and the location hub.
• Object-Event Link
  • This link contains the relationship between the object hub and the event hub.
7.6.3 Object Satellites:
Object satellites are part of the vault. They hold information such as the ID, UUID, type, key, etc. for the respective object.
7.7 LOCATION SECTION
The location section contains the data structure to store all data related to a location.
7.7.1 Location Hub:
The location hub consists of a series of fields that support a GPS location. It consists of the following fields:
7.7.2 Location Links:
Location links connect the location hub to other hubs.
Following are the location links that can be stored as separate links.
• Location-Time Link
  • This link contains the relationship between the location hub and the time hub.
• Location-Person Link
  • This link contains the relationship between the location hub and the person hub.
• Location-Object Link
  • This link contains the relationship between the location hub and the object hub.
• Location-Event Link
  • This link contains the relationship between the location hub and the event hub.
7.7.3 Location Satellites:
Location satellites are the part of the vault that contains the locations of entities.
7.8 EVENT SECTION
The event section contains the data structure to store all data of entities related to an event that has occurred.
7.8.1 Event Hub:
The event hub contains various fields that store real-world events.
7.8.2 Event Links:
Event links connect the event hub to other hubs.
Following are the event links that can be stored as separate links.
• Event-Time Link
  • This link contains the relationship between the event hub and the time hub.
• Event-Person Link
  • This link contains the relationship between the event hub and the person hub.
• Event-Object Link
  • This link contains the relationship between the event hub and the object hub.
• Event-Location Link
  • This link contains the relationship between the event hub and the location hub.
7.8.3 Event Satellites:
Event satellites are the part of the vault that contains information about the events that occur in the system.
7.9 ENGINEERING A PRACTICAL PROCESS SUPERSTEP
Time:
Time is the most important characteristic of data used to record when an event occurred. ISO 8601:2004 defines an international standard for interchange formats for dates and times.
The following entities are part of the ISO 8601:2004 standard:
Year, month, day, hour, minute, second, and fraction of a second
The date/time is recorded from largest (year) to smallest (fraction of a second). These values must have a pre-approved fixed number of digits that are padded with leading zeros.
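A quick way to see the largest-to-smallest ordering and the zero padding is Python's built-in `isoformat()`, which emits an ISO 8601 style string. The sample date below is an arbitrary assumption.

```python
from datetime import datetime, timezone

sample = datetime(2020, 1, 2, 3, 4, 5, 6, tzinfo=timezone.utc)

# Year-month-day, then hour:minute:second and the fraction, all zero padded.
print(sample.isoformat())                         # 2020-01-02T03:04:05.000006+00:00
print(sample.isoformat(timespec="seconds"))       # 2020-01-02T03:04:05+00:00
print(sample.isoformat(timespec="milliseconds"))  # 2020-01-02T03:04:05.000+00:00

# Parsing the string back gives the same value.
round_trip = datetime.fromisoformat(sample.isoformat())
print(round_trip == sample)                       # True
```

A side effect of this ordering is that the strings sort chronologically when sorted as plain text, provided they share the same UTC offset.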
Year
The standard uses four digits to represent the year. The values range from 0000 to 9999.
AD/BC dates require conversion.
from datetime import datetime
from pytz import timezone
# Build a fixed sample date-time and mark it as UTC
now_date = datetime(2020, 1, 2, 3, 4, 5, 6)
now_utc = now_date.replace(tzinfo=timezone('UTC'))
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Year:', now_utc.strftime("%Y"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Year: 2020
Month
The standard uses two digits to represent the month. The values range from 01 to 12.
The rule for a valid month is that 12 January 2020 becomes 2020-01-12.
The program above can be extended to extract the month value.
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Month:', now_utc.strftime("%m"))
print('Month Name:', now_utc.strftime("%B"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Month: 01
Month Name: January
Day
The standard uses two digits to represent the day. The values range from 01 to 31.
The rule for a valid day is that 22 January 2020 becomes 2020-01-22 or +2020-01-22.
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Day:', now_utc.strftime("%d"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Day: 02
Hour:
The standard uses two digits to represent the hour. The values range from 00 to 24.
The valid format is hhmmss or hh:mm:ss. The shortened format hhmm or hh:mm is also accepted.
00:00:00 denotes the beginning of the calendar day. 24:00:00 is used only to indicate the end of the calendar day.
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Hour:', now_utc.strftime("%H"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Hour: 03
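The `datetime` constructor only accepts hours 0 to 23, so an end-of-day value written as 24:00:00 has to be mapped to 00:00:00 of the next day before it can be stored. A minimal sketch of that mapping follows; the helper name `parse_iso_end_of_day` and the accepted input layout are assumptions.

```python
from datetime import datetime, timedelta


def parse_iso_end_of_day(text: str) -> datetime:
    """Parse 'YYYY-MM-DD hh:mm:ss', treating 24:00:00 as the next day's 00:00:00."""
    date_part, time_part = text.split(" ")
    if time_part == "24:00:00":
        midnight = datetime.strptime(date_part, "%Y-%m-%d")
        return midnight + timedelta(days=1)
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


print(parse_iso_end_of_day("2020-01-02 24:00:00"))  # 2020-01-03 00:00:00
print(parse_iso_end_of_day("2020-01-02 00:00:00"))  # 2020-01-02 00:00:00
```

Both notations name a midnight; the only difference is which calendar day the value is attached to.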
Minute:
The standard uses two digits to represent the minute. The values range from 00 to 59.
The valid format is hhmmss or hh:mm:ss.
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Minute:', now_utc.strftime("%M"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Minute: 04
Second:
The standard uses two digits to represent the second. The values range from 00 to 59.
The valid format is hhmmss or hh:mm:ss.
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Second:', now_utc.strftime("%S"))
Output:
Date: 2020-01-02 03:04:05 (UTC) (+0000)
Second: 05
The fraction of a second is only defined as a format: hhmmss,sss or hh:mm:ss,sss or hhmmss.sss or hh:mm:ss.sss.
The current commonly used formats are the following:
• hh:mm:ss.s: Tenth of a second
• hh:mm:ss.ss: Hundredth of a second
• hh:mm:ss.sss: Thousandth of a second
print('Date:', now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
print('Millionth of Second:', now_utc.strftime("%f"))
Coordinated Universal Time (UTC)
A sample program to display the current time.
from datetime import datetime
from pytz import timezone
# Get the current local time (naive, no time zone attached)
now_date_local = datetime.now()
# Treat the local time as being in 'Etc/GMT-4'
# (an assumed zone name; substitute the zone you are actually in)
now_date = timezone('Etc/GMT-4').localize(now_date_local)
# Convert to the time in Mumbai, India (zone 'Asia/Kolkata')
now_india = now_date.astimezone(timezone('Asia/Kolkata'))
print('India Date Time:', now_india.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)"))
Output:
7.9.1 Event:
This structure records any specific event or action that is discovered in the data sources. An event is any action that occurs within the data sources. Events are recorded using three main data entities: Event Type, Event Group, and Event Code. The details of each event are recorded as a set of details against the event code. There are two main types of events.
7.9.2 Explicit Event:
This type of event is stated in the data source clearly and with full details. There is clear data to show that the specific action was performed. The following are examples of explicit events:
• A security card with number 1234 was used to open door A.
• You are reading Chapter 9 of Practical Data Science.
• I bought ten cans of beef curry.
Explicit events are the events that the source systems supply, as these have direct data that proves that the specific action was performed.
7.9.3 Implicit Event:
This type of event is formulated from characteristics of the data in the source systems plus a series of insights on the data relationships.
The following are examples of implicit events:
• A security card with number 8884.1 was used to open door X.
• A security card with number 8884.1 was issued to Mr. Vermeulen.
• Room 302 is fitted with a security reader marked door X.
Taken together, these three events would imply that Mr. Vermeulen entered room 302. Not true! The records only show that his card was used at the door, not who was carrying it.
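The chain of reasoning behind an implicit event can be sketched as a join over explicit records. The record layout and the `implied_room_entries` helper are hypothetical; the values are the ones from the example.

```python
# Explicit events and facts, as they might arrive from source systems.
card_uses = [{"card": "8884.1", "door": "X"}]
card_owners = {"8884.1": "Mr. Vermeulen"}
door_rooms = {"X": "Room 302"}


def implied_room_entries(uses, owners, rooms):
    """Combine explicit records into implicit 'person entered room' candidates."""
    events = []
    for use in uses:
        person = owners.get(use["card"])
        room = rooms.get(use["door"])
        if person and room:
            events.append({
                "event_type": "implicit",
                "person": person,
                "room": room,
                # Not proof: the card may have been used by someone else.
                "status": "unverified",
            })
    return events


for event in implied_room_entries(card_uses, card_owners, door_rooms):
    print(event)
```

Marking the derived record as unverified keeps the inference visibly separate from the explicit events it was built from.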
7.10 5-WHYS TECHNIQUE
Data science is at its core about curiosity and inquisitiveness. This core is rooted in the 5 Whys. The 5 Whys is a technique used in the analysis phase of data science.
7.10.1 Benefits of the 5 Whys:
The 5 Whys assist the data scientist to identify the root cause of a problem and determine the relationship between different root causes of the same problem. It is one of the simplest investigative tools—easy to complete without intense statistical analysis.
7.10.2 When Are the 5 Whys Most Useful?:
The 5 Whys are most useful for finding solutions to problems that involve human factors or interactions that generate multi-layered data problems. In day-to-day business life, they can be used in real-world businesses to find the root causes of issues.
7.10.3 How to Complete the 5 Whys?:
1. Write down the specific problem. This helps you to formalize the problem and describe it completely. It also helps the data science team to focus on the same problem.
2. Ask why the problem occurred and write the answer below the problem.
3. If the answer you provided doesn’t identify the root cause of the problem that you wrote down first, ask why again, and write down that answer.
4. Loop back to the preceding step until you and your customer agree that the problem’s root cause is identified. This may require fewer or more than five whys.
Example:
Problem Statement:
Customers are unhappy because they are being shipped products that don’t meet their specifications.
1. Why are customers being shipped bad products?
• Because manufacturing built the products to a specification that is different from what the customer and the salesperson agreed to.
2. Why did manufacturing build the products to a different specification than that of sales?
• Because the salesperson accelerates work on the shop floor by calling the head of manufacturing directly to begin work. An error occurred when the specifications were being communicated or written down.
3. Why does the salesperson call the head of manufacturing directly to start work instead of following the procedure established by the company?
• Because the “start work” form requires the sales director’s approval before work can begin and slows the manufacturing process (or stops it when the director is out of the office).
4. Why does the form contain an approval for the sales director?
• Because the sales director must be continually updated on sales for discussions with the CEO, as my retailer customer was a top ten key account.
In this case, only four whys were required to determine that a non-value-added signature authority helped to cause a process breakdown in the quality assurance for a key account! The rest was just criminal.
The external buyer at the wholesaler knew this process was regularly bypassed and started buying the bad tins to act as an unofficial backfill for the failing quality-assurance process in manufacturing, to make up the shortfalls in sales demand. The wholesaler simply relabelled the product and did not change how it was manufactured. The reason? Big savings lead to big bonuses. A key client’s orders had to be filled. Sales are important!
7.11 FISHBONE DIAGRAMS
The fishbone diagram, or Ishikawa diagram, is a useful tool for finding where each data item fits into the data vault. It is a cause-and-effect diagram that helps managers to track down the reasons for imperfections, variations, defects, or failures. The diagram looks just like a fish’s skeleton, with the problem at its head and the causes of the problem feeding into the spine. Once all the causes that underlie the problem have been identified, managers can start looking for solutions to ensure that the problem doesn’t become a recurring one.
It can also be used in product development. Having a problem-solving product will ensure that your new development will be popular, provided people care about the problem you’re trying to solve. The fishbone diagram strives to pinpoint everything that’s wrong with current market offerings so that you can develop an innovation that doesn’t have these problems.
Finally, the fishbone diagram is also a great way to look for and prevent quality problems before they ever arise. Use it to troubleshoot before there is trouble, and you can overcome all or most of your teething troubles when introducing something new.
7.12 MONTE CARLO SIMULATION
The Monte Carlo simulation technique performs analysis by building models of possible results, substituting a range of values (a probability distribution) for each parameter that has inherent uncertainty. It then calculates results over and over, each time using a different set of random values drawn from the probability distributions. Depending on the number of uncertainties and the ranges specified for them, a Monte Carlo simulation can involve thousands or tens of thousands of recalculations before it is complete. Monte Carlo simulation produces distributions of possible outcome values. As a data scientist, this gives you an indication of how your model will react under real-life situations. It also gives you a tool to check complex systems in which the input parameters are high-volume or complex.
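As a minimal sketch of the idea, the example below estimates the distribution of a project's total duration when each of three task durations is uncertain. The task names, the triangular distributions, and the number of runs are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# (low, most likely, high) duration in days for each hypothetical task
TASKS = {"retrieve": (2, 3, 6), "assess": (1, 2, 4), "process": (3, 5, 10)}


def simulate_once() -> float:
    """Draw one random duration per task and return the total."""
    return sum(random.triangular(low, high, mode) for low, mode, high in TASKS.values())


# Recalculate the model many times and keep the sorted outcomes.
runs = sorted(simulate_once() for _ in range(10_000))

print("mean total days:", round(statistics.mean(runs), 2))
print("5th percentile :", round(runs[int(0.05 * len(runs))], 2))
print("95th percentile:", round(runs[int(0.95 * len(runs))], 2))
```

The result is a spread of outcomes rather than a single number, which is what lets you say how likely it is that the total stays under a given limit.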
7.13 CAUSAL LOOP DIAGRAMS
A causal loop diagram (CLD) is a causal diagram that aids in visualizing how a number of variables in a system are interrelated and drive cause-and-effect processes. The diagram consists of a set of nodes and edges. Nodes represent the variables, and edges are the links that represent a connection or a relation between the two variables.
Example: The challenge is to keep the “Number of Employees Available to Work and Productivity” as high as possible.
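One way to hold such a diagram in code is a dictionary of signed edges, where +1 means the two variables move in the same direction and -1 means they move in opposite directions. The variables and signs below are invented for illustration and are not taken from the chapter's diagram. A loop whose edge signs multiply to +1 is reinforcing, and one whose signs multiply to -1 is balancing.

```python
from math import prod

# (cause, effect) -> polarity; hypothetical variables for illustration
EDGES = {
    ("workload", "overtime"): +1,
    ("overtime", "fatigue"): +1,
    ("fatigue", "productivity"): -1,
    ("productivity", "workload"): -1,
    ("fatigue", "employees available"): -1,
}


def loop_type(nodes: list) -> str:
    """Classify a closed loop, given its nodes in order, by the product of edge signs."""
    pairs = zip(nodes, nodes[1:] + nodes[:1])
    sign = prod(EDGES[pair] for pair in pairs)
    return "reinforcing" if sign > 0 else "balancing"


print(loop_type(["workload", "overtime", "fatigue", "productivity"]))  # reinforcing
```

In this made-up loop, more workload leads to more overtime, more fatigue, lower productivity, and so even more workload, which is why it comes out as reinforcing.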
7.14 PARETO CHART
A Pareto chart is a bar graph, also called a Pareto diagram or Pareto analysis. The lengths of the bars represent frequency or cost (time or money), and the bars are arranged with the longest on the left and the shortest on the right. In this way the chart visually depicts which situations are more significant.
When to use Pareto Chart:
• When analysing data about the frequency of problems or causes in a process.
• When there are many problems or causes and you want to focus on the most significant.
• When analysing broad causes by looking at their specific components.
• When communicating with others about your data.
The following diagram shows how many customer complaints were received in each of five categories.
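The numbers behind such a chart are just counts sorted in descending order with a running cumulative percentage. The complaint categories and counts below are made up for illustration and are not the data from the chapter's figure.

```python
from itertools import accumulate

# Hypothetical complaint counts per category
complaints = {"Documents": 12, "Packaging": 8, "Delivery": 45, "Product quality": 30, "Other": 5}

# Sort so the most frequent category comes first, as on a Pareto chart.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)
total = sum(complaints.values())
running = accumulate(count for _, count in ordered)

for (category, count), cumulative in zip(ordered, running):
    bar = "#" * (count // 2)  # crude text bar, one '#' per two complaints
    print(f"{category:<16}{count:>4} {100 * cumulative / total:6.1f}%  {bar}")
```

With these invented counts, the first two categories already account for 75% of all complaints, which is the kind of focus a Pareto chart is meant to reveal.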
7.15 CORRELATION ANALYSIS
The most common analysis I perform at this step is the correlation analysis of all the data in the data vault. Feature development is performed between data items to find relationships between data values.
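import pandas as pd
# Eight observations (rows) of three variables (columns)
a = [[1, 2, 4], [5, 4.1, 9], [8, 3, 13], [4, 3, 19],
     [5, 6, 12], [5, 6, 11], [5, 6, 4.1], [4, 3, 6]]
df = pd.DataFrame(data=a)
# Pairwise correlation between the columns
cr = df.corr()
print(cr)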
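`DataFrame.corr()` computes the Pearson correlation coefficient by default. To see what a single entry of that matrix means, the sketch below computes the coefficient between the first two columns of the same data by hand, using only the standard library. The helper name `pearson` is an assumption.

```python
from math import sqrt

a = [[1, 2, 4], [5, 4.1, 9], [8, 3, 13], [4, 3, 19],
     [5, 6, 12], [5, 6, 11], [5, 6, 4.1], [4, 3, 6]]


def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


col0 = [row[0] for row in a]
col1 = [row[1] for row in a]
print(round(pearson(col0, col1), 4))  # correlation between columns 0 and 1
print(round(pearson(col0, col0), 4))  # a column against itself is 1.0
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.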
7.16 FORECASTING
Forecasting is the ability to project a possible future by looking at historical data. The data vault enables these types of investigations, owing to the complete history it collects as it processes the source systems’ data. You will perform many forecasting projects during your career as a data scientist and supply answers to such questions as the following:
• What should we buy?
• What should we sell?
• Where will our next business come from?
People want to know what you calculate to determine what is about to happen.
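As a deliberately simple illustration, the sketch below fits a straight-line trend to a short history by least squares and projects it forward. The monthly sales figures are invented, and a linear trend is only one of many possible forecasting models.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Hypothetical units sold in six consecutive months
history = [100, 110, 120, 130, 140, 150]
slope, intercept = fit_trend(history)

# Project the next three months from the fitted line.
forecast = [round(intercept + slope * month, 1) for month in range(len(history), len(history) + 3)]
print(forecast)  # [160.0, 170.0, 180.0]
```

The longer and more complete the history in the data vault, the more data points a model like this has to work with.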
7.17 DATA SCIENCE
Data science works best when approved techniques and algorithms are followed.
After performing various experiments on the data, the result must be verified and must be supported by evidence.
Data science that works follows these steps:
Step 1: It begins with a question.
Step 2: Design a model, select a prototype for the data, and start a virtual simulation. Some statistical and mathematical solutions can be added to start a data science model. All questions must be related to the customer's business, in such a way that the answer provides insight into the business.
Step 3: Formulate a hypothesis based on the collected observations. Use the model to process the observations and prove whether the hypothesis is true or false.
Step 4: Compare the above result with real-world observations and provide these results to the real-life business.
Step 5: Communicate the progress and intermediate results to customers and subject experts, and involve them in the whole process to ensure that they are part of the journey of discovery.
7.18 UNIT END QUESTIONS
1. Explain the process superstep.
2. Explain the concept of a data vault.
3. What are the different typical reference satellites? Explain.
4. Explain the TPOLE design principle.
5. Explain the Time section of TPOLE.
6. Explain the Person section of TPOLE.
7. Explain the Object section of TPOLE.
8. Explain the Location section of TPOLE.
9. Explain the Event section of TPOLE.
10. Explain the different date and time formats. What is leap year? Explain.
11. What is an event? Explain explicit and implicit events.
12. How do you complete the 5 Whys?
13. What is a fishbone diagram? Explain with example.
14. Explain the significance of Monte Carlo Simulation and Causal Loop Diagram.
15. What are Pareto charts? What information can be obtained from Pareto charts?
16. Explain the use of correlation and forecasting in data science.
17. State and explain the five steps of data science.