Unit Structure
10.0 OBJECTIVES
The objective of this chapter is to learn the Transform step of data transformation, which turns data into knowledge and converts results into meaningful insights.
10.1 INTRODUCTION
The Transform superstep allows you to take data from the data vault and formulate answers to questions. The Transform step is the part of the data science process that converts results into meaningful insights.
10.2 OVERVIEW
The scenario below illustrates this. Data is categorised into five different dimensions:
1. Time
2. Person
3. Object
4. Location
5. Event
10.3 DIMENSION CONSOLIDATION
The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs.
10.4 THE SUN MODEL
The use of sun models is a technique that enables the data scientist to perform consistent dimension consolidation, by explaining the intended data relationship with the business, without exposing it to the technical details required to complete the transformation processing.
The sun model is constructed to show all the characteristics from the two data vault hub categories you are planning to extract. It explains how you will create two dimensions and a fact via the Transform step.
10.5 TRANSFORMING WITH DATA SCIENCE
10.5.1 Missing Value Treatment:
You must describe in detail what the missing value treatments are for the data lake transformation. Make sure you take your business community with you along the journey. At the end of the process, they must trust your techniques and results. If they trust the process, they will implement the business decisions that you, as a data scientist, aspire to achieve.
Why is missing value treatment required?
Explain with notes on the data traceability matrix why there is missing data in the data lake. Remember: Every inconsistency in the data lake is conceivably the missing insight your customer is seeking from you as a data scientist. So, find them and explain them. Your customer will exploit them for business value.
Why Does Data Have Missing Values?:
The 5 Whys is the technique that helps you to get to the root cause of your analysis. The use of cause-and-effect fishbone diagrams will assist you to resolve those questions. I have found the following common reasons for missing data:
• Data fields renamed during upgrades
• Migration processes from old systems to new systems where mappings were incomplete
• Incorrect tables supplied in loading specifications by subject-matter expert
• Data simply not recorded, as it was not available
• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry
• Someone else's "bad" data science. People and projects make mistakes, and you will have to fix their errors in your own data science.
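As a purely illustrative sketch of what such treatments can look like in practice (assuming pandas and NumPy are available; the DataFrame, column names and imputation choices are invented for this example, not prescribed by the source):

# A minimal sketch of common missing-value treatments with pandas.
# The columns and values below are invented for illustration only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "city": ["Mumbai", "Pune", None, "Delhi", "Pune"],
    "income": [52000, 61000, np.nan, 58000, 49000],
})

print(df.isna().sum())                              # how much is missing per column

df["age"] = df["age"].fillna(df["age"].median())    # numeric: impute with the median
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("Unknown")           # categorical: flag as "Unknown"

df = df.dropna()                                    # or drop rows that remain incomplete
print(df)

Whichever treatment you choose, record it in the data traceability matrix so the business community can see exactly how each missing value was handled.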
10.5.2 Techniques of Outlier Detection and Treatment:
10.6 HYPOTHESIS TESTING
Hypothesis testing is not precisely an algorithm, but it is a must-know technique for any data scientist. You cannot progress until you have thoroughly mastered it. Hypothesis testing is the process by which statistical tests are used to check whether a hypothesis is true, using data. Based on the result of the test, data scientists choose to accept or reject the hypothesis. When an event occurs, it can be a trend or happen by chance. To check whether the event is an important occurrence or just happenstance, hypothesis testing is necessary.
There are many tests for hypothesis testing, but the following two are the most popular: the t-test and the chi-square test.
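As a hedged sketch of the first of these, the two-sample t-test below uses SciPy on synthetic data; the group sizes, means and the 0.05 cut-off are illustrative assumptions, not values from the source:

# A minimal sketch of a two-sample t-test with SciPy on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)   # e.g., a control group
group_b = rng.normal(loc=53, scale=5, size=30)   # e.g., a treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# A p-value below the chosen significance level (e.g., 0.05) leads us to
# reject the null hypothesis that the two group means are equal.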
10.7 CHI-SQUARE TEST
There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
A chi-square goodness of fit test determines whether sample data match a population. For more details on this type, see: Goodness of Fit Test.
A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from each other. A very small chi-square test statistic means that your observed data fit the data expected under independence extremely well; in other words, there is no evidence of a relationship. A very large chi-square test statistic means that the observed data do not fit the expected data well; in other words, there is evidence of a relationship between the variables.
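A hedged sketch of the test for independence, using SciPy's chi2_contingency on an invented contingency table (the categories and counts are not from the source):

# A minimal sketch of a chi-square test for independence with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: two groups; columns: three categories (hypothetical counts).
observed = np.array([
    [30, 10, 20],
    [25, 15, 30],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")

# A small p-value (e.g., below 0.05) suggests the two variables are related.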
10.8 UNIVARIATE ANALYSIS
Univariate analysis is the simplest form of analysing data. "Uni" means "one", so in other words your data has only one variable. It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
Univariate analysis is used to identify those individual metabolites which, either singly or multiplexed, are capable of differentiating between biological groups, such as separating tumour bearing mice from nontumor bearing (control) mice. Statistical procedures used for this analysis include a t-test, ANOVA, Mann–Whitney U test, Wilcoxon signed-rank test, and logistic regression. These tests are used to individually or globally screen the measured metabolites for an association with a disease.
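As an illustrative sketch (assuming pandas and matplotlib are available; the variable and values are synthetic, not the biological data mentioned above), univariate analysis of a single variable might look like this:

# A minimal sketch of univariate analysis: summarise and plot one variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
heights = pd.Series(rng.normal(loc=170, scale=8, size=200), name="height_cm")

print(heights.describe())        # count, mean, std, min, quartiles, max
print("skew:", heights.skew())   # shape of the distribution

ax = heights.plot(kind="hist", bins=20, title="Distribution of height (cm)")
ax.figure.savefig("height_hist.png")   # a histogram is the usual univariate plot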
10.9 BIVARIATE ANALYSIS
Bivariate analysis is when two variables are analysed together for any possible association or empirical relationship, for example the correlation between gender and graduation with a data science degree. Canonical correlation in the experimental context takes two sets of variables and examines what is common between the two sets. Graphs that are appropriate for bivariate analysis depend on the type of variable. For two continuous variables, a scatterplot is a common graph. When one variable is categorical and the other continuous, a box plot is common, and when both are categorical, a mosaic plot is common.
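A hedged sketch of bivariate analysis for two continuous variables, using NumPy and matplotlib on synthetic data (the variables "study hours" and "exam score" are invented for illustration):

# A minimal sketch of bivariate analysis: correlation plus a scatterplot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
study_hours = rng.uniform(0, 10, size=100)
exam_score = 40 + 5 * study_hours + rng.normal(0, 5, size=100)

r = np.corrcoef(study_hours, exam_score)[0, 1]
print(f"Pearson correlation: {r:.2f}")

plt.scatter(study_hours, exam_score)
plt.xlabel("Study hours")
plt.ylabel("Exam score")
plt.title("Bivariate analysis: scatterplot")
plt.savefig("scatter.png")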
10.10 MULTIVARIATE ANALYSIS
A single metabolite biomarker is generally insufficient to differentiate between groups. For this reason, a multivariate analysis, which identifies sets of metabolites (e.g., patterns or clusters) in the data, can result in a higher likelihood of group separation. Statistical methods for this analysis include unsupervised methods, such as principal component analysis (PCA) or cluster analysis, and supervised methods, such as linear discriminant analysis (LDA), partial least squares (PLS), PLS discriminant analysis (PLS-DA), artificial neural networks (ANN), and machine learning methods. These methods provide an overview of a large dataset that is useful for identifying patterns and clusters in the data and expressing the data to visually highlight similarities and differences. Unsupervised methods may reduce potential bias, since the classes are unlabelled.
Regardless of one's choice of method for statistical analysis, it is necessary to subsequently validate the identified potential biomarkers and therapeutic targets by examining them in new and separate sample sets (for biomarkers), and in vitro and/or in vivo experiments evaluating the identified pathways or molecules (for therapeutic targets).
10.11 LINEAR REGRESSION
10.12 LOGISTIC REGRESSION
Logistic regression is another technique, used for modelling binary (dichotomous) classification problems within a regression framework. Logistic regression is a predictive analysis technique. Its output can be difficult to interpret, so there are tools available to help. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
It can answer complex but dichotomous questions such as:
• The probability of attending college (YES or NO), given that the syllabus is finished and the faculty is interesting but boring at times, depending upon the mood and behaviour of the students in class.
• The probability of finishing the lunch sent by my mother (YES or NO), which depends upon multiple aspects: a) mood, b) better options available, c) food taste, d) scolding by mom, e) a friend's open treat, and so on.
Hence, logistic regression predicts the probability of an outcome that can only have two values (YES or NO).
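A hedged sketch with scikit-learn, loosely modelled on the "attend college" example above; the features, data and their relationship to the outcome are invented for illustration:

# A minimal sketch of logistic regression for a YES/NO outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Hypothetical features: [share of syllabus covered (0-10), mood score (0-10)]
X = rng.uniform(0, 10, size=(200, 2))
# Hypothetical outcome: attends college (1) or not (0), loosely driven by the features.
y = ((0.4 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 1, 200)) > 5).astype(int)

model = LogisticRegression()
model.fit(X, y)

new_student = np.array([[6.0, 8.0]])
print("P(YES):", model.predict_proba(new_student)[0, 1])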
10.13 CLUSTERING TECHNIQUES
Clustering is an unsupervised learning model. Similar to classification, it helps create different sets of classes by grouping similar items together, creating/identifying clusters of similar types. Clustering is the task of dividing a population or data set into homogeneous groups. It does so by identifying similar data types or nearby data elements on a graph. In classification, the classes are defined in advance, with the help of algorithms or predefined classes, and the data inputs are then assigned to them; in clustering, the algorithm itself decides, from the inputs, how many clusters to form, depending upon their similarity traits. These similar sets of inputs form groups called clusters. Clustering is the more dynamic model in terms of grouping.
A basic comparison of clustering and classification is given below:
We can classify clustering into two categories: soft clustering and hard clustering. Let me give one example to explain this. Suppose we are developing a website for writing blogs. Each blog belongs to a particular category, such as Science, Technology, Arts or Fiction. It is possible that a written article could belong or relate to two or more categories. If we restrict our blogger to choosing only one of the categories, we would call this a "hard or strict clustering" method, where an item can remain in only one category. Now suppose this work is automated by our piece of code, which chooses categories on the basis of the blog content. If the algorithm chooses exactly one of the given clusters for the blog, it is called "hard or strict clustering". In contrast, if the algorithm is allowed to select more than one cluster for the blog content, it is called a "soft or loose clustering" method.
Clustering methods should consider following important requirements:
• Robustness
• Flexibility
• Efficiency
Clustering Algorithms/Methods:
There are several clustering algorithms/Methods available, of which we will be explaining a few:
• Connectivity Clustering Method: This model is based on the connectivity between the data points. These models are based on the notion that the data points closer in data space exhibit more similarity to each other than the data points lying farther away.
• Clustering Partition Method: it works on a division method, where partitions of the data set are created. These partitions are a predefined number of non-empty sets. This is suitable for small datasets.
• Centroid Cluster Method: this model revolves around a centre element of the dataset. The data points closest to the centre data point (the centroid) are considered to form a cluster. The k-means clustering algorithm is the best-known example of such a model (a k-means sketch follows this list).
• Hierarchical Clustering Method: this method describes a tree-based structure of nested clusters. In this method we have clusters based on divisions and their subdivisions in a hierarchy (nested clustering). The hierarchy can be predetermined based upon user choice. Here the number of clusters can remain dynamic and need not be predetermined.
• Density-based Clustering Method: in this method the density of the closest data points is considered to form a cluster. The more close data points there are (the denser the data inputs), the better the cluster formation. The problem here comes with outliers, which are handled in classification (support vector machine) algorithms.
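The sketch below illustrates the centroid (k-means) method named above, using scikit-learn on synthetic two-dimensional points; the cluster count of three and the data are assumptions for this example:

# A minimal sketch of k-means clustering on synthetic 2-D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
points = np.vstack([                       # three loose groups in the plane
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([5, 5], 0.5, size=(50, 2)),
    rng.normal([0, 5], 0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten labels:", labels[:10])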
10.14 ANOVA
ANOVA is an acronym which stands for “ANalysis Of VAriance”. An ANOVA test is a way to find out if survey or experiment results are significant. In other words, they help you to figure out if you need to reject the null hypothesis or accept the alternate hypothesis.
Basically, you’re testing groups to see if there’s a difference between them. Examples of when you might want to test different groups:
• A group of psychiatric patients are trying three different therapies: counselling, medication and biofeedback. You want to see if one therapy is better than the others.
• A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
• Students from different colleges take the same exam. You want to see if one college outperforms the other.
Formula of ANOVA:
F = MST / MSE
where F = ANOVA coefficient,
MST = mean sum of squares due to treatment,
MSE = mean sum of squares due to error.
The ANOVA test is the initial step in analysing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the methodical factors that measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test results in an F-test to generate additional data that aligns with the proposed regression models.
The ANOVA test allows a comparison of more than two groups at the same time to determine whether a relationship exists between them. The result of the ANOVA formula, the F statistic (also called the F-ratio), allows for the analysis of multiple groups of data to determine the variability between samples and within samples.
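A hedged sketch of a one-way ANOVA for the three-therapy example above, using SciPy; the scores are synthetic and the group means are invented:

# A minimal sketch of a one-way ANOVA comparing three therapy groups.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(11)
counselling = rng.normal(60, 8, 30)
medication = rng.normal(65, 8, 30)
biofeedback = rng.normal(62, 8, 30)

f_stat, p_value = f_oneway(counselling, medication, biofeedback)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# A small p-value suggests at least one group mean differs from the others.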
(citations: https://www.investopedia.com/terms/a/anova.asp, https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/)
10.15 PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA is a widely covered method on the web, and there are some great articles about it, but only a few of them go straight to the point and explain how it works without diving too much into the technicalities and the "why" of things. That is why this section presents it in a simplified way.
Before getting to the explanation, note that this section provides logical explanations of what PCA does in each step and simplifies the mathematical concepts behind it, such as standardization, covariance, eigenvectors and eigenvalues, without focusing on how to compute them.
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize and make analysing data much easier and faster for machine learning algorithms, without extraneous variables to process. So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set, while preserving as much information as possible.
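As an illustrative sketch with scikit-learn (the 5-dimensional data set is random, and standardising before PCA is a choice made for this example, not a step prescribed by the source):

# A minimal sketch of PCA: reduce 5 synthetic variables to 2 components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)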
(Citation: https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
10.16 DECISION TREES
A decision tree represents classification. Decision tree learning is one of the most promising techniques for supervised classification learning. Since it is a decision tree, it is meant to take decisions, and being a learning decision tree, it trains itself and learns from the experience of a set of input iterations. These input iterations are also known as "input training sets" or "training set data".
Decision trees predict the future based on previous learning and input rule sets. A decision tree takes multiple input values and returns the probable output as a single value, which is considered the decision. The inputs/outputs can be continuous as well as discrete. A decision tree takes its decision based on the defined algorithms and rule sets.
For example, suppose you want to take a decision to buy a pair of shoes. We start with a few questions:
1. Do we need one?
2. What would be the budget?
3. Formal or informal?
4. Is it for a special occasion?
5. Which colour suits me better?
6. Which would be the most durable brand?
7. Shall we wait for a special sale or just buy one now, since it's needed?
More questions along these lines give us choices to select from. This prediction works on classification, where the possible outputs are classified and the possibility of occurrence is decided on the basis of the probability of that particular output occurring.
Example:
The figure above shows how a decision needs to be taken in a weather-forecast scenario where the day is specified as Sunny, Cloudy or Rainy. Depending upon the metrics received by the algorithm, it will take the decision. The metrics could be humidity, sky visibility and others. We can also see that the cloudy situation has two possibilities, partially cloudy and dense clouds, where having partial clouds is also a subset of a sunny day. Such occurrences make the decision tree bivalent.
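A hedged sketch of such a tree with scikit-learn; the weather features (humidity, sky visibility), their encoding and the tiny training set are all invented for this example:

# A minimal sketch of a decision-tree classifier for the weather example.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [humidity %, sky visibility 0-10]; labels: 0=Sunny, 1=Cloudy, 2=Rainy
X = [[30, 9], [35, 8], [60, 5], [65, 4], [85, 2], [90, 1]]
y = [0, 0, 1, 1, 2, 2]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["humidity", "visibility"]))  # learned rules
print(tree.predict([[70, 3]]))   # classify a new day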
10.17 SUPPORT VECTOR MACHINES
A support vector machine is an algorithm used for classification and is an example of a supervised learning algorithm. It classifies the inputs received on the basis of a rule set. It also works on regression problems.
Classification is needed to differentiate two or more sets of similar data. Let us understand how it works.
Scene one:
The scene above shows A, B and C as three line segments creating hyperplanes by dividing the plane. The graph shows two kinds of inputs, circles and stars, which could come from two classes. Looking at the scenario, we can say that A is the line segment dividing the two half-planes and separating the two different input classes.
Scene two:
In scene two we can see another rule: the hyperplane that cuts the classes into the better halves is chosen. Hence, hyperplane C is the best choice of the algorithm.
Scene three:
Here in scene three, we see one circle overlapping hyperplane A; hence, according to rule 1 of scene one, we choose B, which cuts the co-ordinates into two better halves.
Scene four:
Scene four shows one hyperplane dividing the two better halves, but there is one extra circle co-ordinate in the other half-plane. We call this an outlier, which is generally discarded by the algorithm.
Scene five:
Scene five shows another strange scenario, where we have co-ordinates in all four quadrants. In this scenario we fold along the x-axis, cut the y-axis into two halves, and transfer the stars and circles to one side of the quadrant to simplify the solution. The representation is shown below:
This again gives us a chance to divide the two classes into two better halves using a hyperplane. In the above scenario we have scooped the stars out of the circle co-ordinates and shown them in a different hyperplane.
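A hedged sketch of a linear support vector machine with scikit-learn, with two synthetic classes standing in for the circles and stars of the scenes above; the data and the C value are assumptions of this example:

# A minimal sketch of an SVM separating two synthetic classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
circles = rng.normal([0, 0], 0.7, size=(50, 2))
stars = rng.normal([3, 3], 0.7, size=(50, 2))

X = np.vstack([circles, stars])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel finds the separating hyperplane; C controls how strictly
# outliers are penalised (large C = stricter, small C = more tolerant).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Hyperplane coefficients:", clf.coef_, clf.intercept_)
print("Prediction for a new point:", clf.predict([[1.5, 1.5]]))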
10.18 Networks, Clusters, and Grids
10.19 Data Mining
10.20 Pattern Recognition
10.21 Machine Learning
10.22 Bagging Data
10.23 Random Forests
10.24 Computer Vision (CV)
10.25 Natural Language Processing (NLP)
10.26 Neural Networks
Artificial Neural Networks: the term is quite fascinating when any student starts learning it. Let us break down the term and know its meaning.
Artificial = means “man made”,
Neural = comes from the term "neurons" in the brain, the complex structure of nerve cells that keeps the brain functioning. Neurons are a vital part of the human brain, handling everything from simple input/output to complex problem solving.
Network = a connection of two entities (in our case "neurons", and not just two but millions of them).
What is a Neural Network?:
A neural network is a network of nerve cells in the brain. There are about 100 billion neurons in our brain. Let us learn a few more facts about the human brain.
We all carry a brain of roughly 1,200-1,500 grams. For those who think they have a tiny brain, let me share a few facts about it:
• It has a complex structure of neurons (nerve cells) with a lot of grey matter, which is essential to keep you working well.
• There are around 100 billion neurons in our brain, which keeps our brain and body functioning by constantly transferring data 24×7 till we are alive.
• The data transfer is achieved by the exchange of electrical or chemical signals with the help of synapses ( junction between two neurons).
• It is estimated that they exchange on the order of 1,000 trillion synaptic signals per second, sometimes compared to a processor handling roughly a trillion bits per second.
• The energy generated by this synaptic signalling is said to be roughly enough to light a small 5-volt bulb.
• By some estimates, a human brain can store up to about 1,000 terabytes of data.
• The information transfer happens with the help of this synaptic exchange.
• It is often claimed that the brain's neurons are effectively renewed over a period of about seven years, and that information which is never recalled fades over that time, because each synaptic exchange loses a little energy and therefore a little information.
• By analogy with these neurons, computer scientists build complex artificial neural networks using arrays of logic gates; XOR gates are often cited as the preferred building block.
• An artificial "brain" can store information without forgetting the way a human brain does, and it can store far more information than an individual brain. That capability has a side effect: forgetting can be useful, and a system that retains every memory, negative and positive, carries its own costs.
There are many assumptions here, each with a probable outcome, and we should look at the positive side. Artificial neural networks (ANNs) are very useful for solving complex problems and for decision making. An ANN is an artificial representation of a human brain that tries to simulate various functions, such as learning, calculating, understanding and decision making. It has not yet reached exact human-brain-like function; it is a connection of logic gates that uses a mathematical computational model to work and produce output.
In 1943, Warren McCulloch and Walter Pitts modelled an artificial neuron to perform computation. They did this by building the neuron from a logic gate.
Here, the neuron is actually a processing unit: it calculates the weighted sum of the input signals to the neuron to generate the activation signal a, given by
a = x1w1 + x2w2 + ... + xnwn
Another representation showing detailed multiple neurons working.
Here it shows that the inputs of all neurons are combined with their weights. The weighted sum of all the inputs, x1w1 + x2w2 + x3w3 + ... + xnwn, where x represents the input signals and w represents the weights, is taken as the output of the equation for a.
These neurons are connected in a long logical network to create polynomial function(s), so that multiple complex problems can be calculated. In the architecture, one more element needs to be added: a threshold.
The threshold defines the limits of the model. It is denoted by theta (Θ) in the neural network model and is added to or subtracted from the output, depending upon the model definition.
This theta defines additional limits, acting as a filter on the inputs, with the help of which we can filter out unwanted signals and focus on the needed ones. Another fact about theta is that its value is dynamic according to the environment. For instance, it can be understood as the + or - tolerance value in semiconductors/resistors.
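A hedged sketch of such a thresholded neuron in NumPy; the inputs, weights and theta value are invented for illustration:

# A minimal sketch of a McCulloch-Pitts-style neuron: the weighted sum of
# the inputs is compared with a threshold theta.
import numpy as np

def neuron(x, w, theta):
    """Fire (return 1) if the weighted sum of the inputs reaches the threshold."""
    a = np.dot(x, w)          # a = x1*w1 + x2*w2 + ... + xn*wn
    return 1 if a >= theta else 0

x = np.array([1, 0, 1])        # input signals
w = np.array([0.5, 0.4, 0.3])  # weights
theta = 0.7                    # threshold

print("weighted sum:", np.dot(x, w))
print("output:", neuron(x, w, theta))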
10.27 TENSORFLOW
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. To know more visit https://www.tensorflow.org/ and study further.
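As a hedged sketch, assuming TensorFlow 2.x is installed, a tiny Keras network might look like this; the random data, layer sizes and training settings are illustrative choices only:

# A minimal sketch of a small neural network in TensorFlow/Keras.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")   # a synthetic yes/no target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)

print(model.predict(X[:3]))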
ORGANIZE AND REPORT SUPERSTEPS
Organize Superstep, Report Superstep, Graphics, Pictures, Showing the Difference
(citation: from the book: Practical Data Science by Andreas François Vermeulen)
11.1 ORGANIZE SUPERSTEP
The Organize superstep takes the complete data warehouse you built at the end of the Transform superstep and subsections it into business-specific data marts. A data mart is the access layer of the data warehouse environment built to expose data to the users. The data mart is a subset of the data warehouse and is generally oriented to a specific business group.
Horizontal Style:
Performing horizontal-style slicing or subsetting of the data warehouse is achieved by applying a filter technique that forces the data warehouse to show only the data for a specific preselected set of filtered outcomes against the data population. The horizontal-style slicing selects the subset of rows from the population while preserving the columns.
That is, the data science tool can see the complete record for the records in the subset of records.
Vertical Style:
Performing vertical-style slicing or subsetting of the data warehouse is achieved by applying a filter technique that forces the data warehouse to show only the data for specific preselected filtered outcomes against the data population. The vertical-style slicing selects the subset of columns from the population, while preserving the rows.
That is, the data science tool can see only the preselected columns from a record for all the records in the population.
Island Style:
Performing island-style slicing or subsetting of the data warehouse is achieved by applying a combination of horizontal- and vertical-style slicing. This generates a subset of specific rows and specific columns reduced at the same time.
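As an illustrative sketch with pandas (the "data warehouse" table, its columns and the filter conditions are invented for this example), the three slicing styles above might look like this:

# A minimal sketch of horizontal, vertical and island slicing with pandas.
import pandas as pd

warehouse = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "country": ["UK", "UK", "DE", "FR"],
    "revenue": [100, 250, 175, 90],
    "segment": ["Retail", "Corporate", "Retail", "Retail"],
})

# Horizontal style: a subset of rows, all columns.
horizontal = warehouse[warehouse["country"] == "UK"]

# Vertical style: a subset of columns, all rows.
vertical = warehouse[["customer", "revenue"]]

# Island style: a subset of rows and columns at the same time.
island = warehouse.loc[warehouse["segment"] == "Retail", ["customer", "revenue"]]

print(horizontal, vertical, island, sep="\n\n")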
Secure Vault Style:
The secure vault is a version of one of the horizontal, vertical, or island slicing techniques, but the outcome is also attached to the person who performs the query.
This is common in multi-security environments, where different users are allowed to see different data sets.
This process works well, if you use a role-based access control (RBAC) approach to restricting system access to authorized users. The security is applied against the “role,” and a person can then, by the security system, simply be added or removed from the role, to enable or disable access.
The security in most data lakes I deal with is driven by an RBAC model that is an approach to restricting system access to authorized users by allocating them to a layer of roles that the data lake is organized into to support security access.
It is also possible to use a time-bound RBAC that has different access rights during office hours than after hours.
Association Rule Mining:
Association rule learning is a rule-based machine-learning method for discovering interesting relations between variables in large databases, similar to the data you will find in a data lake. The technique enables you to investigate the interaction between data within the same population.
This example I will discuss is also called “market basket analysis.” It will investigate the analysis of a customer’s purchases during a period of time.
The new measure you need to understand is called "lift." Lift is simply estimated by the ratio of the joint probability of two items x and y, divided by the product of their individual probabilities:
Lift(x, y) = P(x, y) / (P(x) P(y))
If the two items are statistically independent, then P(x,y) = P(x)P(y), corresponding to Lift = 1, in that case. Note that anti-correlation yields lift values less than 1, which is also an interesting discovery, corresponding to mutually exclusive items that rarely co-occur.
You will require the following additional library: conda install -c conda-forge mlxtend.
The general algorithm used for this is the Apriori algorithm for frequent item set mining and association rule learning over the content of the data lake. It proceeds by identifying the frequent individual items in the data lake and extends them to larger and larger item sets, as long as those item sets appear satisfactorily frequently in the data lake.
The frequent item sets determined by Apriori can be used to determine association rules that highlight common trends in the overall data lake. I will guide you through an example.
Start with the standard ecosystem.
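A hedged sketch of the market basket analysis described above, using the mlxtend library mentioned earlier; the transactions, support threshold and lift cut-off are invented for illustration:

# A minimal sketch of Apriori and association rules with mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])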
(citation: from the book: Practical Data Science by Andreas François Vermeulen)
11.2 Report Superstep
The Report superstep is the step in the ecosystem that enhances the data science findings with the art of storytelling and data visualization. You can perform the best data science, but if you cannot execute a respectable and trustworthy Report step by turning your data science into actionable business insights, you have achieved no advantage for your business.
Summary of the Results:
The most important step in any analysis is the summary of the results. Your data science techniques and algorithms can produce the most methodically, most advanced mathematical or most specific statistical results to the requirements, but if you cannot summarize those into a good story, you have not achieved your requirements.
Understand the Context:
What differentiates good data scientists from the best data scientists are not the algorithms or data engineering; it is the ability of the data scientist to apply the context of his findings to the customer.
Appropriate Visualization:
It is true that a picture tells a thousand words, but in data science you want your visualizations to tell only one story: the findings of the data science you prepared. It is absolutely necessary to ensure that your audience gets your most important message clearly and without any other interpretations.
Practice with your visual tools and achieve a high level of proficiency. I have seen numerous data scientists lose the value of great data science results because they did not perform an appropriate visual presentation.
Eliminate Clutter:
Have you ever attended a presentation where the person has painstakingly prepared 50 slides to feedback his data science results? The most painful image is the faces of the people suffering through such a presentation for over two hours.
11.3 GRAPHICS, PICTURES
Graphic visualisation is the most important part of data science. Hence, plotting graphical representations using Python's matplotlib and similar data-visualisation libraries is prominent and useful. Try using these libraries to plot the following (a combined sketch follows the list):
• Pie Graph
• Double Pie
• Line Graph
• Bar Graph
• Horizontal Bar Graph
• Area graph
• Scatter Graph
And so forth.
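As an illustrative sketch with matplotlib (the labels and values are invented), a few of the plot types above can be produced as follows:

# A minimal sketch of a pie, line, bar and scatter graph with matplotlib.
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
values = [15, 30, 45, 10]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].pie(values, labels=labels)      # pie graph
axes[0, 0].set_title("Pie")

axes[0, 1].plot([1, 2, 3, 4], values)      # line graph
axes[0, 1].set_title("Line")

axes[1, 0].bar(labels, values)             # bar graph
axes[1, 0].set_title("Bar")

axes[1, 1].scatter([1, 2, 3, 4], values)   # scatter graph
axes[1, 1].set_title("Scatter")

fig.tight_layout()
fig.savefig("basic_plots.png")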
Channels of Images
The interesting fact about any picture is that every image is a complex data set in itself.
Pictures are built using many layers or channels that assist the visualization tools in rendering the required image.
Open your Python editor, and let’s investigate the inner workings of an image.
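A hedged sketch of that investigation, assuming matplotlib is available; "sample.png" is a placeholder path for any RGB image you have on disk:

# A minimal sketch of inspecting the channels of an image.
import matplotlib.image as mpimg

img = mpimg.imread("sample.png")       # shape: (height, width, channels)
print("Image shape:", img.shape)

red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]
print("Red channel mean:", red.mean())
print("Green channel mean:", green.mean())
print("Blue channel mean:", blue.mean())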