
Tuesday, February 15, 2022

DATA SCIENCE TECHNOLOGY STACK


TRANSFORM SUPERSTEP

9.0 OBJECTIVES

The objective of this chapter is to learn about data transformation, which turns data into knowledge and converts results into insights.

 

9.1 INTRODUCTION

The Transform superstep allows you to take data from the data vault and formulate answers to questions. The transformation step is the data science process that converts results into meaningful insights.

 

9.2 OVERVIEW

To explain this, the scenario below is shown.

Data is categorised into five different dimensions:

1. Time

2. Person

3. Object

4. Location

5. Event

 

9.3 DIMENSION CONSOLIDATION

The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs.

 


9.4 THE SUN MODEL

The use of sun models is a technique that enables the data scientist to perform consistent dimension consolidation by explaining the intended data relationships to the business, without exposing the business to the technical details required to complete the transformation processing.

The sun model is constructed to show all the characteristics from the two data vault hub categories you are planning to extract. It explains how you will create two dimensions and a fact via the Transform step.

 

9.5 TRANSFORMING WITH DATA SCIENCE

9.5.1 Missing value treatment:

You must describe in detail what the missing value treatments are for the data lake transformation. Make sure you take your business community with you along the journey. At the end of the process, they must trust your techniques and results. If they trust the process, they will implement the business decisions that you, as a data scientist, aspire to achieve.

 

Why is missing value treatment required?

Explain with notes on the data traceability matrix why there is missing data in the data lake. Remember: Every inconsistency in the data lake is conceivably the missing insight your customer is seeking from you as a data scientist. So, find them and explain them. Your customer will exploit them for business value.

 

Why Does Data Have Missing Values?

The 5 Whys is a technique that helps you get to the root cause in your analysis. The use of cause-and-effect fishbone diagrams will assist you in resolving those questions. I have found the following common reasons for missing data:

• Data fields renamed during upgrades

• Migration processes from old systems to new systems where mappings were incomplete

• Incorrect tables supplied in loading specifications by the subject-matter expert

• Data simply not recorded, as it was not available

• Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry

• Someone else’s “bad” data science. People and projects make mistakes, and you will have to fix their errors in your own data science.
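As a minimal sketch, the snippet below shows a few common missing-value treatments using pandas; the DataFrame and its column names are purely hypothetical.

import pandas as pd
import numpy as np

# Hypothetical customer data with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Pune", None, "Mumbai", "Delhi"],
    "spend": [120.0, 85.5, np.nan, 42.0],
})

print(df.isna().sum())                            # count missing values per column
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column with its median
df["city"] = df["city"].fillna("Unknown")         # flag missing categories explicitly
df = df.dropna(subset=["spend"])                  # drop rows where a key measure is missing

Whichever treatment you choose, document it in the data traceability matrix so the business community can follow the journey.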

 

9.6.2 Techniques of outlier detection and Treatment:

 

9.7 HYPOTHESIS TESTING

Hypothesis testing is not precisely an algorithm, but it is a must-know for any data scientist. You cannot progress until you have thoroughly mastered this technique. Hypothesis testing is the process by which statistical tests are used to check whether a hypothesis is true, by using data. Based on hypothesis testing, data scientists choose to accept or reject the hypothesis. When an event occurs, it can be a trend or happen by chance. To check whether the event is an important occurrence or just happenstance, hypothesis testing is necessary. There are many tests for hypothesis testing, but the following two are the most popular: the t-test and the chi-square test.
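As a minimal sketch, a two-sample t-test can be run with SciPy; the two groups below are made-up numbers generated for illustration.

from scipy import stats
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=50)   # hypothetical control group
group_b = rng.normal(loc=104, scale=10, size=50)   # hypothetical treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value below the chosen significance level (e.g. 0.05) leads us to reject
# the null hypothesis that the two group means are equal.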

 

9.8 CHI-SQUARE TEST

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

A chi-square goodness-of-fit test determines whether sample data matches a population distribution.

A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another. A very small chi-square test statistic means that your observed data fits the data expected under independence extremely well; in other words, there is no evidence of a relationship. A very large chi-square test statistic means that the observed data does not fit the expected data well; in other words, there is evidence of a relationship.
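A minimal sketch of the test for independence with SciPy, using a small made-up contingency table (say, gender versus graduation, purely for illustration):

from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = graduated / did not graduate
observed = [[30, 10],
            [25, 15]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
# A large chi2 (small p) suggests the two categorical variables are related.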

 

9.9 UNIVARIATE ANALYSIS

Univariate analysis is the simplest form of analysing data. “Uni” means “one”, so in other words your data has only one variable. It does not deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data and finds patterns in the data.

Univariate analysis is used to identify those individual metabolites which, either singly or multiplexed, are capable of differentiating between biological groups, such as separating tumour-bearing mice from non-tumour-bearing (control) mice. Statistical procedures used for this analysis include the t-test, ANOVA, the Mann–Whitney U test, the Wilcoxon signed-rank test, and logistic regression. These tests are used to individually or globally screen the measured metabolites for an association with a disease.
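A minimal sketch of univariate analysis with pandas, summarising a single made-up variable:

import pandas as pd

# Hypothetical measurements of a single variable
glucose = pd.Series([92, 101, 88, 110, 95, 99, 130, 87], name="glucose")

print(glucose.describe())             # count, mean, std, min, quartiles, max
print("skew:", glucose.skew())        # shape of the distribution
print(glucose.value_counts(bins=3))   # a simple frequency pattern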

 

9.10 BIVARIATE ANALYSIS

Bivariate analysis is when two variables are analysed together for any possible association or empirical relationship, for example, the correlation between gender and graduation with a data science degree. Canonical correlation in the experimental context takes two sets of variables and examines what is common between the two sets. Graphs that are appropriate for bivariate analysis depend on the type of variable. For two continuous variables, a scatterplot is a common graph. When one variable is categorical and the other continuous, a box plot is common, and when both are categorical, a mosaic plot is common.
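A minimal sketch for two continuous variables, assuming pandas and matplotlib are available; the data is made up.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical continuous variables: hours studied versus exam score
df = pd.DataFrame({
    "hours": [2, 4, 5, 7, 8, 10],
    "score": [51, 58, 62, 71, 75, 86],
})

print(df["hours"].corr(df["score"]))   # Pearson correlation coefficient
df.plot.scatter(x="hours", y="score")  # scatterplot for two continuous variables
plt.show()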

 

9.11 MULTIVARIATE ANALYSIS

A single metabolite biomarker is generally insufficient to differentiate between groups. For this reason, a multivariate analysis, which identifies sets of metabolites (e.g., patterns or clusters) in the data, can result in a higher likelihood of group separation. Statistical methods for this analysis include unsupervised methods, such as principal component analysis (PCA) or cluster analysis, and supervised methods, such as linear discriminant analysis (LDA), partial least squares (PLS), PLS discriminant analysis (PLS-DA), artificial neural networks (ANN), and machine learning methods. These methods provide an overview of a large dataset that is useful for identifying patterns and clusters in the data and expressing the data to visually highlight similarities and differences. Unsupervised methods may reduce potential bias, since the classes are unlabelled.
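A minimal sketch of one unsupervised method, PCA, with scikit-learn; the 6-by-4 matrix of "measured metabolites" is entirely made up.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 6 samples x 4 measured metabolites
X = np.array([
    [1.2, 0.7, 3.1, 0.2],
    [1.0, 0.9, 2.9, 0.1],
    [0.4, 2.1, 1.0, 1.8],
    [0.5, 2.3, 0.8, 1.9],
    [1.1, 0.8, 3.0, 0.3],
    [0.6, 2.0, 1.1, 1.7],
])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # project samples onto the first two components
print(pca.explained_variance_ratio_)   # variance captured by each component
print(scores)                          # group separation often becomes visible here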

Regardless of one's choice of method for statistical analysis, it is necessary to subsequently validate the identified potential biomarkers and therapeutic targets by examining them in new and separate sample sets (for biomarkers), and in vitro and/or in vivo experiments evaluating the identified pathways or molecules (for therapeutic targets).

 

9.13 LOGISTIC REGRESSION

Logistic regression is another technique, used to handle binary classification (dichotomous) problems with a regression-style model. Logistic regression is a predictive analysis technique. It can be difficult to interpret, so there are tools available to help. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

It can answer complex but dichotomous questions such as:

• The probability of attending college (YES or NO), given that the syllabus is over and the faculty is interesting but boring at times, depending upon the mood and behaviour of the students in class.

• The probability of finishing the lunch sent by my mother (YES or NO), which depends upon multiple aspects: a) mood, b) better options available, c) food taste, d) scolding by mom, e) a friend's open treat, and so forth.

Hence, logistic regression predicts the probability of an outcome that can only have two values (YES or NO).
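A minimal sketch with scikit-learn, predicting a made-up YES/NO outcome from two illustrative features; the feature names and numbers are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [hours of sleep, interest level 0-10]; target: attended college (1) or not (0)
X = np.array([[8, 9], [7, 8], [5, 3], [6, 2], [8, 7], [4, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[6, 5]]))   # probabilities of NO and YES for a new student
print(model.predict([[6, 5]]))         # predicted class (0 = NO, 1 = YES)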

 

9.14 CLUSTERING TECHNIQUES

Clustering is an unsupervised learning model. Similar to classification, it helps create different sets of classes by grouping similar types together, that is, by identifying clusters of similar items. Clustering is the task of dividing a population into homogeneous groups, which it does by identifying similar or nearby data elements on a graph. In classification, the classes are predefined (with the help of algorithms) and the data inputs are then assigned to them, whereas in clustering the algorithm itself decides the number of clusters from the inputs, depending on their similarity traits. Each set of similar inputs forms a group called a cluster. Clustering is a more dynamic model in terms of grouping.

 

A basic comparison of clustering and classification is given below:

 


We can classify clustering into two categories:

Soft clustering and hard clustering. Let me give one example to explain them. Suppose we are developing a website for writing blogs. A blog belongs to a particular category, such as Science, Technology, Arts or Fiction, and it is possible that a written article belongs or relates to two or more categories. If we restrict the blogger to choosing only one category, we call this the “hard or strict clustering” method, where an item can remain in only one category. Now suppose this work is automated by our piece of code, which chooses categories on the basis of the blog content. If the algorithm chooses exactly one of the given clusters for the blog, it is still called “hard or strict clustering”. In contrast, if the algorithm may select more than one cluster for the blog content, it is called the “soft or loose clustering” method.

 

Clustering methods should satisfy the following important requirements:

• Robustness

• Flexibility

• Efficiency

 

Clustering algorithms/Methods:

There are several clustering algorithms/methods available, of which we will explain a few:

Connectivity Clustering Method: This model is based on the connectivity between the data points. These models are based on the notion that the data points closer in data space exhibit more similarity to each other than the data points lying farther away.

Clustering Partition Method: This works by dividing the data set into partitions, which are predefined non-empty sets. It is suitable for small datasets.

Centroid Cluster Method: This model revolves around a centre element of the dataset. The data points closest to the centre point (centroid) are considered to form a cluster. The K-Means clustering algorithm is the best-known example of such a model (see the sketch after this list).

Hierarchical Clustering Method: This method builds a tree-based structure of nested clusters, with divisions and their sub-divisions arranged in a hierarchy. The hierarchy can be pre-determined by the user, but the number of clusters can also remain dynamic and does not need to be fixed in advance.

Density-based Clustering Method: In this method, regions where data points lie densely together are considered to form a cluster. The denser the data inputs, the better the cluster formation. The problem here comes with outliers, which classification algorithms (such as support vector machines) handle more explicitly.
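As referenced under the centroid method above, here is a minimal K-Means sketch with scikit-learn on made-up two-dimensional points:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two rough groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                     # cluster assignment for each point
print(kmeans.cluster_centers_)            # the two centroids
print(kmeans.predict([[0, 3], [9, 9]]))   # assign new points to the nearest centroid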

9.15 ANOVA

9.16 Principal Component Analysis (PCA)

9.17 Decision Trees

A decision tree represents classification. Decision tree learning is one of the most promising techniques for supervised classification learning. Since it is a decision tree, it is meant to take decisions, and being a learning decision tree, it trains itself and learns from the experience of a set of input iterations. These input iterations are also known as “input training sets” or “training set data”.

Decision trees predict the future based on previous learning and input rule sets. A decision tree takes multiple input values and returns the probable output as a single value, which is considered the decision. The inputs and outputs can be continuous as well as discrete. A decision tree takes its decision based on the defined algorithms and rule sets.

For example, suppose you want to decide whether to buy a pair of shoes. We start with a few questions:

1. Do we need one?

2. What would be the budget?

3. Formal or informal?

4. Is it for a special occasion?

5. Which colour suits me better?

6. Which would be the most durable brand?

7. Shall we wait for a special sale or just buy one now, since it is needed?

 

More questions of this kind give us a choice for selection. This prediction works as classification: the possible outputs are classified, and the possibility of occurrence is decided on the basis of the probability of that particular output occurring.

 

Example:

 


The above figure shows how a decision needs to be taken in a weather forecast scenario where the day is specified as Sunny, Cloudy or Rainy. Depending upon the metrics received by the algorithm, it will take the decision. The metrics could be humidity, sky visibility and others. We can also see that the cloudy situation has two possibilities, partially cloudy and dense clouds, where having partial clouds is also a subset of a sunny day. Such overlapping occurrences make the decision tree ambivalent.
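A minimal sketch of such a classifier with scikit-learn, using made-up weather features (outlook encoded as 0 = sunny, 1 = cloudy, 2 = rainy, plus humidity); the training set is hypothetical.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training set: [outlook, humidity %]; target: 1 = go outside, 0 = stay in
X = [[0, 40], [0, 85], [1, 60], [2, 90], [1, 75], [2, 70]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))  # the learned rule set
print(tree.predict([[0, 50]]))   # decision for a sunny day with 50% humidity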

 

9.18 SUPPORT VECTOR MACHINES

A support vector machine is a supervised learning algorithm used for classification. It classifies the inputs it receives on the basis of a rule set. It also works for regression problems.

Classification is needed to differentiate two or more sets of similar data. Let us understand how it works.

Scene one:

 


The above scene shows A, B and C as three line segments creating hyperplanes by dividing the plane. The graph shows two kinds of inputs, circles and stars, which could belong to two classes. Looking at the scenario, we can say that A is the line segment which divides the plane into two half-planes containing the two different input classes.

 

Scene two:

 


In scene two we can see another rule: the line which separates the two classes with the better margin is preferred. Hence, hyperplane C is the best choice for the algorithm.

 

Scene three:

 


Here in scene three, we see one circle overlapping hyperplane A; hence, following the rule from scene one, we choose B, which cuts the co-ordinates into two better halves.

 

Scene four:


Scene four shows one hyperplane dividing the points into two better halves, but one extra circle co-ordinate lies in the other half. We call this an outlier, which is generally discarded by the algorithm.

 

Scene five:

 


Scene five shows another, stranger scenario, where we have co-ordinates in all four quadrants. In this scenario we fold the x-axis, cut the y-axis into two halves, and transfer the stars and circles onto one side of the quadrant to simplify the solution. The representation is shown below:

 


This again gives us a chance to divide the two classes into two better halves using a hyperplane. In the above scenario we have scooped out the stars from the circle co-ordinates and separated them with a different hyperplane.
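A minimal sketch with scikit-learn's SVC, fitting a separating hyperplane to made-up circle/star points; a non-linear (RBF) kernel plays the role of the "folding" trick described above.

import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points: class 0 = "circles", class 1 = "stars"
X = np.array([[1, 1], [2, 1], [1, 2],
              [6, 6], [7, 5], [6, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)   # the RBF kernel handles non-linearly separable layouts
print(clf.support_vectors_)                # the points that define the margin
print(clf.predict([[2, 2], [6, 5]]))       # classify new points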

9.19 Networks, Clusters, and Grids

9.20 Data Mining

9.21 Pattern Recognition

9.22 Machine Learning

9.23 Bagging Data

9.24 Random Forests

9.25 Computer Vision (CV)

9.26 Natural Language Processing (NLP)

9.27 Neural Networks

 

Artificial Neural Networks; the term is quite fascinating when any student starts learning it. Let us break down the term and know its meaning.

Artificial = means “man-made”,

Neural = comes from the term neurons in the brain, the complex structure of nerve cells which keeps the brain functioning. Neurons are the vital part of the human brain, handling everything from simple input/output to complex problem solving.

Network = A connection of two entities (here in our case “Neurons”, not just two but millions of them).

What is a Neural Network?

A neural network is a network of nerve cells in the brain. There are about 100 billion neurons in our brain. Let us look at a few more facts about the human brain.

Well, we all have about 1,200 g of brain (approximately). (It is another matter that a few seem to have kidney beans inside their skull.) For those who think they have a tiny little brain, let me share certain things with you.

Our brain weighs approximately 1,200-1,500 g.

• It has a complex structure of neurons (nerve cells) with a lot of grey matter, which is essential to keep you working properly.

• There are around 100 billion neurons in our brain, which keep our brain and body functioning by constantly transferring data 24×7 for as long as we are alive.

• The data transfer is achieved by the exchange of electrical or chemical signals with the help of synapses (the junctions between two neurons).

• They exchange about 1,000 trillion synaptic signals per second, which is roughly equivalent to a 1-trillion-bit-per-second computer processor.

• The amount of energy generated by these synaptic signal exchanges is enough to light a 5-volt bulb.

• A human brain can hence store up to 1,000 terabytes of data.

• The information transfer happens with the help of synaptic exchange.

• It takes 7 years to replace every single neuron in the brain. So we tend to forget content after 7 years, since during synaptic exchange there is a loss of energy, which means a loss of information. If we do not recall something for 7 years, that information is completely erased from our memory.

• Similar to these neurons, computer scientists build complex artificial neural networks using arrays of logic gates. The most preferred gates used are XOR gates.

• The best part about an artificial brain is that it can store information without forgetting, unlike the human brain, and it can store much more information than an individual brain. Unfortunately, that has a bigger side effect. Trust me, “forgetting is better”. Certain things in our lives we should forget so we can move forward. Imagine if you had to live with every memory in life, both negative and positive. Things would haunt you, and you would start the journey into psychotic disorders.

Scary, isn't it?

Well, there are many assumptions, each with its probable outcome, and we always need to look at the positive side. Artificial neural networks (ANNs) are very useful for solving complex problems and for decision making. An ANN is an artificial representation of a human brain that tries to simulate its various functions, such as learning, calculating, understanding, decision making and many more. Unfortunately, it does not reach exact human-brain-like function. It is a connection of logical gates which uses a mathematical computational model to work and give an output.

During World War II, in 1943, Warren McCulloch and Walter Pitts modelled an artificial neuron to perform computation. They did this by developing a neuron from a logic gate.

Here, the neuron is actually a processing unit; it calculates the weighted sum of the input signals to the neuron to generate the activation signal a, given by:

a = X1W1 + X2W2 + X3W3 + ... + XnWn = Σ XiWi

 


Another representation shows multiple neurons working in detail.

Here it shows that the inputs of all neurons are combined with their weights. Hence the weighted sum of all the inputs XiWi (X1W1, X2W2, X3W3, ..., XnWn), where X represents the input signals and W represents the weights, is taken as the output of the equation for “a”.

These neurons are connected in a long logical network to create polynomial functions, so that multiple complex problems can be calculated.

In the architecture, one more element needs to be added: a threshold.

The threshold defines the limits of the model. The threshold is denoted as theta (Θ) in the neural network model. It is added to or subtracted from the output, depending upon the model definition.

This theta defines additional limits, acting as a filter on the inputs; with its help we can filter out unwanted signals and focus on the needed ones. Another fact about theta is that its value is dynamic according to the environment. For instance, it can be understood like the + or - tolerance value of semiconductors/resistors.
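A minimal sketch of such a thresholded neuron in NumPy; the inputs, weights and theta below are made-up values.

import numpy as np

def neuron(x, w, theta):
    """Weighted sum of inputs minus the threshold, passed through a step activation."""
    a = np.dot(x, w) - theta       # a = X1*W1 + X2*W2 + ... + Xn*Wn - theta
    return 1 if a >= 0 else 0      # fire (1) only if the activation clears the threshold

x = np.array([1.0, 0.5, 0.2])      # hypothetical input signals X1..X3
w = np.array([0.4, 0.6, 0.9])      # hypothetical weights W1..W3
print(neuron(x, w, theta=0.5))     # prints 1, since 0.4 + 0.3 + 0.18 = 0.88 >= 0.5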

 

9.28 TensorFlow

 

