Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical machine learning tasks are concept learning, function learning or “predictive modeling”, clustering and finding predictive patterns. These tasks are learned from available data that were observed through experience or instruction, for example. The hope is that incorporating this experience will eventually improve learning, with the ultimate goal of making learning so automatic that humans no longer need to interfere.
In supervised learning (SML), the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of classification, where the output is categorical, and regression, where the output is numerical.
In unsupervised learning (UML), no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
Note that there are also semi-supervised learning approaches that use labelled data to inform unsupervised learning on the unlabelled data to identify and annotate new classes in the dataset (also called novelty detection).
In reinforcement learning, the learning algorithm performs a task using feedback from operating in a real or synthetic environment.
Broadly, there are three types of machine learning algorithms:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
List of Common Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
1. GBM
2. XGBoost
3. LightGBM
4. CatBoost
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of a regression model is to build a mathematical equation that defines y as a function of the x variables. This equation can then be used to predict the outcome (y) on the basis of new values of the predictor variables (x).
Linear regression is the simplest and most popular technique for predicting a continuous variable. It assumes a linear relationship between the outcome and the predictor variables.
The linear regression equation can be written as y = b0 + b*x + e, where:
· b0 is the intercept,
· b is the regression weight or coefficient associated with the predictor variable x,
· e is the residual error.
Technically, the linear regression coefficients are determined so that the error in predicting the outcome value is minimized. This method of computing the beta coefficients is called the Ordinary Least Squares method.
When you have multiple predictor variables, say x1 and x2, the regression equation can be written as y = b0 + b1*x1 + b2*x2 + e. In some situations, there might be an interaction effect between some predictors; that is, for example, increasing the value of predictor x1 may increase the effectiveness of predictor x2 in explaining the variation in the outcome variable.
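As a minimal, hedged illustration (the built-in mtcars data and the predictors wt and hp are choices made here, not taken from the text), such a model can be fitted in R with lm():

# Linear regression with two predictors on the built-in mtcars data
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)                    # coefficients b0, b1, b2 and residual error

# Adding an interaction term (wt * hp expands to wt + hp + wt:hp)
model_int <- lm(mpg ~ wt * hp, data = mtcars)
coef(model_int)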
Note also that linear regression models can incorporate both continuous and categorical predictor variables. When you build a linear regression model, you need to diagnose whether a linear model is suitable for your data. In some cases, the relationship between the outcome and the predictor variables is not linear. In these situations, you need to build a non-linear regression model, such as polynomial or spline regression.
When you have multiple predictors in the regression model, you might want to select the best combination of predictor variables to build an optimal predictive model. This process, called model selection, consists of comparing multiple models containing different sets of predictors in order to select the best-performing model, i.e., the one that minimizes the prediction error. Linear model selection approaches include best subsets regression and stepwise regression.
In some situations, such as in genomics, you might have a large multivariate data set containing correlated predictors. In this case, the information in the original data set can be summarized into a few new variables (called principal components) that are linear combinations of the original variables. These few principal components can be used to build a linear model, which might perform better on your data. This approach is known as principal component-based regression and includes principal component regression and partial least squares regression.
An alternative way to simplify a large multivariate model is to use penalized regression, which penalizes the model for having too many variables. The best-known penalized regression methods are ridge regression and lasso regression.
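As a hedged sketch, penalized regression can be run with the glmnet package (assumed to be installed); the mtcars data and the choice of predictors are illustrative assumptions:

library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg

ridge <- cv.glmnet(x, y, alpha = 0)   # ridge regression with cross-validated lambda
lasso <- cv.glmnet(x, y, alpha = 1)   # lasso regression
coef(lasso, s = "lambda.min")         # coefficients at the best lambda; some may be shrunk to zero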
You can apply all these different regression models to your data, compare them, and finally select the approach that best explains your data. To do so, you need statistical metrics for comparing how well the different models explain your data and predict the outcome of new test data.
The best model is defined as the model that has the lowest prediction error. The most popular metrics for comparing regression models include:
· Root Mean Squared Error (RMSE), which measures the model prediction error. It corresponds to the square root of the average squared difference between the observed values of the outcome and the values predicted by the model; in R, it can be computed as sqrt(mean((observed - predicted)^2)). The lower the RMSE, the better the model.
· Adjusted R-squared, representing the proportion of variation (i.e., information) in your data that is explained by the model. This corresponds to the overall quality of the model. The higher the adjusted R², the better the model.
Note that the above-mentioned metrics should be computed on new test data that has not been used to train (i.e., build) the model. If you have a large data set with many records, you can randomly split the data into a training set (80%, for building the predictive model) and a test or validation set (20%, for evaluating model performance).
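A minimal sketch of such a split, again using the built-in mtcars data and an illustrative model formula as assumptions:

set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = round(0.8 * n))   # 80% of rows for training
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

model     <- lm(mpg ~ wt + hp, data = train)
predicted <- predict(model, newdata = test)

sqrt(mean((test$mpg - predicted)^2))    # test-set RMSE: the lower, the better
summary(model)$adj.r.squared            # adjusted R-squared of the training fit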
One of the most robust and popular approaches for estimating model performance is k-fold cross-validation, which can be applied even to a small data set. k-fold cross-validation works as follows (a minimal R sketch is shown after the steps):
1. Randomly split the data set into k-subsets (or k-fold) (for example 5 subsets)
2. Reserve one subset and train the model on all other subsets
3. Test the model on the reserved subset and record the prediction error
4. Repeat this process until each of the k subsets has served as the test set.
5. Compute the average of the k recorded errors. This is called the cross-validation error and serves as the performance metric for the model.
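The steps above can be written out directly in base R; this is a sketch on the built-in mtcars data with k = 5 (both illustrative assumptions, not values from the text):

set.seed(123)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # step 1: random fold assignment

cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]                 # step 2: train on the other folds
  test  <- mtcars[folds == i, ]                 # step 3: test on the reserved fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))               # prediction error (RMSE) for this fold
})

mean(cv_errors)                                 # step 5: the cross-validation error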
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color or behavior, and divides the data according to the presence and absence of those patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is assigned a cluster ID, which the ML system can use to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar uses are grouped together: t-shirts are in one section and trousers in another, and in the vegetable section, apples, bananas, mangoes, etc. are grouped separately, so that we can easily find what we need. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation system to provide recommendations based on users' past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
As an illustration, imagine different fruits being divided into several groups based on their similar properties; this is exactly what a clustering algorithm does.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (data points can belong to more than one group), but several other approaches to clustering also exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. The cluster centers are chosen so that the data points in a cluster are closer to their own cluster centroid than to any other cluster centroid.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters can be formed as long as the dense regions can be connected. The algorithm identifies the different clusters in the dataset by connecting areas of high density; the dense areas in data space are separated from each other by sparser areas.
Distribution Model-Based Clustering
In this method, the data is divided based on the probability that a data point belongs to a particular distribution. An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that reflect its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
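As a hedged illustration, fuzzy clustering can be run in R with fanny() from the cluster package; the iris measurements and k = 3 are illustrative choices, not part of the original text:

library(cluster)

fc <- fanny(iris[, 1:4], k = 3)
head(fc$membership)     # membership coefficients of each point in each cluster
table(fc$clustering)    # hard assignment derived from the largest membership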
Clustering Algorithms
Clustering algorithms can be divided according to the models described above. Many different clustering algorithms have been published, but only a few are commonly used, and the choice of algorithm depends on the kind of data we are using: some algorithms need the number of clusters in the given dataset to be guessed, whereas others require a minimum distance between the observations of the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into different clusters of equal variance. The number of clusters must be specified in advance. It is fast, requires relatively few computations, and has linear complexity O(n). (A short kmeans() sketch in R follows this list.)
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model similar to the mean-shift, but with some remarkable advantages. In this algorithm, the areas of high density are separated by the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from the other clustering algorithms in that it does not require the number of clusters to be specified. Each pair of data points exchanges messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
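As a minimal sketch of the first algorithm above, k-means can be run with R's built-in kmeans() function; the iris measurements and k = 3 are illustrative assumptions:

set.seed(42)
x  <- scale(iris[, 1:4])                    # standardize the features
km <- kmeans(x, centers = 3, nstart = 25)   # the number of clusters must be specified

km$size                          # number of points in each cluster
table(km$cluster, iris$Species)  # compare clusters with the known species labels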
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears based on the closest object to the search query. It does it by grouping similar data objects in one group that is far from the other dissimilar objects. The accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining what purpose a particular piece of land is most suitable for.
Collaborative Filtering with R
Collaborative filtering is another technique that can be used for recommendation. The underlying concept behind this technique is as follows:
§ Assume Person A likes Oranges, and Person B likes Oranges.
§ Assume Person A also likes Apples.
§ Person B is then more likely to share A's opinion on Apples than some other random person is.
The implications of collaborative filtering are obvious: you can predict and recommend items to users based on preference similarities. There are two types of collaborative filtering: user-based and item-based.
Item-based collaborative filtering uses the similarities between items' consumption histories, while user-based collaborative filtering uses the similarities between users' consumption histories.
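A minimal user-based collaborative filtering sketch: the tiny ratings matrix and the use of cosine similarity between users are assumptions made here for illustration only.

# Hypothetical ratings: rows are users, columns are items, NA = not rated
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5,  2, NA,
                    1, 2,  5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("A", "B", "C"),
                                  c("Oranges", "Apples", "Chips", "Soda")))

# Cosine similarity between two users, computed on co-rated items only
cosine_sim <- function(u, v) {
  idx <- !is.na(u) & !is.na(v)
  sum(u[idx] * v[idx]) / (sqrt(sum(u[idx]^2)) * sqrt(sum(v[idx]^2)))
}

sims <- c(B = cosine_sim(ratings["A", ], ratings["B", ]),   # high: A and B agree
          C = cosine_sim(ratings["A", ], ratings["C", ]))   # lower: A and C disagree

# Predict A's missing rating for "Chips" as a similarity-weighted average
known <- ratings[c("B", "C"), "Chips"]
sum(sims * known) / sum(sims)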
Association Rule Mining
Association Rule Mining is used when you want to find associations between different objects in a set, or frequent patterns in a transaction database, relational database or any other information repository. Applications of Association Rule Mining are found in marketing, basket data analysis (or market basket analysis) in retailing, clustering and classification. It can tell you which items customers frequently buy together by generating a set of rules called Association Rules. In simple words, it gives you output as rules of the form "if this, then that". Clients can use those rules for numerous marketing strategies:
· Changing the store layout according to trends
· Customer behavior analysis
· Catalogue design
· Cross marketing on online stores
· What are the trending items customers buy
· Customized emails with add-on sales
Consider the following example:
Given is a set of transaction data. You can see transactions numbered 1 to 5; each transaction shows the items bought in that transaction. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with Milk in three transactions, making them both frequent item-sets. Association rules are given in the form below:
A => B [Support, Confidence]
The part before => is referred to as the "if" part (antecedent), and the part after => is referred to as the "then" part (consequent).
Where A and B are sets of items in the transaction data. A and B are disjoint sets.
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]
The above rule says:
1. In 20% of transactions, Anti-virus software is bought together with a Computer.
2. 60% of customers who purchase a Computer also buy Anti-virus software.
In the following section you will learn about the basic concepts of Association Rule Mining.
Market Basket Analysis using R
Learn about Market Basket Analysis and the APRIORI algorithm that works behind it. You'll see how it helps retailers boost business by predicting which items customers buy together.
You are a data scientist (or becoming one!), and you get a client who runs a retail store. Your client gives you data for all transactions that consists of items bought in the store by several customers over a period of time and asks you to use that data to help boost their business. Your client will use your findings to not only change/update/add items in inventory but also use them to change the layout of the physical store or rather an online store. To find results that will help your client, you will use Market Basket Analysis (MBA) which uses Association Rule Mining on the given transaction data.
Basic Concepts of Association Rule Mining
1. Support Count: the frequency of occurrence of an item-set.
2. Support (s): the fraction of transactions that contain the item-set X (out of N transactions in total):
Support(X) = frequency(X) / N
For a rule A => B, support is given by:
Support(A => B) = frequency(A, B) / N
Note: P(A∪B) is the probability that a transaction contains both A and B (i.e., the union of the two item-sets); P denotes probability.
Go ahead, try finding the support for Milk=>Diaper as an exercise.
3. Confidence (c): for a rule A => B, confidence shows the percentage of transactions containing A in which B is also bought:
Confidence(A => B) = P(A ∩ B) / P(A) = frequency(A, B) / frequency(A)
The number of transactions with both A and B divided by the total number of transactions having A.
Confidence(Bread => Milk) = 3/4 = 0.75 = 75%
Now find the confidence for Milk=>Diaper.
Note: Support and confidence measure how interesting a rule is. Minimum support and minimum confidence thresholds, set by the client, allow you to compare rule strength according to your own or the client's preferences: rules that meet these thresholds are the ones of use to the client.
4. Frequent Item-sets: item-sets whose support is greater than or equal to the minimum support threshold (min_sup). In the above example min_sup = 3 (a support count); this threshold is set by the user.
5. Strong rules: if a rule A => B [Support, Confidence] satisfies min_sup and min_confidence, it is a strong rule.
6. Lift: lift gives the correlation between A and B in the rule A => B, i.e., how the item-set A affects the item-set B.
Lift(A => B) = Support(A => B) / (Supp(A) * Supp(B))
For example, for the rule {Bread} => {Milk}, lift is calculated as shown below (and reproduced in the R sketch after the bullet points):
support(Bread) = 4/5 = 0.8
support(Milk) = 4/5 = 0.8
Lift(Bread => Milk) = 0.6 / (0.8 * 0.8) ≈ 0.94
· If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
· If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
· If the lift is < 1, then the presence of A has a negative effect on B.
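To make the arithmetic above concrete, here is a short base-R sketch. The five transactions are an assumption: the original table is not reproduced in the text, so they are chosen only to be consistent with the counts quoted (Bread in 4 of 5 transactions, Milk in 4 of 5, Bread and Milk together in 3, Diaper and Beer together in 3).

transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Support of an item-set: fraction of transactions containing all of its items
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

sup_bread      <- support("Bread")              # 0.8
sup_milk       <- support("Milk")               # 0.8
sup_bread_milk <- support(c("Bread", "Milk"))   # 0.6

sup_bread_milk / sup_bread                # confidence(Bread => Milk) = 0.75
sup_bread_milk / (sup_bread * sup_milk)   # lift(Bread => Milk) ~= 0.94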
Goal of Association Rule Mining
When you apply Association Rule Mining on a given set of transactions T your goal will be to find all rules with:
1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence
APRIORI Algorithm
In this part of the tutorial, you will learn about the algorithm that will be running behind R libraries for Market Basket Analysis. This will help you understand your clients more and perform analysis with more attention. If you already know about the APRIORI algorithm and how it works, you can get to the coding part.
Association Rule Mining is viewed as a two-step approach:
1. Frequent Itemset Generation: Find all frequent item-sets with support >= pre-determined min_support count
2. Rule Generation: List all Association Rules from frequent item-sets. Calculate Support and Confidence for all rules. Prune rules that fail min_support and min_confidence thresholds.
Frequent Itemset Generation is the most computationally expensive step because it requires a full database scan.
Above you have seen an example with only 5 transactions, but real-world retail transaction data can run into gigabytes or terabytes, so an optimized algorithm is needed to prune out the item-sets that will not help in later steps. The APRIORI algorithm is used for this.
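In R, the APRIORI algorithm is available through the arules package (assumed to be installed); the file name, format and thresholds below are illustrative placeholders, not values from the text:

library(arules)

# Read transactions in "basket" format: one transaction per line, items comma-separated
trans <- read.transactions("transactions.csv", format = "basket", sep = ",")

# Mine rules that satisfy the chosen minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))

inspect(head(sort(rules, by = "lift"), 5))   # the five rules with the highest lift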
Decision Trees
Decision Trees are a popular data mining technique that uses a tree-like structure to deliver consequences based on input decisions. One important property of decision trees is that they can be used for both regression and classification. This type of method is capable of handling heterogeneous as well as missing data. Decision trees are also capable of producing understandable rules, and classifications can be performed without many computations.
To create a decision tree, you need to follow certain steps:
1. Choosing a Variable
2. Assigning Data to Nodes
3. Pruning the Tree
Common R Decision Trees Algorithms
There are three most common Decision Tree Algorithms:
· Classification and Regression Tree (CART) investigates all kinds of variables.
· C5.0 (developed by J.R. Quinlan) works by aiming to maximize the information gain achieved by assigning each individual to a branch of the tree.
· Chi-Square Automatic Interaction Detection (CHAID) is reserved for the investigation of discrete and qualitative independent and dependent variables.
Applications of Decision Trees
Decision Trees are used in the following areas of applications:
· Marketing and Sales – Decision Trees play an important role in a decision-oriented sector like marketing. In order to understand the consequences of marketing activities, organisations make use of Decision Trees to initiate careful measures. This helps in making efficient decisions that help the company to reap profits and minimize losses.
· Reducing Churn Rate – Banks make use of machine learning algorithms like Decision Trees to retain their customers. It is always cheaper to keep customers than to gain new ones. Banks are able to analyze which customers are more vulnerable to leaving their business. Based on the output, they are able to make decisions by providing better services, discounts as well as several other features. This ultimately helps them to reduce the churn rate.
· Anomaly & Fraud Detection – Industries like finance and banking suffer from various cases of fraud. In order to filter out anomalous or fraud loan applications, information and insurance fraud, these companies deploy decision trees to provide them with the necessary information to identify fraudulent customers.
· Medical Diagnosis – Classification trees identify patients who are at risk of suffering from serious diseases such as cancer and diabetes.
How to Create Decision Trees in R
Decision tree techniques detect criteria for dividing the individual items of a group into a number of predetermined classes, denoted n.
In the first step, the variable for the root node is chosen. This variable should be selected based on its ability to separate the classes efficiently. The operation starts by splitting on this variable into the given classes, which creates subpopulations; the operation is then repeated on each subpopulation until no further separation can be obtained.
A tree in which no node has more than two child nodes is a binary tree. The origin node is referred to as the root node, and the terminal nodes are the leaves of the tree.
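As a hedged sketch of these steps, a classification tree can be grown and pruned in R with the rpart package (rpart.plot, used only for the drawing, is a further assumption); the built-in iris data is an illustrative choice:

library(rpart)
library(rpart.plot)

# Steps 1-2: choose splitting variables and assign data to nodes (done by rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(fit)        # visualise the tree

# Step 3: prune the tree using the complexity table
printcp(fit)
pruned <- prune(fit, cp = 0.05)

predict(pruned, head(iris), type = "class")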
Defining Big Data
Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction. As we noted earlier in this chapter, big data is typically broken down by three characteristics:
✓ Volume: how much data
✓ Velocity: how fast that data is processed
✓ Variety: the various types of data
Although it's convenient to simplify big data into the three Vs, it can be misleading and overly simplistic. For example, you may be managing a relatively small amount of very disparate, complex data, or you may be processing a huge volume of very simple data. That simple data may be all structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense? It is critical that you don't underestimate the task at hand. Data must be able to be verified based on both accuracy and context.
An innovative business may want to be able to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to provide additional offers to that customer. It is necessary to identify the right amount and types of data that can be analyzed to impact business outcomes. Big data incorporates all data, including structured data and unstructured data from e-mail, social media, text streams, and more. This kind of data management requires that companies leverage both their structured and unstructured data.
Installing R packages
To use data files in the formats specified earlier, we don't need to install extra R packages; we just need the built-in functions available with R.
Importing the data into R
To perform analytics-related activities, we need to use the following functions to get the data into R:
• CSV: read.csv() is intended for reading comma separated value (CSV) files, where the field separator is "," and the decimal point is ".". The retrieved data will be stored in one R object, which is considered a data frame.
Dataframe <- read.csv("data.csv", sep = ",")
• TXT: To retrieve tab separated values, the read.table() function is used with the sep parameter, and the return type of this function is also a data frame.
Dataframe <- read.table("data.txt", sep = "\t")
• .RDATA: The .RDATA format is used by R for storing the workspace data at a particular point in time. It is considered an image file and stores/retrieves all of the data available in the workspace.
load("history.RDATA")
• .rda: This is also R's native data format, which stores specific data variables as required.
load("data_variables_a_and_b.rda")
Exporting the data from R
To export an existing data object from R into the supported data file formats, we need to use the following functions:
• CSV: Write the data frame object into a CSV data file via the following command (write.csv() always uses "," as the separator, so no sep argument is needed):
write.csv(mydata, "c:/mydata.csv", row.names = FALSE)
• TXT: Write the data with tab delimiters via the following command:
write.table(mydata, "c:/mydata.txt", sep = "\t")
• .RDATA: To store the workspace data variables of the current R session, use the following command:
save.image()
• .rda: save() is used to store specific data objects that can be reused later. Use the following code to save them to an .rda file:
# column vector a
a <- c(1, 2, 3)
# column vector b
b <- c(2, 4, 6)
# saving them to R's (.rda) data format
save(a, b, file = "data_variables_a_and_b.rda")
Importing the data into R from MySQL
We know how to check MySQL tables and their fields. After identifying the useful data tables, we can import them into R using the following RMySQL commands.
To retrieve custom data from the MySQL database via an SQL query, we need to store the result set in an object:
rs <- dbSendQuery(mydb, "select * from sample_table")
The data-related information can then be retrieved from MySQL into R via the fetch command as follows:
dataset <- fetch(rs, n = -1)
Here, the parameter n = -1 is used to retrieve all pending records.
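For completeness, the connection object mydb used above has to be created first; this is a hedged sketch in which the host, user, password and database name are placeholders rather than values from the text:

library(RMySQL)

mydb <- dbConnect(MySQL(), user = "user", password = "password",
                  dbname = "sample_db", host = "localhost")
dbListTables(mydb)                     # check the available MySQL tables

rs      <- dbSendQuery(mydb, "select * from sample_table")
dataset <- fetch(rs, n = -1)           # retrieve all pending records

dbClearResult(rs)
dbDisconnect(mydb)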