Modified on 30 Dec 2022 07:06 pm
Machine learning algorithms learn from data and help make predictions or classifications.
Depending upon whether the input data is labelled or not, we can classify algorithms as supervised or unsupervised. Apart from this, we also have reinforcement learning, where the algorithm gets feedback based on its predictions and learns from that feedback. A typical example of supervised learning would be labelling all cat pictures as cats and all dog pictures as dogs and asking the machine to learn from them. In contrast, for unsupervised learning, the cat and dog pictures are given to the machine to learn from by itself. Finally, in reinforcement learning, the machine gets positive feedback every time a cat is identified as a cat and negative feedback for a wrong identification. Based on this, the model learns and trains itself.
In the case of regression algorithms, we intend to find a mathematical relationship between the input and the output: a function that takes the inputs as variables and predicts the output.
Some of the well-known regression models are
In least squares regression, we fit a best-fit line that minimizes the sum of the squared distances between the line and the data points.
Here we fit a linear or a polynomial function through the data points that explains the dataset well.
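As a quick illustration of fitting by least squares, here is a minimal sketch using NumPy and synthetic data (both are assumptions, not part of the original article); a higher `deg` would fit a polynomial instead of a straight line:

```python
import numpy as np

# Synthetic data: y is roughly a line with some noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# deg=1 gives the least-squares straight line; deg=3 would give a cubic fit
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
```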
Here the output data is discrete, while the input could be continuous. In such a scenario, we use logistic regression. A classic example of this is the breast cancer dataset.
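A minimal logistic regression sketch on the breast cancer dataset mentioned above, assuming scikit-learn is available (the library choice and the train/test split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # binary labels: malignant vs benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)          # higher max_iter helps convergence
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```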
Here the variables are added stepwise, and statistical tests are performed to check their significance. The variables that pass the significance test are selected for further modelling.
Multivariate regression examines the relationship between the dependent and independent variables. Essentially, it describes the behavior of the response variable in terms of the predictor variables.
Here the data is first plotted using a scatter plot, and regions where different clusters form are identified. Within each cluster, a line that fits the local data is fitted.
Here a data point is classified based on its k nearest neighbors; the value of k is decided by the user.
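A minimal k-nearest neighbors sketch, assuming scikit-learn and the Iris dataset purely for illustration (k = 5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k (n_neighbors) is chosen by the user; 5 is an illustrative value
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```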
Learning vector quantization (LVQ) is a two-layer artificial neural network that adjusts its weights as the learning happens. Here the learning is via competition rather than feedback corrections.
The self-organizing map is similar to LVQ, except that it is unsupervised.
Locally weighted learning is a family of methods that predict the output for a given input using a local model built around that input.
Here the data is separated using hyperplanes, and a new point is classified or regressed using its distance from the hyperplane. Based on regularization, we have the following set of algorithms. Regularization is needed so that we do not end up with high variance or high bias; usually, the model is penalized for complexity to avoid such scenarios.
In lasso regression, we add to the cost function a factor of lambda times the L1 norm, which is nothing but the sum of the absolute deviations.
In ridge regression, we add to the cost function a factor of lambda times the L2 norm, which is nothing but the sum of the squared deviations.
In elastic net, the regularization is done by combining both the L1 and L2 norms.
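A minimal sketch of the three regularized models above, assuming scikit-learn and synthetic data; `alpha` plays the role of lambda, and `l1_ratio` mixes the two penalties:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

models = {
    "lasso (L1)": Lasso(alpha=0.1),
    "ridge (L2)": Ridge(alpha=0.1),
    "elastic net (L1 + L2)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    # The L1 penalty tends to drive some coefficients exactly to zero
    print(name, "non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```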
Least-angle regression is similar to stepwise regression and plays a vital role when many attributes have to be considered.
Decision trees are constructed with a node-and-branch structure. The tree keeps growing branches until the data at a node becomes pure; at that stage, the node is called a leaf. By purity, we mean that all the data points reaching that node are similar. An example would be a class containing girls and boys, with a node asking: what is the gender? This has two answers, boy and girl, and any data point (i.e., any student) is classified into one of the two. When we check the boys' node, all its data points are boys, so in that sense it is pure; similarly for the girls' node.
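A minimal decision tree sketch, assuming scikit-learn and the Iris dataset for illustration; `export_text` prints the node/branch/leaf structure described above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth limits how far the branches grow before leaves are declared
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # text view of the nodes, branches and leaves
```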
In classification and regression trees (CART), the classification is done based on the Gini impurity index, and the splits are mostly binary.
Algorithms such as ID3 and C4.5 make the decision by using entropy or the information gain.
Here the number of classes can be more than two; this is used more for descriptive analysis.
A decision stump is a decision tree with just one decision-making node.
This model can also be used for regression: in the leaf nodes, functions take the input and predict the values.
There may be scenarios where a clean split cannot be made, and in those scenarios we use a continuous variable decision tree. This is also called a regression tree, because the decision at one node depends on decisions taken elsewhere.
In all the previous variants, a node was selected based on entropy or information gain. Here, however, node selection is done by conducting a series of non-parametric tests.
The next type of algorithm we study is the Bayesian algorithms. Bayesian algorithms apply Bayes' theorem, which in its naive form requires the assumption that the input variables are independent of one another.
Some of the most popular Bayesian algorithms are listed below.
Naive Bayes is mostly used for high-dimensional datasets, with the assumption that the various features are independent. Probabilities are calculated, and the classification is made accordingly.
It is similar to Naive Bayes, except that here we assume that the input features follow a Gaussian distribution.
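A minimal Gaussian Naive Bayes sketch, assuming scikit-learn and the breast cancer dataset for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Features are assumed independent and Gaussian within each class
nb = GaussianNB()
nb.fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
```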
In this variant, we build frequency histograms depending on whether the classification is binary or involves more than two classes.
These algorithms group members based on the structure in the data; the data is organized into groups based on maximum commonality and similarity.
Some of the popular algorithms are
In the K-means algorithm, the data is divided into K regions. Essentially, K central points (centroids) are identified, and each new data point is assigned to a group based on its proximity to a centroid. The centroids are calculated by taking means.
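A minimal K-means sketch, assuming scikit-learn and synthetic blob data (three clusters is an illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three blobs, purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # group assignment for each point
print("centroids:\n", kmeans.cluster_centers_)
```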
In K-medians, the centroids are calculated based on the medians instead.
This algorithm has two steps: an estimation (expectation) step and an optimization (maximization) step. In the first step, the missing variables are estimated, and in the second step, the model's parameters are chosen to maximize the likelihood.
In hierarchical clustering, the data is clustered into various groups such that the data points within a group are similar. The difference between hierarchical clustering and K-means clustering is that in the latter, the number of groups is decided beforehand. There are two ways this algorithm can work: agglomerative and divisive. Agglomerative is a bottom-up approach, while divisive is top-down. The agglomerative approach eventually merges the entire dataset into one cluster: first, a cluster is formed from nearby points; the cluster then grows by including the next nearest points, and this process repeats until all the points are brought into one big cluster.
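A minimal agglomerative (bottom-up) clustering sketch on the same kind of synthetic data, again assuming scikit-learn:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Bottom-up merging of nearby points; the number of clusters can also be left
# open by specifying distance_threshold instead of n_clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```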
In these algorithms, relationships between variables are uncovered, and rules that explain them are mined from the data. These rules can then be used for making predictions.
In the Apriori algorithm, association rules are learned between items or transactions. For instance, in shopping, shopkeepers are always interested to know whether buyers who buy item A also buy item B; if they do, A is kept near B. These kinds of association rules are mined from the dataset.
In this algorithm, association rules are mined between various transaction ID sets. This is more efficient than the Apriori algorithm.
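To make the idea of an association rule concrete, here is a small plain-Python sketch (the transactions and items are made up) that computes the support and confidence of the rule "bread -> butter":

```python
# Toy transactions; the items and numbers are purely illustrative
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule "bread -> butter": how often butter is bought when bread is bought
sup_both = support({"bread", "butter"})
confidence = sup_both / support({"bread"})
print(f"support={sup_both:.2f}, confidence={confidence:.2f}")
```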
These are inspired by the structure of neurons in our brains. Neurons are interconnected with one another, and while training the model, the weights of the interconnections are constantly adjusted.
A perceptron is a simple neuron model with one node and a binary output; the inputs can be many.
A multi-layer perceptron is a fully connected neural network with three layers: input, hidden and output. If there is more than one hidden layer, it becomes a deep artificial neural network.
In backpropagation, the errors in classification or prediction are propagated backwards from the output to the input, and the weights are adjusted accordingly.
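A minimal multi-layer perceptron sketch, assuming scikit-learn and its digits dataset for illustration; the weights are fitted by backpropagating the error with stochastic gradient descent:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 units; solver="sgd" updates weights by
# backpropagation with stochastic gradient descent on mini-batches
mlp = MLPClassifier(hidden_layer_sizes=(32,), solver="sgd",
                    learning_rate_init=0.01, max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```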
In stochastic gradient descent, gradients are calculated on a part of the data, and different points from the dataset are used for the next iteration.
A Hopfield network is a fully interconnected neural network; that is to say, every neuron is connected to every other. This network is used to learn associations.
This is a three-layer feed-forward neural network: the first layer is the input layer, the second is the hidden layer with an activation unit, and the last is the output layer. The activation unit mainly consists of Gaussian functions.
These are extensions of ANNs. Some of the important algorithms in this category are
Convolutional neural networks are mainly used for classifying images. The images are stored as arrays of pixels, and these input arrays are multiplied (convolved) with another array called the kernel or filter. The size of the kernel need not be the same as that of the input. Features are extracted via this process of convolution.
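A minimal CNN sketch, assuming the Keras API and 28x28 grayscale images with 10 classes (the framework, shapes and layer sizes are illustrative assumptions, not taken from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal CNN for 28x28 grayscale images and 10 classes
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, kernel_size=(3, 3), activation="relu"),  # 3x3 kernels (filters)
    layers.MaxPooling2D(pool_size=(2, 2)),                     # downsample feature maps
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                    # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```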
In a recurrent neural network, the neurons can send signals to one another in any direction, including back to earlier layers. This makes it well suited to analysing temporal or sequential data.
An LSTM is much like an RNN, except that it can retain information over long sequences. It consists of a cell, an input gate, a forget gate and an output gate; the three gates control the flow of information so that the important parts of a message are stored and used for further processing.
These are used to reduce the dimension of the data. A non-linear function describes the relationship between the input and the output, and the features are captured automatically.
An autoencoder contains three parts: the encoder, the bottleneck and the decoder. The encoder picks out the most important features, and the decoder tries to reconstruct the original information from them. Multiple autoencoders working together form a stacked autoencoder.
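A minimal autoencoder sketch, again assuming Keras; the layer sizes and the synthetic data are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 20-dimensional inputs squeezed through a 3-unit bottleneck
autoencoder = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(8, activation="relu"),     # encoder
    layers.Dense(3, activation="relu"),     # bottleneck: the compressed features
    layers.Dense(8, activation="relu"),     # decoder
    layers.Dense(20, activation="linear"),  # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.default_rng(0).random((500, 20))   # synthetic data
autoencoder.fit(X, X, epochs=5, verbose=0)       # the target is the input itself
```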
In a deep Boltzmann machine, all neurons are connected to each other, and the connections are multi-directional; the number of connections grows rapidly as neurons are added.
Deep belief networks arise when we stack multiple restricted Boltzmann machines.
Here larger, higher-dimensional data is reduced to a smaller representation by using dimensionality reduction based on the inherent structure of the data. This technique aids in visualising or simplifying the data.
In principal component analysis, rotations are performed in the higher-dimensional space to reduce the number of dimensions, which reduces the complexity of the problem. A classic example: the movement of a piece of chalk on a board can be tracked with a camera, giving the position of the chalk in the x, y and z directions. However, since the chalk moves on the board, the motion is restricted to a plane, so performing PCA would reduce the dimension from 3 to 2.
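A minimal PCA sketch mirroring the chalk example, assuming scikit-learn and synthetic 3-D points that lie (almost) on a plane:

```python
import numpy as np
from sklearn.decomposition import PCA

# Points that live almost on a tilted plane in 3-D, like chalk moving on a board
rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 200))
points = np.column_stack([u, v, 0.5 * u + 0.2 * v + rng.normal(scale=0.01, size=200)])

pca = PCA(n_components=3)
pca.fit(points)
print("explained variance ratio:", pca.explained_variance_ratio_)
# The third ratio is near zero, so two components suffice: dimension 3 -> 2
```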
PCR = PCA + LR (linear regression); it works efficiently on multivariate data.
Here the algorithm tries to compress the inputs as much as possible while still predicting y. The difference between PCR and PLSR is that PCR concentrates on X alone, while PLSR considers Y as well.
Mapping from a higher dimension to a lower dimension using gradient descent methods.
Using the kurtosis in the data, projection indices are devised, which help scale the data.
LDA finds a feature subspace and is mostly used in supervised learning. Here there is an inherent assumption that all classes come from a single Gaussian distribution.
This is similar to LDA, with a relaxation of the assumption that all classes come from a single Gaussian distribution.
It is a general model which assumes that each class comes from a Gaussian distribution.
Here a mixture of linear regression models is used for prediction purposes.
Ensemble methods are a combination of multiple models; these models work together to give better accuracy.
In this process, many weak models are combined to make a stronger model. By a weak model, we mean a model that is only slightly better than a random guess, while a strong model's predictions are close to the actual values. Here, parts of the data are sampled and models are trained on them sequentially; each successive model tries to learn from the weaknesses of the previous one, and the weak rules from all of them are combined into a strong one. Boosting is used when there is low variance and high bias. AdaBoost and XGBoost are two very popular techniques.
In bagging, the models run in parallel. Bagging is used when there is high variance and low bias.
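A minimal boosting-versus-bagging sketch, assuming scikit-learn and the breast cancer dataset for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Boosting: weak learners (stumps by default) trained sequentially
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
# Bagging: full trees trained in parallel on bootstrap samples
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

for name, model in [("AdaBoost", boost), ("Bagging", bag)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```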
Feature selection algorithm
This manipulates the input to reduce the noise and get more relevant information to make a prediction.
Algorithm accuracy evaluation
For classification, we can use accuracy, precision, recall, the F1 score, ROC curves and AUC.
For regression, we can use MSE (mean squared error) and MAE (mean absolute error).
Ranking metrics would involve finding MRR, DCG and NDCG.
Correlation is one of the statistical metrics.
PSNR, SSIM and IoU are used for computer vision.
Perplexity and BLEU scores are used for NLP.
The Inception Score and the Fréchet Inception Distance are used for deep learning, especially for generative models.
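A few of the classification and regression metrics listed above can be computed as follows, assuming scikit-learn; the labels and values are made up for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error, mean_absolute_error)

# Classification: made-up true labels and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# Regression: made-up continuous targets and predictions
t_true = [2.5, 0.0, 2.1, 7.8]
t_pred = [3.0, -0.5, 2.0, 8.0]
print("MSE:", mean_squared_error(t_true, t_pred))
print("MAE:", mean_absolute_error(t_true, t_pred))
```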
Author: Navin Baskar, Skill-Lync