
Machine Learning with R

Supervised Learning Algorithms

Unsupervised Learning Algorithms

Meta-Learning Algorithms

Supervised Learning Algorithms

Nearest Neighbor Classification

Nearest neighbor classifiers classify unlabeled examples by assigning them the class of the most similar labeled examples. Despite the simplicity of this idea, nearest neighbor methods are extremely powerful.

| Strengths | Weaknesses |
| --- | --- |
| Simple and effective | Does not produce a model, limiting the ability to understand how the features are related to the class |
| Makes no assumptions about the underlying data distribution | Requires selection of an appropriate k |
| Fast training phase | Slow classification phase |
| | Nominal features and missing data require additional processing |
Diagnosing Breast Cancer
k-NN results on the 100-case test set (`CrossTable()` output: counts with row, column, and table proportions):

```
                 | wbcd_test_pred
wbcd_test_labels |    Benign | Malignant | Row Total
-----------------|-----------|-----------|-----------
          Benign |        77 |         0 |        77
                 |     1.000 |     0.000 |     0.770
                 |     0.975 |     0.000 |
                 |     0.770 |     0.000 |
-----------------|-----------|-----------|-----------
       Malignant |         2 |        21 |        23
                 |     0.087 |     0.913 |     0.230
                 |     0.025 |     1.000 |
                 |     0.020 |     0.210 |
-----------------|-----------|-----------|-----------
    Column Total |        79 |        21 |       100
                 |     0.790 |     0.210 |
-----------------|-----------|-----------|-----------
```
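A minimal sketch of the workflow that produces a table like this, assuming the Wisconsin Breast Cancer data in a file named `wbcd.csv` whose `diagnosis` column holds the class labels (file name and column layout are assumptions):

```r
library(class)    # knn()
library(gmodels)  # CrossTable()

# min-max normalization puts every feature on the same 0-1 scale,
# so no single feature dominates the distance calculation
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

wbcd   <- read.csv("wbcd.csv", stringsAsFactors = TRUE)  # assumed file
# assumes all remaining columns are numeric features
wbcd_n <- as.data.frame(lapply(wbcd[names(wbcd) != "diagnosis"], normalize))

wbcd_train        <- wbcd_n[1:469, ]
wbcd_test         <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd$diagnosis[1:469]
wbcd_test_labels  <- wbcd$diagnosis[470:569]

# each test case receives the majority class of its k = 21 nearest neighbors
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
```

Choosing k near the square root of the number of training examples is a common starting point.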

Naive Bayes Classification

Naive Bayes classification descends from the work of the 18th-century mathematician Thomas Bayes, who developed foundational principles for describing the probability of events and how probabilities should be revised in light of additional information. These principles formed the foundation for what are now known as Bayesian methods.

| Strengths | Weaknesses |
| --- | --- |
| Simple, fast, and very effective | Relies on an often-faulty assumption of equally important and independent features |
| Does well with noisy and missing data | Not ideal for datasets with many numeric features |
| Requires relatively few examples for training, but also works well with very large numbers of examples | Estimated probabilities are less reliable than the predicted classes |
| Easy to obtain the estimated probability for a prediction | |
Filtering Mobile Phone Spam

(Figures: word clouds for all words, spam, and not spam.)
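A hedged sketch of a naive Bayes spam filter, assuming an SMS dataset in `sms_spam.csv` with `type` and `text` columns (file name, column names, and split sizes are assumptions):

```r
library(tm)     # corpus handling and document-term matrices
library(e1071)  # naiveBayes()

sms <- read.csv("sms_spam.csv", stringsAsFactors = TRUE)  # assumed file

# standard text cleanup: lowercase, drop numbers, stop words, punctuation
corpus <- VCorpus(VectorSource(sms$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, removePunctuation)

dtm <- DocumentTermMatrix(corpus)

# naive Bayes works with categorical features, so convert word counts
# to Yes/No indicators of whether a word appears in a message
convert_counts <- function(x) {
  factor(x > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
}
train <- apply(dtm[1:4000, ],    MARGIN = 2, convert_counts)  # assumed split
test  <- apply(dtm[4001:5559, ], MARGIN = 2, convert_counts)

# laplace = 1 keeps unseen words from zeroing out class probabilities
classifier <- naiveBayes(train, sms$type[1:4000], laplace = 1)
sms_pred   <- predict(classifier, test)
```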

Decision Trees (Classification)

Decision tree learners are powerful classifiers that use a tree structure to model the relationships among the features and the potential outcomes.

C5.0 Algorithm

| Strengths | Weaknesses |
| --- | --- |
| An all-purpose classifier that does well on most problems | Decision tree models are often biased toward splits on features having a large number of levels |
| Highly automatic learning process, which can handle numeric or nominal features, as well as missing data | It is easy to overfit or underfit the model |
| Excludes unimportant features | Can have trouble modeling some relationships due to reliance on axis-parallel splits |
| Can be used on both small and large datasets | Small changes in the training data can result in large changes to decision logic |
| Results in a model that can be interpreted without a mathematical background (for relatively small trees) | Large trees can be difficult to interpret and the decisions they make may seem counterintuitive |
| More efficient than other complex models | |
Identifying Risky Bank Loans
Trial 10:

```
                 | predicted default
  actual default |         1 |         2 | Row Total
-----------------|-----------|-----------|-----------
               1 |        57 |        10 |        67
                 |      0.57 |      0.10 |
-----------------|-----------|-----------|-----------
               2 |        17 |        16 |        33
                 |      0.17 |      0.16 |
-----------------|-----------|-----------|-----------
    Column Total |        74 |        26 |       100
-----------------|-----------|-----------|-----------
```

Trial 9:

```
                 | predicted default
  actual default |         1 |         2 | Row Total
-----------------|-----------|-----------|-----------
               1 |        60 |         7 |        67
                 |      0.60 |      0.07 |
-----------------|-----------|-----------|-----------
               2 |        19 |        14 |        33
                 |      0.19 |      0.14 |
-----------------|-----------|-----------|-----------
    Column Total |        79 |        21 |       100
-----------------|-----------|-----------|-----------
```
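A sketch of how tables like these are produced with the C50 package, assuming a `credit.csv` file with a `default` factor column (file name, column name, and split are assumptions); `trials` controls the number of boosting iterations:

```r
library(C50)
library(gmodels)

credit <- read.csv("credit.csv", stringsAsFactors = TRUE)  # assumed file
credit_train <- credit[1:900, ]
credit_test  <- credit[901:1000, ]

# a single C5.0 tree, then an adaptively boosted ensemble of 10 trees
credit_model <- C5.0(default ~ ., data = credit_train)
credit_boost <- C5.0(default ~ ., data = credit_train, trials = 10)

credit_pred <- predict(credit_boost, credit_test)
CrossTable(credit_test$default, credit_pred, prop.chisq = FALSE,
           dnn = c("actual default", "predicted default"))
```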

RIPPER Algorithm

| Strengths | Weaknesses |
| --- | --- |
| Generates easy-to-understand, human-readable rules | May result in rules that seem to defy common sense or expert knowledge |
| Efficient on large and noisy datasets | Not ideal for working with numeric data |
| Generally produces a simpler model than a comparable decision tree | Might not perform as well as more complex models |
Identifying Poisonous Mushrooms
```
=== Summary ===

Correctly Classified Instances        8004               98.5229 %
Incorrectly Classified Instances       120                1.4771 %
Kappa statistic                          0.9704
Mean absolute error                      0.0148
Root mean squared error                  0.1215
Relative absolute error                  2.958  %
Root relative squared error             24.323  %
Total Number of Instances             8124

Number of Rules : 9
```
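The summary above is in Weka's output format; RIPPER is available from R through the RWeka package. A minimal sketch, assuming a `mushrooms.csv` file with a `type` class column (file name and column name are assumptions):

```r
library(RWeka)  # JRip() wraps Weka's RIPPER rule learner

mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)  # assumed file

# learn a compact rule set separating edible from poisonous mushrooms
mushroom_JRip <- JRip(type ~ ., data = mushrooms)
mushroom_JRip           # prints the rules themselves
summary(mushroom_JRip)  # prints an evaluation summary like the one above
```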

Linear Regression (Numeric Prediction)

Regression is concerned with specifying the relationship between a single numeric dependent variable (the value to be predicted) and one or more numeric independent variables (the predictors).

Multiple Linear Regression

| Strengths | Weaknesses |
| --- | --- |
| By far the most common approach for modeling numeric data | Makes strong assumptions about the data |
| Can be adapted to nearly any modeling task | The model's form must be specified by the user in advance |
| Provides estimates of both the strength and size of the relationships among features and the outcome | Only works with numeric features, so categorical data requires extra processing |
Predicting Medical Expenses

(Figures: histogram and exploratory plots.)
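A minimal sketch using base R's `lm()`, assuming an `insurance.csv` file with an `expenses` column (file name and column name are assumptions):

```r
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)  # assumed file

hist(insurance$expenses)  # distribution of the value to be predicted

# ordinary least squares: model expenses as a linear function of all other features
ins_model <- lm(expenses ~ ., data = insurance)

summary(ins_model)  # coefficient estimates, significance tests, and R-squared
```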

Regression Trees (Numeric Prediction)

| Strengths | Weaknesses |
| --- | --- |
| Combines the strengths of decision trees with the ability to model numeric data | Not as well-known as linear regression |
| Does not require the user to specify the model in advance | Requires a large amount of training data |
| Uses automatic feature selection, which allows the approach to be used with a very large number of features | Difficult to determine the overall net effect of individual features on the outcome |
| May fit some types of data much better than linear regression | Large trees can become more difficult to interpret than a regression model |
Estimating the Quality of Wines

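A sketch with the rpart package, assuming a `whitewines.csv` file with a numeric `quality` column (file name, column name, and split are assumptions):

```r
library(rpart)       # CART-style regression trees
library(rpart.plot)  # tree visualization

wine <- read.csv("whitewines.csv")  # assumed file
wine_train <- wine[1:3750, ]
wine_test  <- wine[3751:4898, ]

# each leaf predicts the mean quality of the training wines that reach it
m_rpart <- rpart(quality ~ ., data = wine_train)
rpart.plot(m_rpart, digits = 3)

p_rpart <- predict(m_rpart, wine_test)
cor(p_rpart, wine_test$quality)  # how well predictions track actual quality
```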

Neural Networks

An Artificial Neural Network (ANN) models the relationship between a set of input signals and an output signal using a model derived from our understanding of how a biological brain responds to stimuli from sensory inputs.

| Strengths | Weaknesses |
| --- | --- |
| Can be adapted to classification or numeric prediction problems | Extremely computationally intensive and slow to train, particularly if the network topology is complex |
| Capable of modeling more complex patterns than nearly any algorithm | Very prone to overfitting training data |
| Makes few assumptions about the data's underlying relationships | Results in a complex black box model that is difficult, if not impossible, to interpret |
Modeling the Strength of Concrete
(Figures: neural network topology for the trained model and for the improved model.)
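A sketch with the neuralnet package, assuming a `concrete.csv` file with a numeric `strength` column (file name, column name, split, and hidden-layer sizes are assumptions):

```r
library(neuralnet)

concrete <- read.csv("concrete.csv")  # assumed file, all-numeric features

# rescale to [0, 1]: ANNs train poorly when inputs sit on very different scales
normalize  <- function(x) (x - min(x)) / (max(x) - min(x))
concrete_n <- as.data.frame(lapply(concrete, normalize))
train      <- concrete_n[1:773, ]
test       <- concrete_n[774:1030, ]

# simple model with one hidden node, then an improved model with five
m1 <- neuralnet(strength ~ ., data = train, hidden = 1)  # the '.' shorthand
m2 <- neuralnet(strength ~ ., data = train, hidden = 5)  # needs neuralnet >= 1.44

results <- compute(m2, test[names(test) != "strength"])
cor(results$net.result, test$strength)  # correlation with actual strength
```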

Support Vector Machines

A Support Vector Machine (SVM) can be imagined as a surface that creates a boundary between points of data plotted in multidimensional space that represent examples and their feature values.

| Strengths | Weaknesses |
| --- | --- |
| Can be used for classification or numeric prediction problems | Finding the best model requires testing of various combinations of kernels and model parameters |
| Not overly influenced by noisy data and not very prone to overfitting | Can be slow to train, particularly if the input dataset has a large number of features or examples |
| May be easier to use than neural networks, particularly due to the existence of several well-supported SVM algorithms | Results in a complex black box model that is difficult, if not impossible, to interpret |
Performing OCR

Proportion of test letters classified incorrectly (FALSE) and correctly (TRUE):

| Model | FALSE | TRUE |
| --- | --- | --- |
| Trained model | 0.16075 | 0.83925 |
| Improved model | 0.0695 | 0.9305 |
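A sketch with kernlab's `ksvm()`, assuming a `letterdata.csv` file with a `letter` class column (file name, column name, and split are assumptions); one common improvement is swapping the linear kernel for an RBF kernel:

```r
library(kernlab)

letters <- read.csv("letterdata.csv", stringsAsFactors = TRUE)  # assumed file
letters_train <- letters[1:16000, ]
letters_test  <- letters[16001:20000, ]

# linear kernel first; an RBF (Gaussian) kernel is a common improvement
m_linear <- ksvm(letter ~ ., data = letters_train, kernel = "vanilladot")
m_rbf    <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot")

pred <- predict(m_rbf, letters_test)
prop.table(table(pred == letters_test$letter))  # FALSE/TRUE shares, as above
```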

Unsupervised Learning Algorithms

Association Rules (Pattern Detection)

Much work has been done to identify heuristic algorithms for reducing the number of itemsets to search. Perhaps the most widely used approach for efficiently searching large databases for rules is known as Apriori. Introduced in 1994 by Rakesh Agrawal and Ramakrishnan Srikant, the Apriori algorithm has since become somewhat synonymous with association rule learning.

| Strengths | Weaknesses |
| --- | --- |
| Is capable of working with large amounts of transactional data | Not very helpful for small datasets |
| Results in rules that are easy to understand | Requires effort to separate the true insight from common sense |
| Useful for "data mining" and discovering unexpected knowledge in databases | Easy to draw spurious conclusions from random patterns |
Identifying Frequently Purchased Groceries


| LHS | RHS | Support | Confidence | Lift | Count |
| --- | --- | --- | --- | --- | --- |
| {herbs} | {root vegetables} | 0.007015760 | 0.4312500 | 3.956477 | 69 |
| {berries} | {whipped/sour cream} | 0.009049314 | 0.2721713 | 3.796886 | 89 |
| {other vegetables, tropical fruit, whole milk} | {root vegetables} | 0.007015760 | 0.4107143 | 3.768074 | 69 |
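A sketch of how such rules are mined with the arules package, assuming a basket-format `groceries.csv` file (file name and parameter thresholds are assumptions):

```r
library(arules)

# basket format: each line lists the items in one transaction
groceries <- read.transactions("groceries.csv", sep = ",")  # assumed file

# Apriori prunes itemsets below minimum support before building rules
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006,
                                         confidence = 0.25,
                                         minlen = 2))

inspect(sort(groceryrules, by = "lift")[1:3])  # top rules by lift, as above
```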

k-means Clustering

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. The k-means algorithm is perhaps the most commonly used clustering method.

| Strengths | Weaknesses |
| --- | --- |
| Uses simple principles that can be explained in non-statistical terms | Not as sophisticated as more modern clustering algorithms |
| Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings | Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters |
| Performs well enough under many real-world use cases | Requires a reasonable guess as to how many clusters naturally exist in the data |
| | Not ideal for non-spherical clusters or clusters of widely varying density |
Finding Teen Market Segments

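A sketch with base R's `kmeans()`, assuming an `snsdata.csv` file whose columns 5 through 40 hold interest-keyword counts (file name, column range, and k = 5 are assumptions):

```r
teens <- read.csv("snsdata.csv")  # assumed file of teen social-media profiles
interests <- teens[5:40]          # assumed interest-keyword columns

# z-score standardization so high-frequency keywords don't dominate distance
interests_z <- as.data.frame(lapply(interests, scale))

set.seed(2345)  # k-means seeds clusters randomly; fix the seed to reproduce
teen_clusters <- kmeans(interests_z, centers = 5)

teen_clusters$size     # how many teens fall in each segment
teen_clusters$centers  # per-cluster means of the standardized interests
```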

Meta-Learning Algorithms

Bagging (Dual Use)

As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm.
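A minimal sketch with the ipred package, which bags decision trees (the `credit.csv` file and `default` column are assumptions carried over from the C5.0 example):

```r
library(ipred)  # bagging() builds bagged decision trees

credit <- read.csv("credit.csv", stringsAsFactors = TRUE)  # assumed file

# 25 bootstrap samples of the training data, one tree per sample;
# predictions are made by majority vote across the 25 trees
mybag <- bagging(default ~ ., data = credit, nbagg = 25)
credit_pred <- predict(mybag, credit)
```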

Boosting (Dual Use)

A method that boosts the performance of weak learners to attain the performance of stronger learners is called boosting.
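A minimal sketch with the adabag package's AdaBoost.M1 implementation (dataset assumptions as above):

```r
library(adabag)  # boosting() implements AdaBoost.M1

credit <- read.csv("credit.csv", stringsAsFactors = TRUE)  # assumed file

# each new tree concentrates on the examples earlier trees misclassified,
# and the ensemble votes with learner weights based on accuracy
m_adaboost <- boosting(default ~ ., data = credit)
p_adaboost <- predict(m_adaboost, credit)
p_adaboost$confusion  # confusion matrix for the boosted ensemble
```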

Random Forests (Dual Use)

This method combines the base principles of bagging with random feature selection to add additional diversity to the decision tree models.

| Strengths | Weaknesses |
| --- | --- |
| An all-purpose model that performs well on most problems | Unlike a decision tree, the model is not easily interpretable |
| Can handle noisy or missing data as well as categorical or continuous features | May require some work to tune the model to the data |
| Selects only the most important features | |
| Can be used on data with an extremely large number of features or examples | |
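A minimal sketch with the randomForest package (dataset assumptions as above):

```r
library(randomForest)

credit <- read.csv("credit.csv", stringsAsFactors = TRUE)  # assumed file

set.seed(300)  # forests are randomized; fix the seed to reproduce
# 500 trees by default; each split considers a random subset of the features,
# which decorrelates the trees and adds diversity to the ensemble
rf <- randomForest(default ~ ., data = credit)
rf  # prints the out-of-bag error estimate and confusion matrix
```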