## Supervised Learning Algorithms

- Nearest Neighbor Classification
- Naive Bayes Classification
- Decision Trees Classification
- Linear Regression Numeric prediction
- Regression Trees Numeric prediction
- Neural Networks
- Support Vector Machines


### Nearest Neighbor Classification

#### Nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of similar labeled examples. Despite the simplicity of this idea, nearest neighbor methods are extremely powerful.

Strengths | Weaknesses |
---|---|
Simple and effective | Does not produce a model, limiting the ability to understand how the features are related to the class |
Makes no assumptions about the underlying data distribution | Requires selection of an appropriate k |
Fast training phase | Slow classification phase |
| | Nominal features and missing data require additional processing |
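
As a rough illustration of "assigning the class of similar labeled examples", here is a minimal k-nearest-neighbor sketch. The Euclidean distance, the majority vote, and the toy data are assumptions made for the example, not details taken from the notes.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two numeric features per example.
train = [
    ((1.0, 1.0), "benign"),
    ((1.2, 0.8), "benign"),
    ((0.9, 1.1), "benign"),
    ((3.0, 3.2), "malignant"),
    ((3.1, 2.9), "malignant"),
]

print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (3.0, 3.0), k=3))
```

There is no training step beyond storing the data, which matches the "fast training, slow classification" trade-off in the table: every prediction scans the whole training set.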

##### Diagnosing Breast Cancer

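wbcd_test_labels (rows) vs. wbcd_test_pred (columns) | Benign | Malignant | Row Total |
---|---|---|---|
Benign: count | 77 | 0 | 77 |
Benign: row proportion | 1.000 | 0.000 | 0.770 |
Benign: column proportion | 0.975 | 0.000 | |
Benign: table proportion | 0.770 | 0.000 | |
Malignant: count | 2 | 21 | 23 |
Malignant: row proportion | 0.087 | 0.913 | 0.230 |
Malignant: column proportion | 0.025 | 1.000 | |
Malignant: table proportion | 0.020 | 0.210 | |
Column Total: count | 79 | 21 | 100 |
Column Total: proportion | 0.790 | 0.210 | 0.210 |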
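
The counts in the cross-tabulation above can be condensed into a few summary metrics. The sketch below assumes "Malignant" is the positive class; the function name `summarize` is made up for this example.

```python
# Counts copied from the cross-tabulation: (actual, predicted) -> count.
counts = {
    ("Benign", "Benign"): 77,
    ("Benign", "Malignant"): 0,
    ("Malignant", "Benign"): 2,
    ("Malignant", "Malignant"): 21,
}

def summarize(counts, positive="Malignant"):
    """Return accuracy, sensitivity and specificity for a two-class table."""
    total = sum(counts.values())
    correct = sum(n for (actual, pred), n in counts.items() if actual == pred)
    actual_pos = sum(n for (actual, _), n in counts.items() if actual == positive)
    actual_neg = total - actual_pos
    true_pos = counts[(positive, positive)]
    true_neg = correct - true_pos  # valid because there are only two classes
    return {
        "accuracy": correct / total,
        "sensitivity": true_pos / actual_pos,
        "specificity": true_neg / actual_neg,
    }

metrics = summarize(counts)
print(metrics)
```

With these counts, 98 of 100 test cases are classified correctly, and the two errors are malignant cases predicted as benign.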

### Naive Bayes Classification

#### The technique descended from the work of the 18th century mathematician Thomas Bayes, who developed foundational principles to describe the probability of events, and how probabilities should be revised in the light of additional information. These principles formed the foundation for what are now known as Bayesian methods.

Strengths | Weaknesses |
---|---|
Simple, fast, and very effective | Relies on an often-faulty assumption of equally important and independent features |
Does well with noisy and missing data | Not ideal for datasets with many numeric features |
Requires relatively few examples for training, but also works well with very large numbers of examples | Estimated probabilities are less reliable than the predicted classes |
Easy to obtain the estimated probability for a prediction | |
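
To make the "independent features" assumption concrete, the following sketch multiplies per-word probabilities as if words occurred independently given the class. It is a small presence/absence (Bernoulli-style) variant with a Laplace estimator of 1; the function names and the toy messages are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(messages):
    """messages: list of (words, label). Counts classes and per-class word presence."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in messages:
        class_counts[label] += 1
        for w in set(words):  # count each word once per message
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_nb(model, words, laplace=1):
    """Return the class with the highest posterior score for the given words."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    present = set(words)
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)  # prior probability of the class
        for w in vocab:
            # Smoothed probability that the word appears in a message of this class.
            p_w = (word_counts[label][w] + laplace) / (n + 2 * laplace)
            log_p += math.log(p_w if w in present else 1 - p_w)
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy messages.
messages = [
    (["free", "prize", "call"], "spam"),
    (["free", "cash", "now"], "spam"),
    (["lunch", "at", "noon"], "ham"),
    (["call", "me", "at", "noon"], "ham"),
    (["see", "you", "at", "lunch"], "ham"),
]
model = train_nb(messages)
print(predict_nb(model, ["free", "cash", "prize"]))
print(predict_nb(model, ["lunch", "at", "noon"]))
```

Working in log space avoids underflow when many small probabilities are multiplied, and the Laplace term keeps a word unseen in one class from forcing that class's probability to zero.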

##### Filtering Mobile Phone Spam

### Decision Trees Classification

#### Decision tree learners are powerful classifiers, which utilize a tree structure to model the relationships among the features and the potential outcomes.

### C5.0 Algorithm

Strengths | Weaknesses |
---|---|
An all-purpose classifier that does well on most problems | Decision tree models are often biased toward splits on features having a large number of levels |
Highly automatic learning process, which can handle numeric or nominal features, as well as missing data | It is easy to overfit or underfit the model |
Excludes unimportant features | Can have trouble modeling some relationships due to reliance on axis-parallel splits |
Can be used on both small and large datasets | Small changes in the training data can result in large changes to decision logic |
Results in a model that can be interpreted without a mathematical background (for relatively small trees) | Large trees can be difficult to interpret and the decisions they make may seem counterintuitive |
More efficient than other complex models | |
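
Decision tree learners in this family are commonly described as choosing splits with an entropy-based measure. The sketch below shows entropy and information gain for a candidate split; it illustrates the idea only and is not the actual C5.0 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]
print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A split that separates the classes completely removes all entropy, while a split that leaves each group as mixed as the parent gains nothing.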

##### Identifying Risky Bank Loans

###### Trial 10

actual default (rows) vs. predicted default (columns) | 1 | 2 | Row Total |
---|---|---|---|
1: count | 57 | 10 | 67 |
1: table proportion | 0.57 | 0.10 | |
2: count | 17 | 16 | 33 |
2: table proportion | 0.17 | 0.16 | |
Column Total | 74 | 26 | 100 |

###### Trial 9

actual default (rows) vs. predicted default (columns) | 1 | 2 | Row Total |
---|---|---|---|
1: count | 60 | 7 | 67 |
1: table proportion | 0.60 | 0.07 | |
2: count | 19 | 14 | 33 |
2: table proportion | 0.19 | 0.14 | |
Column Total | 79 | 21 | 100 |
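
The two result tables above can be compared with a simple error-rate calculation. The counts are copied from the tables; the helper name `error_rate` is made up for this sketch.

```python
# (actual, predicted) -> count, copied from the two tables above.
trials = {
    "Trial 10": {(1, 1): 57, (1, 2): 10, (2, 1): 17, (2, 2): 16},
    "Trial 9": {(1, 1): 60, (1, 2): 7, (2, 1): 19, (2, 2): 14},
}

def error_rate(counts):
    """Fraction of cases where the predicted class differs from the actual class."""
    total = sum(counts.values())
    wrong = sum(n for (actual, predicted), n in counts.items() if actual != predicted)
    return wrong / total

rates = {name: error_rate(counts) for name, counts in trials.items()}
for name, rate in rates.items():
    print(name, rate)
```

The overall error rates are close (27 versus 26 misclassified cases out of 100), so the more interesting difference is which kind of error each trial makes.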

### Ripper Algorithm

Strengths | Weaknesses |
---|---|
Generates easy-to-understand, human-readable rules | May result in rules that seem to defy common sense or expert knowledge |
Efficient on large and noisy datasets | Not ideal for working with numeric data |
Generally produces a simpler model than a comparable decision tree | Might not perform as well as more complex models |

##### Identifying Poisonous Mushrooms

Summary metric | Value |
---|---|
Correctly Classified Instances | 8004 (98.5229 %) |
Incorrectly Classified Instances | 120 (1.4771 %) |
Kappa statistic | 0.9704 |
Mean absolute error | 0.0148 |
Root mean squared error | 0.1215 |
Relative absolute error | 2.958 % |
Root relative squared error | 24.323 % |
Total Number of Instances | 8124 |
Number of Rules | 9 |

### Linear Regression Numeric prediction

#### Regression is concerned with specifying the relationship between a single numeric dependent variable (the value to be predicted) and one or more numeric independent variables (the predictors).

### Multiple Linear Regression

Strengths | Weaknesses |
---|---|
By far the most common approach for modeling numeric data | Makes strong assumptions about the data |
Can be adapted to almost any modeling task | The model's form must be specified by the user in advance |
Provides estimates of both the strength and size of the relationships among features and the outcome | Only works with numeric features, so categorical data requires extra processing |
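
For intuition, the single-predictor case has a closed-form ordinary least squares solution: the slope is the covariance of x and y divided by the variance of x. The sketch below covers only that simple case, on made-up numbers; models with several predictors are normally fitted with matrix methods.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)
```

The slope estimates the size of the relationship: the expected change in the outcome for a one-unit increase in the predictor.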

##### Predicting Medical Expenses

### Regression Trees Numeric prediction

Strengths | Weaknesses |
---|---|
Combines the strengths of decision trees with the ability to model numeric data | Not as well-known as linear regression |
Does not require the user to specify the model in advance | Requires a large amount of training data |
Uses automatic feature selection, which allows the approach to be used with a very large number of features | Difficult to determine the overall net effect of individual features on the outcome |
May fit some types of data much better than linear regression | Large trees can become more difficult to interpret than a regression model |
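
For numeric targets, one commonly described splitting criterion is standard deviation reduction (SDR): how much the spread of the target shrinks after a split. The sketch below assumes the population standard deviation so that single-value groups are allowed; real implementations may differ in this detail.

```python
import statistics

def sdr(values, groups):
    """Standard deviation before the split minus the size-weighted deviation after it."""
    n = len(values)
    after = sum(len(g) / n * statistics.pstdev(g) for g in groups)
    return statistics.pstdev(values) - after

# Hypothetical target values and two candidate splits.
values = [1, 1, 1, 5, 5, 5]
split_a = [[1, 1, 1], [5, 5, 5]]  # groups are perfectly homogeneous
split_b = [[1, 1, 5], [1, 5, 5]]  # groups stay mixed
print(sdr(values, split_a))
print(sdr(values, split_b))
```

The split with the larger reduction produces more homogeneous groups and would be preferred when growing the tree.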

##### Estimating the Quality of Wines

### Neural Networks

#### An Artificial Neural Network (ANN) models the relationship between a set of input signals and an output signal using a model derived from our understanding of how a biological brain responds to stimuli from sensory inputs.

Strengths | Weaknesses |
---|---|
Can be adapted to classification or numeric prediction problems | Extremely computationally intensive and slow to train, particularly if the network topology is complex |
Capable of modeling more complex patterns than nearly any algorithm | Very prone to overfitting training data |
Makes few assumptions about the data's underlying relationships | Results in a complex black box model that is difficult, if not impossible, to interpret |
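
The relationship between input signals and an output signal can be sketched as a forward pass: each neuron takes a weighted sum of its inputs and passes it through an activation function. The sigmoid activation, the single hidden layer, and all weights below are assumptions chosen for illustration; training the weights is not shown.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_unit):
    """Forward pass through one hidden layer and a single output neuron."""
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_unit
    return neuron(hidden, w_out, b_out)

# Hypothetical weights: two inputs, two hidden neurons, one output.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_unit = ([1.2, -0.7], 0.05)
print(forward([1.0, 2.0], hidden_layer, output_unit))
```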

##### Modeling the strength of concrete

##### Train model

##### Improved model

### Support Vector Machines

#### A Support Vector Machine (SVM) can be imagined as a surface that creates a boundary between points of data plotted in multidimensional space that represent examples and their feature values.

Strengths | Weaknesses |
---|---|
Can be used for classification or numeric prediction problems | Finding the best model requires testing of various combinations of kernels and model parameters |
Not overly influenced by noisy data and not very prone to overfitting | Can be slow to train, particularly if the input dataset has a large number of features or examples |
May be easier to use than neural networks, particularly due to the existence of several well-supported SVM algorithms | Results in a complex black box model that is difficult, if not impossible, to interpret |
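
In the linear case, the boundary surface is a hyperplane defined by a weight vector `w` and an offset `b`, and an example is classified by which side of the hyperplane it falls on. The sketch below only evaluates a given hyperplane; finding the maximum-margin hyperplane (the actual training step) and kernels are not shown, and the values of `w` and `b` are made up.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign indicates the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the boundary."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_boundary(w, b, x):
    """Geometric distance from x to the hyperplane."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
w, b = (1.0, 1.0), -3.0
print(classify(w, b, (3.0, 3.0)), classify(w, b, (0.0, 0.0)))
print(distance_to_boundary(w, b, (0.0, 0.0)))
```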

##### Performing OCR

Model | False | True |
---|---|---|
Train data result | 0.16075 | 0.83925 |
Improved Data | 0.0695 | 0.9305 |

## Unsupervised Learning Algorithms

### Association Rules Pattern detection

#### Much work has been done to identify heuristic algorithms for reducing the number of itemsets to search. Perhaps the most widely used approach for efficiently searching large databases for rules is known as Apriori. Introduced in 1994 by Rakesh Agrawal and Ramakrishnan Srikant, the Apriori algorithm has since become somewhat synonymous with association rule learning.

Strengths | Weaknesses |
---|---|
Is capable of working with large amounts of transactional data | Not very helpful for small datasets |
Results in rules that are easy to understand | Requires effort to separate the true insight from common sense |
Useful for "data mining" and discovering unexpected knowledge in databases | Easy to draw spurious conclusions from random patterns |

##### Identifying Frequently Purchased Groceries

LHS | RHS | Support | Confidence | Lift | Count |
---|---|---|---|---|---|
{herbs} | => {root vegetables} | 0.007015760 | 0.4312500 | 3.956477 | 69 |
{berries} | => {whipped/sour cream} | 0.009049314 | 0.2721713 | 3.796886 | 89 |
{other vegetables, tropical fruit, whole milk} | => {root vegetables} | 0.007015760 | 0.4107143 | 3.768074 | 69 |
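
The Support, Confidence and Lift columns above follow from simple counts over the transactions. The sketch below computes them for one rule on a handful of made-up transactions; it does not implement the Apriori search itself, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs."""
    both = support(transactions, set(lhs) | set(rhs))
    confidence = both / support(transactions, lhs)
    lift = confidence / support(transactions, rhs)
    return both, confidence, lift

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"herbs", "root vegetables"},
    {"herbs", "root vegetables", "milk"},
    {"herbs", "milk"},
    {"milk"},
    {"milk", "bread"},
]
s, c, l = rule_metrics(transactions, {"herbs"}, {"root vegetables"})
print(s, c, l)
```

A lift above 1 means the right-hand side appears more often together with the left-hand side than its overall frequency would suggest.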

### k-means clustering

#### Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items. The k-means algorithm is perhaps the most commonly used clustering method.

Strengths | Weaknesses |
---|---|
Uses simple principles that can be explained in non-statistical terms | Not as sophisticated as more modern clustering algorithms |
Highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings | Because it uses an element of random chance, it is not guaranteed to find the optimal set of clusters |
Performs well enough under many real-world use cases | Requires a reasonable guess as to how many clusters naturally exist in the data |
| | Not ideal for non-spherical clusters or clusters of widely varying density |
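
A minimal sketch of the usual k-means loop: pick k starting centers at random, assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. The random initialization from the data, the fixed seed, and the toy points are assumptions for this example.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Return (centers, clusters) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting centers taken from the data
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved, so the assignment is stable
            break
        centers = new_centers
    return centers, clusters

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
```

Because the starting centers are random, different seeds can give different results on harder data, which is the "element of random chance" weakness listed in the table.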

##### Finding Teen Market Segments

## Meta-Learning Algorithms

### Bagging Dual use

#### As described by Leo Breiman in 1994, bagging generates a number of training datasets by bootstrap sampling the original training data. These datasets are then used to generate a set of models using a single learning algorithm.
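
A minimal sketch of the bagging idea: draw bootstrap samples (sampling with replacement, same size as the original data), fit one model per sample with the same learner, and combine predictions by majority vote. The base learner here, a nearest-class-mean rule on one numeric feature, is a stand-in chosen for brevity; all names and data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_mean_learner(sample):
    """Toy base learner: predict the class whose mean feature value is closest."""
    sums = {}
    for x, label in sample:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    means = {label: total / count for label, (total, count) in sums.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_fit(data, learner, n_models=25, seed=0):
    """Fit one model per bootstrap sample, all with the same learner."""
    rng = random.Random(seed)
    return [learner(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the models' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

data = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
        (8.0, "high"), (8.5, "high"), (9.0, "high")]
models = bagging_fit(data, nearest_mean_learner)
print(bagging_predict(models, 1.2), bagging_predict(models, 8.8))
```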

### Boosting Dual use

#### A method that boosts the performance of weak learners to attain the performance of stronger learners is called boosting.
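
One well-known boosting scheme, AdaBoost, makes later weak learners focus on the examples that earlier ones got wrong by reweighting the training data after each round. The sketch below shows only that reweighting step, assuming the weights sum to 1 and the weak learner's weighted error is between 0 and 0.5; it is an illustration, not necessarily the variant these notes have in mind.

```python
import math

def adaboost_reweight(weights, correct):
    """Return (alpha, new_weights) after one AdaBoost round.

    `weights` are the current example weights (summing to 1) and `correct`
    flags whether the weak learner classified each example correctly.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must be strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this learner
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After the update, the misclassified examples carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.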

### Random Forests Dual use

#### This method combines the base principles of bagging with random feature selection to add diversity to the decision tree models.

Strengths | Weaknesses |
---|---|
An all-purpose model that performs well on most problems | Unlike a decision tree, the model is not easily interpretable |
Can handle noisy or missing data as well as categorical or continuous features | May require some work to tune the model to the data |
Selects only the most important features | |
Can be used on data with an extremely large number of features or examples | |
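
The random feature selection step can be sketched as follows: each time a tree considers a split, it looks at only a random subset of the features. Using the square root of the number of features as the subset size is a common default for classification and is an assumption here; the function and feature names are hypothetical, and growing the trees is not shown.

```python
import math
import random

def random_feature_subset(features, rng, m=None):
    """Pick m features at random (without replacement) for one split.

    If m is not given, use the integer part of sqrt(number of features).
    """
    if m is None:
        m = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, m)

features = [f"feature_{i}" for i in range(16)]
rng = random.Random(0)
subset = random_feature_subset(features, rng)
print(subset)
```

Because each split sees different features, the trees in the ensemble differ from one another more than they would with bagging alone.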