A few useful things to know about machine learning
A white paper by Pedro Domingos
This paper by Pedro Domingos gives a brief overview of the key things worth knowing about machine learning and how learning models are built.
It focuses mainly on classification, overfitting, and the curse of dimensionality.
A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. The test of the learner is whether this classifier produces the correct output y for future examples x (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).
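To make this concrete, here is a minimal sketch (not from the paper; the features and thresholds are invented for illustration) of a classifier as a function from a feature vector to a discrete class:

```python
from typing import List

def classify(x: List[float]) -> str:
    # Toy spam filter. x[0]: fraction of capitalized words,
    # x[1]: count of the word "free". Thresholds are hypothetical.
    return "spam" if x[0] > 0.5 or x[1] >= 3 else "not spam"

# The real test is how it labels previously unseen examples:
unseen = [[0.7, 0], [0.1, 1], [0.2, 5]]
print([classify(x) for x in unseen])  # ['spam', 'not spam', 'spam']
```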
Learning = Representation + Evaluation + Optimization
Representation:
- Choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner.
- The hypothesis space is the set of all legal hypotheses. It is the set from which the learning algorithm selects the single hypothesis that best describes the target function or the outputs.
- A hypothesis is a function that best describes the target.
- Best solution = the chosen hypothesis
Evaluation:
- An evaluation function (also called an objective or scoring function) is needed to distinguish good classifiers from bad ones.
Optimization:
- A method to search among the classifiers in the language for the highest-scoring one.
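To see all three components in one place, here is a hedged sketch using scikit-learn (the dataset, model family, and search range are assumptions for illustration, not the paper's):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Representation: decision trees of bounded depth (the hypothesis space).
# Evaluation: mean cross-validated accuracy.
# Optimization: a simple exhaustive search over the depth hyperparameter.
scores = {d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                             X, y, cv=5).mean()
          for d in (1, 2, 3, 4, 5)}
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```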
In this paper the author also clearly explains overfitting and underfitting, their main causes, and how to tackle them.
Overfitting: when the model performs well on training data but poorly on test data. Generalization error can be decomposed into bias and variance.
Bias: the learner's tendency to consistently learn the same wrong thing (a systematic error; high bias leads to underfitting).
Variance: the learner's tendency to learn random things irrespective of the real signal, i.e., to fit noise in the training set (high variance leads to overfitting).
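The train/test gap is easy to reproduce. Below is a toy sketch (an assumed setup, not from the paper) that fits polynomials of increasing degree to noisy data; training error keeps shrinking while test error eventually grows, which is the signature of overfitting:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40).reshape(-1, 1)
y = np.sin(3 * x).ravel() + rng.normal(0, 0.2, 40)
x_tr, y_tr, x_te, y_te = x[:25], y[:25], x[25:], y[25:]

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),   # train error
          mean_squared_error(y_te, model.predict(x_te)))   # test error
```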
How to avoid overfitting:
- Cross-validation
- Regularization (adding a penalty term to the evaluation function; see the sketch after this list)
- Statistical significance tests such as chi-square, applied before adding new structure to the model
- Increasing the amount of training data
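As one concrete example from the list above, here is a minimal regularization sketch (an illustrative setup, not the paper's): strengthening the L2 penalty of a logistic regression (smaller C in scikit-learn) typically narrows the gap between training and test accuracy on a noisy, high-dimensional problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (100.0, 1.0, 0.01):  # smaller C = stronger L2 regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    print(C, clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```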
Curse of dimensionality
- Many algorithms that work fine in low dimensions become intractable when the input is high dimensional.
- The similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions (a numeric demo follows this list).
- It is difficult to visualize the high dimensional data.
- In high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier.
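The breakdown of similarity can be checked numerically. The short demo below (assuming points drawn uniformly at random) shows that as the dimension grows, a query point's nearest and farthest neighbors become almost equidistant, so nearest-neighbor distinctions lose their meaning:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    print(d, dists.min() / dists.max())  # ratio approaches 1 as d grows
```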
Dimensionality reduction techniques:
- Feature selection.
- Feature extraction.
- Principal Component Analysis (PCA) and similar projection methods (a PCA sketch follows this list).
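For instance, here is a minimal PCA sketch (illustrative, not from the paper) that projects scikit-learn's 64-dimensional digit images onto their first two principal components:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)   # X has shape (1797, 64)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # reduced to shape (1797, 2)
print(X_2d.shape, pca.explained_variance_ratio_)
```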
All these topics are explained in detail in the paper; the link is in the reference section.
Reference:
- Pedro Domingos, "A Few Useful Things to Know About Machine Learning," Communications of the ACM, 2012: https://dl.acm.org/doi/pdf/10.1145/2347736.2347755
- Explanation of hypothesis, hypothesis space, bias, and variance: https://www.gla.ac.in/pdf/inductive_bias_hypothesis_hypothesis_space_variance.pdf