Helpful to have already answered: Bias vs. variance tradeoff | Cross validation

Visually Explained | StatQuest | Intuitive ML


What is a binary classification task?

A binary classification task is one where each observation is assigned to one of two possible classes. Read here.



What is a Maximal Margin Classifier?

Maximal Margin Classifier

The idea is to simply draw a line through the data which separates it into two classes. A maximal margin classifier places the threshold at the halfway point between the closest observations of class_a and class_b, whereas a Support Vector Classifier uses a more robust approach, allowing some misclassifications.

A few more definitions:
☞ Also known as a hard margin classifier (does not allow any misclassifications in training).
☞ The maximal margin classifier is a crude, hypothetical model which motivates the need for the Support Vector Classifier.
☞ It is the result of using a threshold which maximises the margin.
☞ The maximal margin classifier is the hyperplane with the maximum margin, subject to the constraint that every training observation is classified correctly.
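As a concrete illustration, here is a minimal 1-D sketch in Python of the "halfway point" idea (the numbers and names are hypothetical, not from the original notes):

```python
import numpy as np

# Hypothetical 1-D data: one feature, two classes.
class_a = np.array([1.0, 2.0, 3.0])   # observations from class A
class_b = np.array([7.0, 8.0, 9.0])   # observations from class B

# The maximal margin threshold sits halfway between the closest
# observations of each class (assuming class A lies below class B).
edge_a = class_a.max()              # class-A point closest to class B
edge_b = class_b.min()              # class-B point closest to class A
threshold = (edge_a + edge_b) / 2   # -> 5.0

def classify(x):
    """Assign a new observation to a class based on the threshold."""
    return "class_a" if x < threshold else "class_b"

print(threshold, classify(4.0), classify(6.5))  # 5.0 class_a class_b
```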



What is a margin?

Margin

The margin is the shortest distance from the threshold (the decision boundary) to any observation.


What is a hard margin?

Hard margin

A hard margin is a hyperplane that can perfectly separate the data into two classes without any misclassification. Hard margins can only exist under two conditions:

  1. The data is linearly separable
  2. There are no outliers

Therefore, if either of these conditions is not satisfied, a hard margin cannot exist.
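scikit-learn has no explicit hard-margin switch, but a very large penalty C approximates one; a sketch, assuming hypothetical, linearly separable toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data with no outliers (hypothetical numbers).
X = np.array([[1, 1], [2, 1], [1, 2],     # class 0
              [5, 5], [6, 5], [5, 6]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C penalizes misclassification so heavily that the fitted
# soft margin behaves like a hard margin on separable data.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

print(clf.score(X, y))  # 1.0 -> every training point classified correctly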



What is a hyperplane?

Hyperplane

  • A hyperplane is a linear decision surface that splits the space into two parts.
  • A hyperplane in 2 dimensions is a line; in 3 dimensions it is a plane; in higher dimensions the general term is hyperplane.


  • A hyperplane in this space is described by the equation $\mathbf{w}^{\top}\mathbf{x} + b = 0$, where $\mathbf{w}$ is a vector of weights, $\mathbf{x}$ is a vector of features, and $b$ is the bias term.
  • A hyperplane in $\mathbb{R}^{n}$ is an $(n-1)$-dimensional subspace. Because it splits the space into two parts, a hyperplane naturally acts as a binary classifier.

Flat Affine Subspaces

(From mathematics)

  • Instead of saying "a point", a mathematician might say "a flat, affine, 0-dimensional subspace".
  • Instead of saying "a hyperplane", "a flat affine subspace".
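To make the "binary classifier" point concrete, here is a minimal sketch (with hypothetical weights and bias) of classifying by which side of the hyperplane a point falls on:

```python
import numpy as np

# Hypothetical hyperplane in 2-D: w.x + b = 0 is the line x1 + x2 - 3 = 0.
w = np.array([1.0, 1.0])  # weight vector, perpendicular to the hyperplane
b = -3.0                  # bias term

def side(x):
    """The sign of w.x + b tells us which half-space x falls in."""
    return np.sign(w @ x + b)

print(side(np.array([2.0, 2.0])))  # +1.0 -> one side of the hyperplane
print(side(np.array([0.5, 0.5])))  # -1.0 -> the other side
```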


What is a Support Vector?

Support Vector

  • The support vectors are the data points that determine the position and orientation of the decision boundary (hyperplane) between the different classes.
  • Note that a support vector is not a line, but a specific data point.
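If you fit a linear SVC with scikit-learn, the fitted model exposes exactly these points; a small sketch with hypothetical toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data.
X = np.array([[1, 1], [2, 1], [1, 2],
              [4, 4], [5, 4], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The model stores the specific data points that determine the decision
# boundary -- these are the support vectors.
print(clf.support_vectors_)  # a subset of the rows of X
print(clf.support_)          # the indices of those rows in X
```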



What is the weakness of Maximum Margin Classifiers?
  • They are extremely sensitive to outliers in the training data.
  • Sometimes, the data is not linearly separable due to noise and outliers.

↳ So, do Maximal Margin Classifiers have high bias or high variance?

Since MMCs are very sensitive to the training data, they have low bias and high variance.
☞ They partition the training data perfectly (low bias).
☞ As a result, they are susceptible to misclassifying real data (high variance).

Deciding whether or not to allow misclassifications by loosening the margin into a soft margin is an example of the bias vs. variance tradeoff that plagues all of machine learning.
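A sketch of this tradeoff in practice, using scikit-learn's C parameter on hypothetical data with one outlier; larger C behaves more like a hard margin, smaller C more like a soft one:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two clusters plus one class-1 outlier near class 0.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class 0
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],   # class 1
              [1.5, 1.5]])                          # class 1 outlier
y = np.array([0, 0, 0, 1, 1, 1, 1])

# Larger C punishes training misclassifications more heavily, pulling the
# boundary toward the outlier (lower bias, higher variance); smaller C
# tolerates the outlier in exchange for a wider, more stable margin
# (higher bias, lower variance).
for C in (1000.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y), clf.support_vectors_.shape[0])
```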



If you allow misclassifications, is it still a Maximum Margin Classifier?

No. Picking a threshold that allows misclassifications, instead of strictly maximising the margin, means that our maximal margin classifier evolves into a soft margin classifier.

⁂ A soft margin classifier is another name for a support vector classifier



What is the idea behind Support Vector Classifiers?
  • In practice, real data is messy and cannot be separated cleanly with a hyperplane - there may be overlap.
  • Maximal Margin Classifiers are extremely sensitive to outliers, for example:

(Since the MMC simply sets the threshold value as the halfway point between class_a and class_b)

Introducing the Support Vector Classifier

  • The idea is to use a soft margin instead of a hard margin to separate the data.
  • This divider is called the ‘threshold’, and observations are classified based on which side of the threshold they fall on.
  • Perhaps the simplest idea for segregating data, and yet there is some surprisingly deep math behind it.
  • This is a prerequisite for Support Vector Machines

↳ What is a soft margin?

Soft margin

A soft margin is a hyperplane that allows some misclassification by introducing slack variables that allow some data points to be on the wrong side of the margin.
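For reference, this is usually written as the soft-margin optimization problem, with slack variables $\xi_i$ measuring each violation and a penalty $C$ controlling how much misclassification is tolerated:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
  y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0 \;\; \text{for all } i
```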



For each of the following scenarios of dummy data, explain which out of MMC, SVC and SVM would be the best model to use.


a) A Maximal Margin Classifier

☞ Because the data is clean and linearly separable.

b) A Support Vector Classifier / Soft Margin

☞ Because the data contains outliers which would throw off a MMC.

c) A Support Vector Machine

☞ Because the data cannot be linearly separated. Read on to the next section, Support Vector Machines, to learn how…



How do you find the maximum margin mathematically?

Convex Optimization

In the background, SVCs solve a convex optimization problem.

To answer this we first need to introduce some notation. This type of notation will come in handy later in deep learning.

  • First, let’s represent a single observation as a vector $\mathbf{x}$, known as a feature vector.
    • If your data had two features - weight and height - then your vector would contain all the features associated with one observation, e.g. $\mathbf{x}_i = (\text{weight}_i, \text{height}_i)$.
    • Therefore, when we say $\mathbf{x}_i \in \text{class}_1$, we mean “for all data points in class 1”.
  • $\mathbf{w}$ represents the weight vector, a vector which is perpendicular to the decision boundary (hyperplane).
    • The direction of $\mathbf{w}$ tells us the predicted class.
    • The magnitude of $\mathbf{w}$ helps determine the distance from the decision boundary to the closest data points (the support vectors).
  • $\mathbf{w}^{\top}\mathbf{x}$: this dot product (where $\mathbf{x}$ represents our input data point) is a way to measure how much the data point $\mathbf{x}$ points in the direction of the vector $\mathbf{w}$. In SVMs, it helps in determining which side of the decision boundary the data point lies on.
    • $\mathbf{w}^{\top}$ simply transposes the weight vector so that it can be used in a dot product, $\mathbf{w}^{\top}\mathbf{x}$, or just $\mathbf{w} \cdot \mathbf{x}$.
  • $b$: this is the bias term. It allows the decision boundary to be offset from the origin. If you think of the decision boundary as a seesaw, the bias allows you to move the fulcrum left or right to find the optimal balance point. (A worked numeric sketch follows this list.)
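Putting the notation together, a worked numeric sketch (hypothetical numbers):

```python
import numpy as np

w = np.array([1.0, 2.0])   # weight vector (perpendicular to the boundary)
b = -3.0                   # bias term (offsets the boundary from the origin)
x = np.array([2.0, 1.0])   # a single observation / feature vector

# w^T x + b = 1*2 + 2*1 - 3 = 1
print(w @ x + b)  # 1.0 -> positive, so x lies on the class-1 side
```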

Linear combination: now, at the core of maximal margin classifiers (and SVMs) is the linear combination $\mathbf{w}^{\top}\mathbf{x} + b$:

  1. $\mathbf{w}^{\top}\mathbf{x} + b = 0$ is the equation of the hyperplane.
    1. You should recognize that $\mathbf{w}^{\top}\mathbf{x} + b = 0$ is the equation of a hyperplane in an n-dimensional space.
    2. For example, $w_1 x_1 + w_2 x_2 + b = 0$ specifies a line in two dimensions.
  2. The support vectors $\mathbf{x}_{+}$ and $\mathbf{x}_{-}$ are such that $\mathbf{w}^{\top}\mathbf{x}_{+} + b = 1$ and $\mathbf{w}^{\top}\mathbf{x}_{-} + b = -1$.
  3. $\mathbf{w}^{\top}\mathbf{x}_i + b \ge 1$ for $\mathbf{x}_i$ in $\text{class}_1$, and $\mathbf{w}^{\top}\mathbf{x}_i + b \le -1$ for $\mathbf{x}_i$ in $\text{class}_2$.

These constraints ensure that the data points are not only classified correctly but are also as far away from the decision boundary as possible, enhancing the margin.

(diagram)


Now, how do we find the hyperplane? This is the objective function, or optimization problem:

$$\max_{\mathbf{w},\, b} \; \frac{2}{\lVert \mathbf{w} \rVert}$$

The SVM aims to find the optimal $\mathbf{w}$ and $b$ that maximize the margin between the two classes. The margin is $\frac{2}{\lVert \mathbf{w} \rVert}$, i.e. inversely proportional to the norm of $\mathbf{w}$, so maximizing it is equivalent to minimizing $\lVert \mathbf{w} \rVert$ (in practice $\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}$, which is convex and differentiable).

If you maximize this function, subject to the constraints:

  • $\mathbf{w}^{\top}\mathbf{x}_i + b \ge 1$ for $\mathbf{x}_i$ in $\text{class}_1$
  • $\mathbf{w}^{\top}\mathbf{x}_i + b \le -1$ for $\mathbf{x}_i$ in $\text{class}_2$

then you get values of $\mathbf{w}$ and bias $b$ which combine to define a decision boundary that separates the two classes with the largest possible margin.
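In practice a solver does this for you; a sketch using scikit-learn, where the fitted model's coef_ and intercept_ correspond to $\mathbf{w}$ and $b$ (data is hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data, labelled -1 / +1.
X = np.array([[1, 1], [2, 1], [1, 2],
              [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The solver returns the w and b that define the maximum-margin hyperplane.
w = clf.coef_[0]                 # weight vector
b = clf.intercept_[0]            # bias term
margin = 2 / np.linalg.norm(w)   # width of the margin, 2 / ||w||

print(w, b, margin)
```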


↳ Using the equation, how can you classify new observations?

Simply substitute your new observation $\mathbf{x}_{\text{new}}$ into $\mathbf{w}^{\top}\mathbf{x}_{\text{new}} + b$:

  • If $\mathbf{w}^{\top}\mathbf{x}_{\text{new}} + b > 0$, the new observation belongs to class 1.
  • If $\mathbf{w}^{\top}\mathbf{x}_{\text{new}} + b < 0$, the new observation belongs to class 2.
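A minimal sketch of this decision rule, with hypothetical $\mathbf{w}$ and $b$:

```python
import numpy as np

# Hypothetical fitted weights and bias.
w, b = np.array([0.5, 0.5]), -5.0
x_new = np.array([7.0, 6.0])

score = w @ x_new + b          # w^T x_new + b
label = 1 if score > 0 else 2  # positive side -> class 1
print(score, label)            # 1.5 -> class 1
```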


How do you mathematically find the soft margin?

The goal is to find the best separating hyperplane that maximizes the margin between two classes while keeping the classification error as low as possible.

You should use cross validation to decide how many misclassifications and observations inside the soft margin to allow - in other words, to tune the penalty on the slack variables.
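A sketch of that cross validation with scikit-learn's GridSearchCV, tuning the slack penalty C on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical data: two noisy, overlapping clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Cross-validate over the slack penalty C: small C = softer margin
# (more misclassifications tolerated), large C = harder margin.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # the C that performed best across the folds
```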