The Perceptron algorithm is one of the groundbreaking developments in Machine Learning, especially in Supervised Learning.

It was invented in 1958 by Frank Rosenblatt and is still used as the basis of very powerful algorithms such as Neural Networks.

The main idea behind it is a simple optimization problem. The following description is written for the Perceptron, but it says a lot about most other machine learning algorithms as well.

- Define your error function.
- Minimize your error function.

### Error Function

The function we want to minimize is

$$E(\mathbf{w}) = -\sum_{m \in M} \mathbf{w}^T \mathbf{x}_m y_m$$

where $M$ is the set of misclassified points (more on this below). To break it down, let's first take a look at the terms inside the sum. The first thing we should understand is the variable **w**.

### Weight Vector

The above-mentioned weight vector **w** – or scalar, depending on the dimensionality of our dataset – is everything.

**A weight vector decides how important a single data feature is.**

Imagine a data set of 10,000 runners, where we want to predict whether each runner is an amateur or a professional.

| Name | Leg Length (cm) | Lung Volume (cm^3) | Nationality | Hair Length | Shoe Size | Mobile Carrier | Amateur |
|------|-----------------|--------------------|-------------|-------------|-----------|----------------|---------|
| Joe  | 104             | 32                 | US          | Medium      | 10        | AT&T           | No      |
| …    | …               | …                  | …           | …           | …         | …              | …       |
| Mike | 100             | 27                 | UK          | Long        | 11        | Verizon        | Yes     |

Going back to the Perceptron and its math: each of the first 7 columns represents a feature in our dataset, and the last column is the class label of that item.

Here, x1, x2, …, x7 each correspond to a feature in our data set. If we take the first item in the data set,

- x1 will be Joe
- x2 will be 104, etc.
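
One practical note: the Perceptron works on numbers, so categorical features such as Nationality or Mobile Carrier have to be encoded numerically first. A minimal sketch, with made-up category codes purely for illustration:

```python
# Hypothetical numeric codes for the categorical columns; the mappings are
# arbitrary illustration choices, not part of the original dataset.
nationality_codes = {"US": 0, "UK": 1}
hair_codes = {"Short": 0, "Medium": 1, "Long": 2}
carrier_codes = {"AT&T": 0, "Verizon": 1}

# Joe's row as a numeric feature vector (the name itself carries no
# predictive signal, so in practice it would be dropped or encoded too)
x_joe = [104, 32, nationality_codes["US"], hair_codes["Medium"], 10,
         carrier_codes["AT&T"]]
```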

To give a more intuitive feel for the weights: I suppose some of you have already figured out that the weight w2 (leg length) should be larger than the weight for the mobile carrier, because leg length obviously has a much greater effect on an athlete's performance.

### Prediction

Suppose we now have a weight vector – not necessarily the optimal one – and we feed the data for test object “Joe” into our Perceptron.

This is a very straightforward process, as one can imagine.

- Multiply each weight wi with its feature value xi
- Sum up the separate weighted feature values
- Use that sum as input to a non-linear function f(.)

But what is the non-linear function f we should use with the Perceptron?

Our final function – also called the predictor or activation function – must output +1 if the test object is a professional and -1 if an amateur. The function that does exactly that is the sign function: $f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x})$.
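
Sketched in Python, the whole prediction step looks roughly like this (the weight and feature values below are made up for illustration):

```python
import numpy as np

def predict(w, x):
    """Perceptron prediction: weighted sum followed by the sign function."""
    weighted_sum = np.dot(w, x)            # steps 1 and 2: multiply and sum
    return 1 if weighted_sum >= 0 else -1  # step 3: f(.) = sign(.)

# hypothetical weight vector and test point with 3 features
w = np.array([0.5, -0.2, 0.1])
x = np.array([104.0, 32.0, 10.0])
print(predict(w, x))  # +1 -> professional, -1 -> amateur
```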

But, given a random weight vector, f(x) does not always result in the correct classification.

### Going back to Error Function

We need to find a weight vector **w** so good that f(x) classifies every test object correctly. That is exactly what minimizing the error function $E(\mathbf{w}) = -\sum_{m \in M} \mathbf{w}^T \mathbf{x}_m y_m$ from above achieves, where:

- $y_m$ is the expected class (+1 or -1)
- $\mathbf{w}^T \mathbf{x}_m$ is our prediction before the sign function, where $\mathbf{w}$ is the vector containing w1, w2, …

Since $M$ is the set of misclassified points, $\mathbf{w}^T \mathbf{x}_m y_m$ can never be a positive value: if $y_m$ is +1, then $\mathbf{w}^T \mathbf{x}_m$ is negative, and vice versa. For the sake of simplicity, we negate the whole sum and get ourselves a nice positive value to minimize.
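
To make the formula concrete, here is a small sketch of E(w) in Python (assuming the DxN data layout used by the bonus code at the end):

```python
import numpy as np

def perceptron_error(w, X, Y):
    """E(w) = -sum of w^T x_m * y_m over the misclassified points M."""
    scores = w.dot(X)                 # w^T x for every data point
    M = np.sign(scores) != Y          # mask of misclassified points
    # each term scores * Y is negative on M, so the negated sum is >= 0
    return -np.sum(scores[M] * Y[M])
```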

Now that we have a function of $\mathbf{w}$ that we can minimize, we are almost done.

Let’s proceed to training our model…

#### Gradient Descent

- Initialize $\mathbf{w}_{old}$ – it could be a random value or 1/n
- While there is a misclassified data point:
  - pick a single “random” misclassified point $\mathbf{x}_m$
  - take one descent step at that point: $\mathbf{w}_{new} = \mathbf{w}_{old} + \eta\, \mathbf{x}_m y_m$ (the gradient of $E$ at $\mathbf{x}_m$ is $-\mathbf{x}_m y_m$, and we move against it)
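
Sketched in code, one iteration of that loop might look like this (assuming the DxN data layout used by the bonus code at the end, and labels in {-1, +1}):

```python
import numpy as np

def perceptron_step(w, X, Y, eta=0.1):
    """Pick a random misclassified point and take one descent step."""
    wrong = np.nonzero(np.sign(w.dot(X)) != Y)[0]  # misclassified indices
    if wrong.size == 0:
        return w, True                             # nothing left to fix
    m = np.random.choice(wrong)                    # a single "random" point x_m
    # the gradient of E at x_m is -x_m * y_m, so descending adds eta * x_m * y_m
    return w + eta * X[:, m] * Y[m], False
```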

### Learning Rate

The learning rate η mostly affects the convergence speed.

- With a fixed learning rate, convergence is only guaranteed for linearly separable sets; with a variable (decreasing) learning rate, however, convergence is proven for non-linearly-separable sets as well.
- A good convergence speed can be achieved with η(t) = 1/t.
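
For reference, the bonus code at the end of this post implements three such schedules as option 0/1/2; written out as plain functions:

```python
def eta_decreasing(eta0, t):  # option 0: eta(t) = eta0 / (1 + t)
    return eta0 / (1 + t)

def eta_constant(eta0, t):    # option 1: fixed step size
    return eta0

def eta_increasing(eta0, t):  # option 2: eta0 * (1 + t), mostly for comparison
    return eta0 * (1 + t)
```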

## Problems of Perceptron

- Each update might lead to new misclassifications.
- Many solutions might exist.
- Convergence might be slow.
- It only solves linearly separable problems.
- It does not give the same result every time, because $\mathbf{x}_m$ is chosen randomly.

## Tips and Notes

- After a single update, the error for that data point is always reduced.
- If there is a solution – i.e., if the set is linearly separable – the Perceptron always finds a solution.
- The Perceptron solves binary separation problems.

### Full Batch GD vs Stochastic GD

| Full Batch GD | Stochastic GD |
|---------------|---------------|
| Weight update is calculated with all training data | Weight update is the gradient of a single point |
| More stable | Faster |
| | State of the art |
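
Sketched as code, the two update styles differ like this (same DxN assumptions as before; the stochastic version assumes at least one misclassified point exists):

```python
import numpy as np

def batch_update(w, X, Y, eta):
    """Full batch GD: one step using ALL misclassified points."""
    wrong = np.nonzero(np.sign(w.dot(X)) != Y)[0]
    grad = -(X[:, wrong] * Y[wrong]).sum(axis=1)  # gradient of E(w)
    return w - eta * grad

def stochastic_update(w, X, Y, eta):
    """Stochastic GD: one step using a single random misclassified point."""
    wrong = np.nonzero(np.sign(w.dot(X)) != Y)[0]
    m = np.random.choice(wrong)
    return w + eta * X[:, m] * Y[m]
```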

## Bonus – Perceptron Code

```python
import numpy as np


def train_perceptron(X, Y, iterations=200, eta=.1, option=0):
    '''Trains a linear perceptron.

    Definition: w, b, acc = train_perceptron(X, Y, iterations=200, eta=.1)
    Input:  X          - DxN array of N data points with D features
            Y          - 1D array of length N of class labels {-1, 1}
            iterations - optional, number of iterations, default 200
            eta        - optional, learning rate, default 0.1
            option     - optional, defines how eta is updated in each iteration
    Output: w          - 1D array of length D, weight vector
            b          - bias term for linear classification
            acc        - 1D array of length iterations, classification accuracy
                         after each iteration
                         (accuracy = #correctly classified points / N)
    '''
    assert option in (0, 1, 2)
    np.random.seed(1)  # do not change
    acc = np.zeros(iterations)
    # include the bias term by adding a row of ones to X
    X = np.concatenate((np.ones((1, X.shape[1])), X))
    # initialize weight vector
    weights = np.ones(X.shape[0]) / X.shape[0]
    for it in np.arange(iterations):
        # indices of misclassified data points
        wrong = (np.sign(weights.dot(X)) != Y).nonzero()[0]
        # fraction of correctly classified points after this iteration
        acc[it] = (X.shape[1] - wrong.shape[0]) / X.shape[1]
        if wrong.shape[0] > 0:
            # pick a single random misclassified data point
            point = wrong[np.random.randint(0, wrong.shape[0], 1)]
            xm = X.T[point][0]
            ym = Y[point]
            # update the weight vector with the chosen learning-rate schedule
            if option == 0:
                new_eta = eta / (1 + it)   # decreasing learning rate
            elif option == 1:
                new_eta = eta              # constant learning rate
            else:
                new_eta = eta * (1 + it)   # increasing learning rate
            weights += new_eta * (xm * ym)
    b = -weights[0]
    w = weights[1:]
    # return weight vector, bias and accuracies
    return w, b, acc
```
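
To try it out, a quick hypothetical usage example on synthetic data (the two-blob toy set below is made up for illustration):

```python
# Toy data: two linearly separable Gaussian blobs, D=2 features, N=200 points
np.random.seed(0)
N = 100
X = np.hstack([np.random.randn(2, N) + 2, np.random.randn(2, N) - 2])
Y = np.hstack([np.ones(N), -np.ones(N)])

w, b, acc = train_perceptron(X, Y, iterations=200, eta=0.1, option=0)
print("final accuracy:", acc[-1])  # should be close to 1.0 on separable data
```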
