# Introduction to Machine Learning – Perceptron Algorithm

The Perceptron algorithm is one of the groundbreaking developments in Machine Learning, especially in Supervised Learning.

It was invented in 1958 by Frank Rosenblatt and is still used as the basis of very powerful models such as Neural Networks.

The main idea behind it is a simple Optimization Problem. The following description belongs to Perceptrons, but it also tells a lot about most other machine learning algorithms.

### Error Function

To break it down, let’s first take a look at the inside of the parentheses. The first thing we should understand is the variable w.
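Written out explicitly (reconstructed here from the description given later in this section, where M denotes the set of misclassified points), the error function the Perceptron minimizes is:

```latex
E(W) = -\sum_{m \in M} y_m \left( W^{\top} x_m \right)
```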

### Weight Vector

The above-mentioned weight vector w – or scalar, depending on the dimensionality of our dataset – is everything.

A weight vector decides how important a single data feature is.

Imagine a dataset of 10,000 runners, where we want to predict whether each runner is an amateur or a professional.

Going back to the Perceptron and its math, the first 7 columns each represent a feature in our dataset, and the last column is the corresponding class of that item.

Here, x1, x2 … x7 each correspond to a feature in our dataset. If we take the first item in the dataset,

• x1 will be Joe
• x2 will be 104, etc.

To give a more intuitive approach to weights and perceptron,

I suppose some of you have already figured out that the weight value w2 (leg length) would be larger than the weight of an irrelevant feature such as the runner’s mobile carrier, because leg length obviously has a much bigger effect on an athlete’s performance.

### Prediction

Suppose we now have a weight vector, not necessarily the most optimal one, and we are inserting the data for test object “Joe” into our Perceptron.

This is a very straightforward process, as one can imagine.

1. Multiply each weight wi with the corresponding feature value xi
2. Sum up the separate weighted feature values
3. Use that value as input to a non-linear function f(.)
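The three steps above can be sketched in a few lines of NumPy. Note that the feature and weight values below are made-up placeholders, not values from the original dataset:

```python
import numpy as np

# Hypothetical numeric features x1..x7 for the test object "Joe" (assumed values)
x = np.array([1.0, 104.0, 0.9, 7.2, 3.1, 0.5, 2.0])
# Some (not necessarily optimal) weight vector w1..w7 (assumed values)
w = np.array([0.1, -0.2, 0.8, 0.05, 0.3, -0.1, 0.4])

# Steps 1-2: multiply each weight with its feature value and sum them up
weighted_sum = np.dot(w, x)

# Step 3: feed the sum into a non-linear function f, here the sign function
prediction = np.sign(weighted_sum)  # +1 = professional, -1 = amateur
print(prediction)
```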

But what is the non-linear function f we should use with the Perceptron?

For our binary problem it is the sign function: our final function – or predictor, activation function, etc. – must output +1 if the test object is a professional and -1 if an amateur.

But, given our random weight vector, f(x) does not always result in the correct classification.

### Going back to Error Function

We need a way to find a weight vector W – and hence an f(x) – such that every training object is classified correctly.

• ym is the expected class (+1 or -1)
• WTxm is our prediction, where W is the vector containing w1, w2, …

Since M is the set of misclassified points, the term WTxm·ym can never be positive: if ym is +1, then WTxm is negative, and vice versa. For the sake of simplicity, we negate the whole term and get ourselves a nice positive value to minimize.

Now that we have a function of W that we can minimize, we are almost done.

Let’s proceed to training our model…

1. Initialize wold – it could be a random value or 1/n
2. While there is a misclassified data point:
   • Pick a single “random” misclassified point xm
   • Take a step in the direction of the negative gradient at that single point
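For a single misclassified point xm, differentiating the error term -ym(WTxm) with respect to W gives the gradient, which turns the descent step above into the familiar Perceptron update:

```latex
\nabla_W E = -\, y_m x_m
\quad\Longrightarrow\quad
W_{\text{new}} = W_{\text{old}} + \eta \, y_m x_m
```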

### Learning Rate

The learning rate mainly affects the convergence speed.

• With a variable (decreasing) learning rate, convergence is proved for non-linearly separable sets as well.
• A good convergence speed can be achieved with η(t) = 1/t
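As a tiny illustration, the schedule η(t) = 1/t shrinks the step size with each iteration:

```python
# Learning-rate schedule eta(t) = 1/t for the first five iterations
etas = [1 / t for t in range(1, 6)]
print(etas)  # [1.0, 0.5, 0.3333333333333333, 0.25, 0.2]
```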

## Problems of Perceptron

• Each update might lead to new misclassifications.
• Many solutions might exist
• Convergence might be slow
• Only solves linearly separable problems
• Does not give the same result every time, because xm is chosen randomly

## Tips and Notes

• After a single update, the error on that data point is always reduced
• If a solution exists – i.e. the set is linearly separable – the Perceptron always finds one
• Perceptron solves Binary Separation Problems

## Bonus – Perceptron Code

```python
import numpy as np


def train_perceptron(X, Y, iterations=200, eta=.1, option=0):
    ''' Trains a linear perceptron
    Definition:  w, b, acc = train_perceptron(X, Y, iterations=200, eta=.1)
    Input:       X          -  DxN array of N data points with D features
                 Y          -  1D array of length N of class labels {-1, 1}
                 iterations -  optional, number of iterations, default 200
                 eta        -  optional, learning rate, default 0.1
                 option     -  optional, defines how eta is updated each iteration
    Output:      w          -  1D array of length D, weight vector
                 b          -  bias term for linear classification
                 acc        -  1D array of length iterations, classification
                               accuracy after each iteration
                               Accuracy = #correctly classified points / N
    '''
    assert option in (0, 1, 2)
    np.random.seed(1)  # do not change
    N = X.shape[1]
    acc = np.zeros(iterations)
    # include the bias term by adding a row of ones to X
    X = np.concatenate((np.ones((1, N)), X))
    # initialize weight vector (one weight per feature, plus the bias weight)
    weights = np.ones(X.shape[0]) / X.shape[0]
    for it in range(iterations):
        # indices of misclassified data points
        wrong = (np.sign(weights.dot(X)) != Y).nonzero()[0]
        # classification accuracy after this iteration
        acc[it] = (N - wrong.size) / N
        if wrong.size > 0:
            # pick a single random misclassified data point
            point = wrong[np.random.randint(wrong.size)]
            xm = X[:, point]
            ym = Y[point]
            # update weight vector (using different learning-rate schedules)
            if option == 0:
                new_eta = eta / (1 + it)  # decreasing learning rate
            elif option == 1:
                new_eta = eta             # constant learning rate
            else:
                new_eta = eta * (1 + it)  # increasing learning rate
            weights += new_eta * ym * xm
    b = weights[0]   # the weight of the constant-ones row acts as the bias
    w = weights[1:]
    # return weight vector, bias and accuracies
    return w, b, acc
```
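As a quick sanity check, here is a minimal, self-contained sketch of the same algorithm on toy data. The cluster centers, seed, and iteration cap are assumptions for illustration, not from the original:

```python
import numpy as np

np.random.seed(0)

# Toy data (assumed): two linearly separable 2D clusters,
# stored as columns of a D x N array, with labels {-1, +1}
N = 50
X = np.hstack((np.random.randn(2, N) + 3.0,   # class +1 cluster
               np.random.randn(2, N) - 3.0))  # class -1 cluster
Y = np.hstack((np.ones(N), -np.ones(N)))

# Minimal Perceptron loop with the decreasing schedule eta(t) = 1/t
w = np.zeros(2)
b = 0.0
for t in range(1, 201):
    wrong = np.nonzero(np.sign(X.T.dot(w) + b) != Y)[0]
    if wrong.size == 0:
        break  # every point classified correctly
    m = wrong[np.random.randint(wrong.size)]  # one random misclassified point
    eta = 1.0 / t
    w += eta * Y[m] * X[:, m]
    b += eta * Y[m]

accuracy = np.mean(np.sign(X.T.dot(w) + b) == Y)
print(accuracy)
```

On separable data like this, the loop typically stops early once every point is on the correct side of the line.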