Dominik Farhan

Perceptron with hidden bias

Intro

Let's discuss the perceptron and a trick that can be used to simplify (a bit) the analysis of many supervised ML algorithms.

So what is the perceptron?

It's probably the simplest binary classification algorithm. It tries to find a decision boundary that can be modeled as a $(D-1)$-dimensional hyperplane, where $D$ is the number of features in the input data. So we are basically cutting our space into two pieces. The class of a sample is then decided based on which part of the space the sample falls into.

A 2D example is in the following picture:

Notice how close one of the blue points is to the boundary. This is a characteristic property of the perceptron. It does not in any way optimize where the boundary is placed. All it does is divide the plane so that the training examples are properly classified. Because of that, you can get very skewed boundaries like the following:

It also won't work for data that aren't linearly separable:

It looks pretty useless, right? And yes, it's pretty weak, but it's also super simple. All you need for this algorithm is a vector of weights $\textbf{w}$ and a scalar variable $b$ called the bias. Let's define the possible classes for our data as $\{1, -1\}$. Then we can define the prediction for $\textbf{x}_i$ as follows: $y_i = \text{sign}(\textbf{x}_i^T \textbf{w} + b)$.
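As a quick illustration, here is a minimal NumPy sketch of this prediction rule (the function name `predict` and the example weights are my own choices, not taken from the article's code):

```python
import numpy as np

def predict(x, w, b):
    """Perceptron prediction: the sign of the score x^T w + b."""
    score = x @ w + b
    return 1 if score >= 0 else -1  # breaking the tie at 0 toward class 1

# Toy example with two features and hand-picked weights.
w = np.array([0.5, -1.0])
b = 0.2
print(predict(np.array([1.0, 0.3]), w, b))   # 1  (score = 0.4)
print(predict(np.array([0.0, 1.0]), w, b))   # -1 (score = -0.8)
```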

Observe that whenever $y_i$ is incorrect, the following holds: $t_i y_i < 0$, where $t_i$ is the true class of $\textbf{x}_i$. This is a good moment to pause and see that the inequality really works.
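If you want to sanity-check the condition in code, a tiny (purely illustrative) sketch:

```python
# t is the true label, y the prediction; both live in {1, -1}.
# The product t * y is negative exactly when they disagree.
for t, y in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(t, y, "misclassified" if t * y < 0 else "correct")
```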

With it in our toolbox we can define the loss as $L = \max(0, - t_i (\textbf{x}_i^T \textbf{w} + b))$, which is zero for correctly classified examples. We need to minimize this across all training examples. It's nice to notice that the loss can also be written as $L = \text{ReLU}(- t_i (\textbf{x}_i^T \textbf{w} + b))$. The training isn't much harder, as we'll see in a second. But before that, let's discuss hiding the bias.
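For completeness, the per-example loss in the same NumPy style (again just a sketch; `perceptron_loss` is a made-up helper name):

```python
import numpy as np

def perceptron_loss(x, w, b, t):
    """Per-example loss: max(0, -t * (x^T w + b)), zero when x is classified correctly."""
    return max(0.0, -t * (x @ w + b))

w = np.array([0.5, -1.0])
b = 0.2
print(perceptron_loss(np.array([1.0, 0.3]), w, b, 1))    # 0.0, correctly classified
print(perceptron_loss(np.array([1.0, 0.3]), w, b, -1))   # 0.4, misclassified
```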

Hiding bias

Hiding the bias is a neat trick that will simplify both the analysis and the update rule. We are going to hide the bias inside $\textbf{w}$, so that we will only need to derive an update rule for this one vector. Note that this method can be used with many other ML algorithms too.

Let's rewrite the dot product in the prediction as $\textbf{x}_i^T \textbf{w} + b = x_1 w_1 + x_2 w_2 + \dots + x_D w_D + b = x_1 w_1 + x_2 w_2 + \dots + x_D w_D + 1 \cdot b$.

How is this useful? Observe the last term, $1 \cdot b$. It is a product of two scalars, just like the preceding terms. This allows us to rewrite the whole thing as yet another dot product: $(x_1, x_2, \dots, x_D, 1) \cdot (w_1, w_2, \dots, w_D, b)$.

Suddenly $b$ has disappeared and we are dealing with new weight and $\textbf{x}$ vectors of dimensionality $D+1$. The prediction is now defined as $y_i = \text{sign}(\textbf{x}_i^T \textbf{w})$, and similarly for the loss.
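In code, hiding the bias is just augmenting the data with a constant column of ones and appending the bias to the weight vector; a small sketch of that idea (the variable names are my own):

```python
import numpy as np

X = np.array([[1.0, 0.3],
              [0.0, 1.0]])   # two samples, D = 2 features
w = np.array([0.5, -1.0])
b = 0.2

# Augment: each sample gets a constant feature 1, the bias becomes the last weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # shape (2, D + 1)
w_aug = np.append(w, b)                            # shape (D + 1,)

# The augmented dot product reproduces the original x^T w + b exactly.
print(X_aug @ w_aug)   # [ 0.4 -0.8]
print(X @ w + b)       # [ 0.4 -0.8]
```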

Putting it all together

With the bias hidden, there is no need to think much about updating it. We want to minimize $\max(0, - t_i \textbf{x}_i^T \textbf{w})$; with a bit of calculus, we arrive at the following update rule: $\textbf{w} \leftarrow \textbf{w} + t_i \textbf{x}_i$, which is applied whenever an example is classified incorrectly. For correctly classified examples we do nothing.
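A minimal training loop built around this rule might look like the following sketch (hidden bias, labels in $\{1, -1\}$; the epoch cap and the names are arbitrary choices on my part):

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """X: augmented inputs (each row ends with a constant 1), t: labels in {1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            if t_i * (x_i @ w) <= 0:      # misclassified (or on the boundary)
                w += t_i * x_i            # the update rule: w <- w + t_i * x_i
                errors += 1
        if errors == 0:                   # every training example is classified correctly
            break
    return w

# Toy linearly separable data, already augmented with a trailing 1.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(np.sign(X @ w))   # [ 1.  1. -1. -1.], matching t
```

If the data aren't linearly separable, the loop never reaches a pass with zero errors and simply stops at the epoch cap, which matches the limitation shown earlier.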

It's good to note that using derivatives to get the update rule is a bit of an overkill. I recommend trying this on paper (it should give better intuition than my explanation). Imagine what's happening. We have two vectors, $\textbf{w}$ and $\textbf{x}_i$, and in the prediction we are basically computing their dot product, whose sign is determined by the angle between them. The result is positive when the angle is less than $90^\circ$ and negative when it is more than $90^\circ$. Now, when there is an incorrectly classified example, the angle between $\textbf{w}$ and $\textbf{x}_i$ is either too small or too large. By adding $t_i \textbf{x}_i$ to the weight vector we are rotating it in a direction that is favorable to us. For $t_i = 1$ the angle gets smaller; for $t_i = -1$ the vectors drift further apart.

Sidenotes

If you're interested in the code that was used to generate the pictures in this article, you can look here.

If you found a mistake or want to share some thoughts on the perceptron, you can write to me here.

If you are looking for more on the perceptron, you may enjoy the following: