Depending on the details, and the mood and theoretical biases of the author, linear classifiers may be known as perceptrons, logistic regression units, or (single-layer) neural nets.
The varied terminology reflects the convergence of two lines of research. One (neural nets, perceptrons) attempts to model neurons in the human brain. The other (logistic regression) evolved from standard methods for linear regression.
Since real class boundaries are frequently not linear, these units are often combined into larger structures (e.g. multi-layer neural nets) that can model more complex boundaries.
You've probably seen linear regression or, at least, used a canned function that performs it. However, you may well not remember what the term refers to.
Linear regression is the process of fitting a function f to a set of (x,y) points. This is usually done by the "least squares" method: minimizing the sum of squared differences, i.e. the L2 norm. Because the error function is a simple differentiable function of the input, there is a well-known analytic solution for finding the minimum. (See the textbook or many other places.) There are also so-called "robust" methods that minimize the L1 norm (sum of absolute differences). These are a bit more complex but less sensitive to outliers (aka bad input values).
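For the one-variable case, the least-squares solution can be written down directly. Here is a minimal sketch (the function name fit_line is ours; in practice you'd call a library routine):

```python
# Closed-form least-squares fit of y = a*x + b to a set of (x, y) points.
# Minimizes the sum of squared differences between predicted and actual y.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((a*x + b - y)**2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b = y_mean - a * x_mean
    return a, b

# Points lying exactly on y = 2x + 1 recover a = 2, b = 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Robust (L1) fitting has no such closed form, which is part of why least squares remains the default.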
From Wikipedia
The basic set-up for a linear classifier is shown below. Look at only one class (e.g. pink for cat). Each input example generates a feature vector (x). The four pixel values x1,x2,x3,x4 are multiplied by the weights for that class (e.g. the pink cells) to produce a score for that class. A bias term (b) is added to produce the weighted sum w1x1+...+wnxn+b.
(from Andrej Karpathy cs231n course notes via Lana Lazebnik)
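The score computation in the figure is just a dot product plus a bias. A sketch, using the figure's toy numbers (four pixel values, one class's weights):

```python
# One class's score: weighted sum of the input features plus a bias term.

def class_score(x, w, b):
    """Compute w1*x1 + ... + wn*xn + b for one class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [56, 231, 24, 2]           # four pixel values from the toy example
w = [0.2, -0.5, 0.1, 2.0]      # this class's weights
b = 1.1
score = class_score(x, w, b)   # approximately -96.8
```

In a multi-class system, each class has its own weight row and bias, and the class with the highest score wins.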
In this toy example, the input features are individual pixel values. In a real working system, they would probably be something more sophisticated, such as SIFT features or the output of an early stage of neural processing. In a text classification task, one might use word or bigram probabilities.
To make a decision, we compare the weighted sum to a threshold. The bias term adjusts the output so that we can use a standard threshold: 0 for neural nets and 0.5 for logistic regression. (We'll assume we want 0 in what follows.) In situations where the bias term is notationally inconvenient, we can make it disappear (superficially) by introducing a new input feature x0 which is tied to the constant 1 and replacing the bias term with a new weight w0.
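This "bias trick" is easy to check numerically: prepending a constant feature x0 = 1 and treating b as the weight w0 gives exactly the same score. A small sketch:

```python
# The bias trick: fold the bias b into the weight vector by adding a
# constant input feature x0 = 1.

def score_with_bias(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def score_no_bias(x_aug, w_aug):
    # x_aug = [1] + x and w_aug = [b] + w: same value, no separate bias.
    return sum(wi * xi for wi, xi in zip(w_aug, x_aug))

x, w, b = [2.0, 4.0, 0.5], [0.5, -0.25, 1.0], 0.75
assert score_with_bias(x, w, b) == score_no_bias([1.0] + x, [b] + w)
```

The trick is purely notational, but it lets training rules treat the bias like any other weight.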
It's possible that our input data is "linearly separable," i.e. we can draw a line with all inputs of one type (e.g. cats) on one side and all inputs of the other type (non-cats) on the other. More often, there is some mixing of data points near the boundary, so the best linear boundary will leave some points on the wrong side.
We can make this process look more like a neuron or a circuit component by drawing it as follows:
(from Mark Hasegawa-Johnson, CS 440, Spring 2018)
Notice that the output is passed through the sign (sgn) function at the end. This gives us a single function f(x)=sgn(w1x1+...+wnxn+b) which returns one of two values (1 and -1) indicating our decision (e.g. cat or not cat). We'll follow the neural net terminology and call this final function the "activation function."
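The whole unit is then a two-line function: weighted sum, then sgn. A sketch (with one illustrative choice of weights):

```python
# A complete linear classification unit: weighted sum followed by the
# sign activation, returning +1 or -1.

def sgn(z):
    return 1 if z >= 0 else -1     # ties broken toward +1 by convention

def classify(x, w, b):
    """f(x) = sgn(w1*x1 + ... + wn*xn + b)"""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)

# With w = [1, 1] and b = -1.5, the unit computes logical AND on 0/1 inputs:
# both inputs must be 1 for the weighted sum to exceed the threshold.
classify([1, 1], [1, 1], -1.5)    # +1
classify([1, 0], [1, 1], -1.5)    # -1
```

The AND example shows why a single unit can only draw linear boundaries: it just thresholds one weighted sum.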
There are a number of standard choices for the activation function, including:
(from Mark Hasegawa-Johnson)
The logistic and tanh functions are essentially the same, except that tanh returns values between -1 and 1, where the logistic returns values between 0 and 1. Wikipedia has an excessively comprehensive list of alternatives.
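"Essentially the same" can be made exact: tanh(z) = 2·logistic(2z) − 1, so the two curves differ only by rescaling and shifting the output range. A quick numerical check:

```python
import math

def logistic(z):
    """The logistic (sigmoid) function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# tanh is a rescaled, shifted logistic: tanh(z) = 2*logistic(2z) - 1,
# which maps the (0, 1) output range onto (-1, 1).
for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(math.tanh(z) - (2 * logistic(2 * z) - 1)) < 1e-12
```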
The activation function does not affect the decision from our single, pre-trained unit. Rather, the choice has two less obvious consequences:
- It affects how the weights are adjusted during training (e.g. gradient-based training requires a differentiable activation).
- It affects what happens when units are combined into larger networks.
In naive Bayes, we feed a feature vector into a very similar equation (esp. if you think about everything in log space). What's different here?
In naive Bayes, we assumed that features were conditionally independent, given the class label. This may be a reasonable approximation for some sets of features, but it's frequently wrong enough to cause problems. For example, in a bag of words model, "San" and "Francisco" do not occur independently even if you know that the type of text is "restaurant reviews." Logistic regression (a natural upgrade for naive Bayes) allows us to handle features that may be correlated. For example, the individual weights for "San" and "Francisco" can be adjusted so that the pair's overall contribution is more like that of a single word.
So far, we've assumed that the weights w1,…,wn and the bias term were given to us by magic. In practice, they are learned from training data.
Suppose that we have a set T of training data pairs (x,y), where x is a feature vector and y is the (correct) class label. Our training algorithm looks like:
- Initialize weights
- For each training example (x,y), update the weights to improve the chances of classifying x correctly.
It is common to cycle through the training examples more than once. The training algorithm would then look like this, where an "epoch" is one pass through the training data:
- Initialize weights
- For each epoch
- For each training example (x,y), update the weights to improve the chances of classifying x correctly.
Weights can usually be initialized to some neutral value (e.g. zero). (There are exceptions involving hidden parallel units in neural nets.)
Datasets frequently have strong local correlations, because they have been formed by concatenating sets of data (e.g. documents) that each have a strong topic focus. So it's often best to process the individual training examples in a random order.
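Randomizing the order is a one-liner per epoch. A sketch (the example data is hypothetical):

```python
import random

# Shuffling once per epoch breaks up local correlations in the dataset,
# e.g. long runs of examples drawn from the same document or topic.

examples = [("doc1", 1), ("doc2", -1), ("doc3", 1)]   # hypothetical (x, y) pairs

random.shuffle(examples)        # in-place random reordering
for x, y in examples:
    pass                        # ... update weights for (x, y) ...
```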
Let's suppose that our unit is a classical perceptron, i.e. our activation function is the sign function. Then we can update the weights using the "perceptron learning rule". Suppose that x is a feature vector, y is the correct class label, and y' is the class label that was computed using our current weights. Then our update rule is:
- If y' = y, do nothing.
- Otherwise wi=wi+α∗y∗xi
The "learning rate" constant α controls how quickly the update process changes in response to new data.
There are several variants of this equation, e.g.
wi=wi+α∗(y−y′)∗xi
When trying to understand why they are equivalent, notice that the only possible values for y and y' are 1 and -1. When y' = y, the difference y-y' is 0, so the weights don't change. When they differ, y-y' = 2y, so this is the same update with learning rate 2α.
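Putting the pieces together, here is a minimal sketch of perceptron training with this update rule. The bias is folded in as w0 via a constant feature x0 = 1, and the toy dataset (positive iff x1 + x2 > 1) is our own invention:

```python
import random

def sgn(z):
    return 1 if z >= 0 else -1

def train_perceptron(data, n_features, alpha=1.0, epochs=100):
    """Train weights with the perceptron rule: on a mistake, wi += alpha*y*xi."""
    w = [0.0] * (n_features + 1)          # w[0] plays the role of the bias
    for _ in range(epochs):
        random.shuffle(data)              # randomize example order each epoch
        for x, y in data:
            x_aug = [1.0] + list(x)       # constant feature x0 = 1 for the bias
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x_aug)))
            if y_pred != y:               # update weights only on mistakes
                w = [wi + alpha * y * xi for wi, xi in zip(w, x_aug)]
    return w

# Linearly separable toy data: label is +1 exactly when x1 + x2 > 1.
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1),
        ([1, 1], 1), ([2, 0], 1), ([0, 2], 1)]
w = train_perceptron(data, n_features=2)
```

Because this data is linearly separable, the classic convergence theorem guarantees the loop stops making mistakes after finitely many updates, regardless of the shuffle order.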
With this update rule, our training procedure will converge if the training data is linearly separable. If it is not, the weights may oscillate indefinitely; decreasing α over time helps them settle on a reasonable approximate boundary.
Pictures illustrating how neural nets see the world. Some aspects of the world are reproduced well, but other things are seriously wrong.