Presenter: Koby Crammer

Advanced Online Learning for Natural Language Processing (Tutorial)

ACL 2008b

Online learning: incrementally train a classifier (or other model) based on a stream of training examples

distinguish between the classifier and the training algorithm

in some cases, treat teacher as adversary -- tries to stump the 'student'


  • instances: x_t ∈ ℝ^n
  • labels: y_t ∈ {−1, +1}
  • predictions: ŷ_t
  • rounds: t = 1, 2, …

discrete prediction:

  • ŷ_t ∈ {−1, +1}

continuous prediction:

  • ŷ_t ∈ ℝ
    • Think of the sign as the label
    • Think of the abs value as the confidence

Loss functions

  • zero-one loss: 1 iff the answer is incorrect, 0 otherwise
  • hinge loss: max(0, 1 − y·f(x)) -- i.e., 0 if the answer is right with a margin of at least 1; otherwise it grows linearly as the margin shrinks.
  • exponential loss
  • log loss
  • etc.


  • initialize classifier
  • for each round t
    • receive input instance
    • output prediction
    • get feedback
    • compute loss
    • update prediction rule


  • minimize cumulative loss: ∑_{t=1}^{T} ℓ(y_t, ŷ_t)
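The protocol above is easy to state in code. A minimal sketch (names are illustrative, not from the tutorial), with a toy 1-d instantiation using the zero-one loss and a perceptron-style update:

```python
def online_learn(stream, w, predict, loss, update):
    """Run the online protocol: each round, receive x_t, output a
    prediction, get the true label as feedback, suffer loss, and
    update the prediction rule."""
    cumulative = 0.0
    for x, y in stream:
        y_hat = predict(w, x)         # output prediction
        cumulative += loss(y, y_hat)  # compute loss against feedback
        w = update(w, x, y)           # update prediction rule
    return w, cumulative

# Illustrative 1-d instantiation: sign classifier, zero-one loss,
# perceptron-style update.
predict = lambda w, x: 1 if w * x >= 0 else -1
loss = lambda y, y_hat: 0 if y == y_hat else 1
update = lambda w, x, y: w + y * x if predict(w, x) != y else w

stream = [(2.0, 1), (-1.0, -1), (3.0, 1)]
w, total = online_learn(stream, 0.0, predict, loss, update)  # total is 1.0
```

Note that only the loop is fixed; the predict/loss/update triple is what each algorithm in the tutorial varies.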

Linear classifiers

Define features x ∈ ℝ^n and weights w ∈ ℝ^n.

constrain: ‖w‖ = 1

prediction: ŷ = sign(w · x)

(Discussion of SVM-type stuff -- margins, etc.)

Why online learning?

  • Fast
  • Memory efficient - process one example at a time
  • Simple to implement
  • Formal guarantees – Mistake bounds
  • Online to Batch conversions
  • No statistical assumptions
  • Adaptive
  • but... not as good as a well-designed batch algorithm

Update rules: we want an algorithm that maps (w_t, x_t, y_t) ↦ w_{t+1}.

Using a weight-based model, we're mapping from ℝ^n × ℝ^n × {−1, +1} to ℝ^n.

(discussion of some simple update rules -- perceptron, etc)

Perceptron is basically minimizing the zero-one loss; but we can design variants that minimize other loss functions.

Problems with perceptron:

  • Margin of examples is ignored by update
  • Same update for separable case and inseparable case

Passive-Aggressive Approach

  • minimize a loss function (maximize margin)
  • basic algorithm does not work in the inseparable case
  • Enforce a minimal non-zero margin after the update (cf. perceptron, which does not guarantee a margin after the update)
  • In particular:
    • If y_t(w_t · x_t) ≥ 1, then do nothing
    • If y_t(w_t · x_t) < 1, update such that the margin after the update is enforced to be 1.

Define a dual space, to be the space of weight vectors. I.e., a given point in this "version space" identifies a single classifier.

For a given training sample, we can determine the subspace of this dual space that will classify it correctly. I.e., the set of all models that get that sample correct will be in that region. (We can enforce a margin of 1 by finding the subspace of this dual space that identifies the sample correctly with a margin of at least one.)

So we start with a point w_0 in the version space. Then, for each new training sample, we find the line 1 = y_t(w · x_t), which separates the space into classifier models that classify the training sample with an acceptable margin, and those that don't. For each new training sample, we project w_t onto this line.

Aggressive step update:

w_{t+1} = argmin_w ½‖w − w_t‖² such that y_t(w · x_t) ≥ 1

This can be solved analytically. If the constraint is already satisfied, then the solution is trivial: stay where we are. If the constraint is not satisfied, then we choose the closest point that does satisfy it. I.e., make the minimal change to the model that will allow us to get the right solution with an acceptable margin.

Comparing PA with the perceptron: in both cases we have w_{t+1} = w_t + τ_t y_t x_t. For the perceptron, τ_t is basically the zero-one loss function. For PA, τ_t is proportional to the hinge loss function (it's scaled by 1/‖x_t‖²).
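The two updates can be written side by side; this is a sketch assuming the standard closed forms (NumPy, illustrative names):

```python
import numpy as np

def perceptron_update(w, x, y):
    """Perceptron: the step size tau is the zero-one loss."""
    tau = 1.0 if y * np.dot(w, x) <= 0 else 0.0
    return w + tau * y * x

def pa_update(w, x, y):
    """Passive-aggressive: tau is the hinge loss scaled by 1/||x||^2,
    i.e. the projection of w onto the margin constraint."""
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    tau = hinge / np.dot(x, x)
    return w + tau * y * x

w = pa_update(np.zeros(2), np.array([1.0, 0.0]), 1)  # w is now [1, 0]
# A second PA update on the same example changes nothing:
# the margin is already exactly 1.
w = pa_update(w, np.array([1.0, 0.0]), 1)
```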

So there appears to be a tight connection between the algorithm update and the loss function we're optimizing on.

Inseparable case

Modify τ_t -- i.e., rather than projecting all the way to the part of the model space that gets the new example right, just project 'part way' there:

w_{t+1} = argmin_w ½‖w − w_t‖² + C·ξ such that y_t(w · x_t) ≥ 1 − ξ, ξ ≥ 0

Two ways we can go with this: penalize the slack linearly (C·ξ) or quadratically (C·ξ²).


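With a linear slack penalty the step size simply gets capped; a sketch (this is the variant usually called PA-I, names illustrative):

```python
import numpy as np

def pa_slack_update(w, x, y, C=1.0):
    """'Part way' projection with a linear slack penalty (PA-I in the
    literature): cap the step size at C so one noisy example cannot
    move w arbitrarily far."""
    hinge = max(0.0, 1.0 - y * np.dot(w, x))
    tau = min(C, hinge / np.dot(x, x))
    return w + tau * y * x

w = pa_slack_update(np.zeros(2), np.array([2.0, 0.0]), 1, C=0.1)
# The uncapped tau would be 0.25; C caps it at 0.1, so w is [0.2, 0].
```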

Relation to SVMs

This will basically converge to the same thing as SVMs if we re-iterate the examples repeatedly -- i.e., if we have memory.

Complex Learning Tasks

  • binary classification
  • n-way classification, no ordering or relation
  • n-way classification, ordered (eg rate movies 1-5)
  • n-way classification, hierarchical relations between labels
  • n-way classification, complex tree relation between labels
  • multi-class classification (either as ranking output, or set output)
  • sequence classification (eg np chunking)
  • sentence compression -- deleting selected words
  • alignment
  • complex structure outputs -- eg parse trees, feature structures, dependency parsing

For these tasks..

  • how do we want to evaluate? what's a good loss function?


  • we can't just eliminate the wrong answer
  • structure over labels
  • non-trivial loss functions
  • complex input/output relation
  • non-trivial output values
  • interactions between different sub-decisions
  • wide range of features

conventional approach: treat it as a labelling problem. I.e., we're mapping an input sequence x = (x_1, …, x_n) to an output sequence y = (y_1, …, y_m), where we might have a many-to-one relation between x's and y's. (This isn't general enough for all the problems we want to do, but it's a start.)

Outline of a Solution

We need:

  • a quantitative evaluation of predictions. I.e., a loss function. This will be application dependent.
  • a class of models. I.e., a set of labeling functions. This generalizes linear classification.
  • A learning algorithm. We'll look at extending both the perceptron and the passive-aggressive algorithms.

Loss functions

Examples: Hamming distance, Levenshtein distance, etc. -- a function ℓ(y, ŷ) ≥ 0 measuring how bad prediction ŷ is when the truth is y.
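Hamming distance over label sequences is the simplest of these; a minimal sketch:

```python
def hamming_loss(y_true, y_pred):
    """Hamming distance between two equal-length label sequences:
    the number of positions where they disagree."""
    assert len(y_true) == len(y_pred)
    return sum(1 for a, b in zip(y_true, y_pred) if a != b)

d = hamming_loss("BIOO", "BIIO")  # the sequences differ in one position
```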


For multiclass classification: we have k prototypes w_1, …, w_k, corresponding to the k labels. For each class r, compute w_r · x, and take the one that gives the highest score. This is basically dividing the space into hyper-cones (really more like hyper-pyramids), rather than separating it with a single hyperplane.
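The prototype rule is a one-liner (a sketch; the prototype values below are made up):

```python
import numpy as np

def predict_multiclass(W, x):
    """Score x against each of the k prototype rows of W and return
    the index of the highest-scoring class."""
    return int(np.argmax(W @ x))

W = np.array([[1.0, 0.0],     # prototype for class 0
              [0.0, 1.0],     # prototype for class 1
              [-1.0, -1.0]])  # prototype for class 2
label = predict_multiclass(W, np.array([0.2, 0.9]))  # class 1 scores highest
```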

Another alternative: define a feature vector F(x, y) over the entire input/output pair.


This is fairly powerful -- we're not restricting anything to be local, over the input vector or the output vector. Then, to label, compute the score w · F(x, y).

For prediction:

ŷ = argmax_y w · F(x, y)
Doing a naive search is expensive, since we have an exponential number of possible output values. So if we have no restrictions on F(x, y), then doing inference can be intractable. So we restrict the form of F(x, y).

Inference: project each F(x, y) onto the weight vector w, and choose the y that gives the max projection length.
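For a small candidate set, that projection view is directly implementable (a sketch; the toy feature map F below is hypothetical):

```python
import numpy as np

def structured_predict(w, x, candidates, F):
    """Naive inference: project each F(x, y) onto w and return the
    candidate with the longest projection (highest score). Feasible
    only for small candidate sets; real tasks need F to factor so
    search (e.g. Viterbi) can replace enumeration."""
    scores = [np.dot(w, F(x, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical toy F: indicator features over two candidate outputs.
F = lambda x, y: np.array([1.0, 0.0]) if y == "a" else np.array([0.0, 1.0])
best = structured_predict(np.array([1.0, 2.0]), None, ["a", "b"], F)  # "b"
```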

A good w gets things in the right order, and maintains large margins between the best answer and the next-best answers.

Using linear models. These are modular, in the sense that:

  • we have a factored scoring function
  • we can swap in different loss functions
  • we can use kernels over the features

Example: multiclass multilabel perceptron

  • input: x ∈ ℝ^n

  • output: ŷ = (w_1 · x, ..., w_k · x)

  • feedback: y ⊆ {1, ..., k} is a set of labels

  • loss: a function of R -- where R is the ranking of the labels in the output (e.g., whether any correct label is ranked below an incorrect one).

  • if the loss is non-zero, update the weight vectors. I.e., if it gets the example right, don't do anything.

  • Note: the output and the feedback are not the same space!


  • find the error-set: the set of (r, s) pairs where r ∈ y and s ∉ y and w_r · x ≤ w_s · x

  • define a matrix α whose rows are correct labels and columns are incorrect labels, such that:

    • α_{r,s} ≥ 0
    • if (r, s) is not in the error-set, then α_{r,s} = 0
    • ∑_{r,s} α_{r,s} = 1
  • Define an update rate τ_r for each label r, where:

    • if r ∈ y: τ_r = ∑_s α_{r,s}
    • if r ∉ y: τ_r = −∑_s α_{s,r}
  • the matrix α may be underspecified by the constraints; in this case, we can pick anything -- it will affect how we move through the space towards the right answer. Doing a uniform update seems to work better than doing a max update, but the max update is faster.
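One plausible reading of this recipe with the uniform choice of α, as a sketch (function names illustrative; presentations of this algorithm vary in the exact constraints on α):

```python
import numpy as np

def mmp_uniform_update(W, x, y_set):
    """One multiclass multilabel perceptron round with the uniform
    choice of alpha: spread the update mass equally over all violated
    (correct, incorrect) label pairs."""
    k = W.shape[0]
    scores = W @ x
    error_set = [(r, s) for r in range(k) for s in range(k)
                 if r in y_set and s not in y_set and scores[r] <= scores[s]]
    if not error_set:
        return W  # example ranked correctly: no update
    alpha = 1.0 / len(error_set)  # uniform alpha over the error-set
    tau = np.zeros(k)
    for r, s in error_set:
        tau[r] += alpha  # pull correct labels up
        tau[s] -= alpha  # push incorrect labels down
    return W + np.outer(tau, x)

W = mmp_uniform_update(np.zeros((2, 1)), np.array([1.0]), {0})
# Single violated pair (0, 1): w_0 moves toward +x, w_1 toward -x.
```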

Example: passive-aggressive for multiclass multilabel

Basically, we just want to add a margin to what we did with the perceptron version.

So rather than doing no update in the case where we got it right, we'll do no update only if there's a big enough margin between the right one and the wrong ones.

But we know that some mistakes are worse than others -- the loss function is not constant. So use the loss function as our margin.

again, if it's non-separable, we can add slack variables. Also, for certain types of features, it's unlikely that it will be non-separable. finally, we could decide to drop some constraints if necessary.

another advantage of dropping constraints: for complex outputs, we may have an exponential number of constraints. by dropping some, we can find something that works for the remaining ones... e.g., for each new input, run our model and see what we predict, and then take the constraints that say that the correct answer is better than our predicted answer.

see also: "loss augmented prediction"