
Distributed Training Strategies for the Structured Perceptron

Multiclass/structured perceptron

Repeat until convergence:

- Consider the training instances in order
- Compute an update to the weight vector w
- For the perceptron: move the weight vector toward the feature vector that would give the right answer for this instance
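The update loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the feature function `phi` and the decoder `argmax_y` are assumed to be given.

```python
# Minimal sketch of one structured-perceptron epoch.
# phi(x, y) -> sparse feature dict; argmax_y(w, x) -> best structure under w.
# All names here are illustrative assumptions, not from the paper.

def perceptron_epoch(data, w, phi, argmax_y):
    """One pass over (x, gold) pairs, updating the weight dict w in place."""
    for x, gold in data:
        pred = argmax_y(w, x)  # decode with the current weights
        if pred != gold:
            # move w toward the gold features, away from the predicted ones
            for f, v in phi(x, gold).items():
                w[f] = w.get(f, 0.0) + v
            for f, v in phi(x, pred).items():
                w[f] = w.get(f, 0.0) - v
    return w
```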

Mistake bound:

$k < \frac{R^2}{\gamma^2}$

where k = number of training errors, R = radius of the training data, $\gamma$ = margin of the best unit separating weight vector.

So the perceptron will converge if the data is separable.

Why do we care about perceptron?

- Fast: no partition function and quick convergence
- Robust to approximations: still works with approximate inference
- Sparsity with respect to unobserved features
- Easy to implement
- Close to state-of-the-art performance

Training is slow; we'd like to distribute it.

Batch learning:

- Exact learning via MapReduce [Chu et al. 07]: gradients are independent
- Parameter mixtures
- Trade-off: resource consumption vs. exactness

### Prior Work

For distributed learning:

- Break the training set into pieces
- Run learning on those subsets of the data
- Average the weights that the individual models learned
- For MaxEnt:
  - To work, this requires "stability" with respect to the parameters
  - Good empirical results
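The averaging step in this scheme is just a uniform mixture of the shard-trained weight vectors. A minimal sketch, with weights represented as sparse dicts:

```python
# Sketch of (one-shot) parameter mixing: train a model independently on
# each shard, then average the resulting weight vectors once.
# Representation as sparse dicts is an illustrative choice.

def average_weights(weight_vectors):
    """Uniform average of a list of sparse weight dicts."""
    n = len(weight_vectors)
    avg = {}
    for w in weight_vectors:
        for f, v in w.items():
            avg[f] = avg.get(f, 0.0) + v / n
    return avg
```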

For perceptrons, this one-shot averaging approach doesn't work well.

### Iterative Parameter Mixtures

- Run a single epoch over each shard of the training set.
- Take the average
- Update the individual models
- Repeat
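The loop above can be sketched as follows. The per-shard training routine `train_epoch` is assumed to be given (e.g., one perceptron pass over a shard); the averaging and broadcast are the point of the sketch.

```python
# Sketch of iterative parameter mixing: after every epoch, the shard
# models are averaged, and the average is sent back to all shards as
# the starting point for the next epoch. train_epoch(w, shard) -> w'
# is an assumed, user-supplied single-epoch trainer.

def iterative_param_mix(shards, train_epoch, num_epochs):
    w = {}                                   # shared starting weights
    for _ in range(num_epochs):
        # each shard runs one epoch from a copy of the current average
        shard_weights = [train_epoch(dict(w), shard) for shard in shards]
        # the uniform mixture becomes the next epoch's initial weights
        n = len(shard_weights)
        w = {}
        for sw in shard_weights:
            for f, v in sw.items():
                w[f] = w.get(f, 0.0) + v / n
    return w
```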

Will this converge, with a bound on training errors? Yes, with some caveats; see the paper for details.

### Experiments

NER: iterative parameter mixing reaches a very good score (matching the single-machine model) very quickly.

The same holds for dependency parsing.

### Asynchronous Perceptron

The weight vector lives at a shared location, which generates a lot of network traffic. Results can differ slightly from single-system training, depending on how delayed each worker's view of the weights is.
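The essence of the asynchronous setup can be sketched with a shared weight store: workers read a (possibly stale) snapshot, compute an update against it, and apply the update back. A lock keeps individual updates atomic; the staleness between read and write is what makes results differ from single-machine training. All names here are illustrative.

```python
# Sketch of a shared weight vector for asynchronous perceptron training.
# Workers call snapshot() to get a local (possibly stale) copy of the
# weights, decode against it, then send their update via apply_update().

import threading

class SharedWeights:
    def __init__(self):
        self.w = {}
        self.lock = threading.Lock()

    def snapshot(self):
        """Return a copy of the current weights; may be stale by the
        time the worker's update arrives."""
        with self.lock:
            return dict(self.w)

    def apply_update(self, delta):
        """Atomically add a sparse update computed by a worker."""
        with self.lock:
            for f, v in delta.items():
                self.w[f] = self.w.get(f, 0.0) + v
```

In the real system the shared location is remote, so every snapshot and update is network traffic, which is the cost noted above.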