Distributed Training Strategies for the Structured Perceptron
Multiclass/structured perceptron
Repeat until convergence:
- Consider training instances in order
- Compute update to weight vector w
- For the perceptron: move the weight vector towards the feature vector that would give the right answer for this instance (see the sketch after this list)
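A minimal sketch of that update loop, assuming a hypothetical feature function `phi(x, y)` and an argmax decoder `decode(x, w)` (these names are illustrative, not from the paper):

```python
import numpy as np

def perceptron_epoch(w, data, phi, decode):
    """One pass of the structured perceptron.

    w      : weight vector (numpy array)
    data   : list of (x, y_gold) training instances
    phi    : feature function mapping (x, y) -> feature vector
    decode : argmax decoder, decode(x, w) -> predicted structure
    """
    for x, y_gold in data:
        y_hat = decode(x, w)  # inference under the current weights
        if y_hat != y_gold:
            # move w toward the gold features and away from the predicted ones
            w = w + phi(x, y_gold) - phi(x, y_hat)
    return w
```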
Error bound: $k < \frac{R^2}{\gamma^2}$
where $k$ = number of training errors, $R$ = radius of the data, and $\gamma$ = margin of the best unit-norm separating weight vector.
So the perceptron will converge if the data is separable.
Why do we care about perceptron?
- Fast: no partition function & quick convergence
- Robust to approximations: still works with approximate inference.
- Sparsity with respect to unobserved features
- Easy to implement
- Close to state-of-the-art performance
Training is slow; we'd like to distribute it.
Batch learning:
- Exact learning via MapReduce [Chu et al. 07]: gradient independence
- Parameter mixtures
- Trade-off: resource consumption vs. exactness
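As a rough illustration of the gradient-independence idea (for a generic batch objective, not the perceptron itself): each shard's partial gradient can be computed independently and the contributions summed before a single centralized update. The `gradient_fn` below is a hypothetical per-shard gradient routine:

```python
import numpy as np

def distributed_batch_step(w, shards, gradient_fn, lr=0.1):
    """One exact batch gradient step computed shard-by-shard.

    Because the batch gradient is a sum over instances, each shard's
    contribution can be computed independently (the "map") and summed
    (the "reduce") before taking the single weight update.
    """
    total_grad = np.zeros_like(w)
    for shard in shards:                    # map: per-shard partial gradients
        total_grad += gradient_fn(w, shard)
    return w - lr * total_grad              # reduce, then one centralized step
```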
Prior Work
For distributed learning:
- Break the training set into pieces
- Run learning on those subsets of the data
- Average the weights that the individual models learned
- For Maxent:
- To work, this requires "stability" with respect to the parameters
- Good empirical results
For perceptrons, this naive averaging doesn't work well (a sketch of it appears below).
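A sketch of that naive parameter mixture, assuming a hypothetical `train_fn` that trains a full model on one shard and returns its weight vector:

```python
import numpy as np

def parameter_mixture(shards, train_fn):
    """Naive parameter mixture: train each shard independently to
    completion, then average the resulting weight vectors once."""
    weights = [train_fn(shard) for shard in shards]  # fully independent training runs
    return np.mean(weights, axis=0)                  # a single uniform average
```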
Iterative Parameter Mixtures
- Run a single epoch over each shard of the training set.
- Take the average
- Update the individual models
- Repeat
Will this converge, and is the number of training errors bounded? Yes, sort of; see the paper for details.
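A sketch of the iterative scheme, assuming an `epoch_fn(w, shard)` that runs one perceptron epoch over a shard starting from weights `w` (e.g., the update loop sketched earlier); the function names and the uniform mixing weights are assumptions for illustration:

```python
import numpy as np

def iterative_parameter_mixture(shards, w_init, epoch_fn, n_iters=10):
    """Iterative parameter mixing: every iteration, run one perceptron
    epoch per shard starting from the shared weights, average the
    results, and broadcast the average back to all shards."""
    w = w_init
    for _ in range(n_iters):
        per_shard = [epoch_fn(w.copy(), shard) for shard in shards]  # one epoch each
        w = np.mean(per_shard, axis=0)                               # mix (average)
    return w
```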
Experiments
NER: iterative parameter mixing reaches a very good score (the same as the single-machine model) very quickly.
Same for dependency parsing.
Asynchronous Perceptron
The weight vector lives at a shared location, which means lots of network traffic. You can get minor differences from single-system training, depending on how large the update delay is.
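A toy, single-process simulation of that delay effect: each update is computed against a read of the shared weights that may be a few updates stale. The `delay` parameter and helper names are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def asynchronous_perceptron(data, phi, decode, dim, delay=4):
    """Simulate asynchronous updates to a shared weight vector: predictions
    are made against weights that may be `delay` updates out of date,
    mimicking workers reading shared parameters over the network."""
    w = np.zeros(dim)
    history = [w.copy()]                  # past snapshots of the shared weights
    for x, y_gold in data:
        stale = history[max(0, len(history) - 1 - delay)]  # possibly stale read
        y_hat = decode(x, stale)
        if y_hat != y_gold:
            w = w + phi(x, y_gold) - phi(x, y_hat)         # write-back to shared weights
        history.append(w.copy())
    return w
```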