
Head-Driven Statistical Models for Natural Language Processing

In his dissertation, Collins builds on the idea that the structure of a parser's output should be decoupled from the probabilistic model used to generate it. In particular, Collins presents a history-based parser that decomposes parse trees into a sequence of "decisions" that preserve specific linguistically motivated lexical and non-lexical dependencies. In Collins' "Model 2" parser, there are four decision types:

- **Start.** Choose the head word for the sentence.
- **Head projection.** Build the spine of a tree.
- **Subcategorization.** Generate a phrase's complements and adjuncts.
- **Dependency.** Choose the head word for a complement or an adjunct.
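These decision types can be made concrete with a toy probability calculation. In the sketch below, the decision tuples and probability values are entirely invented, and the representation is far simpler than Collins' actual parameterization; the point is only that a tree's probability is the product of one conditional probability per decision in the history.

```python
# Toy sketch of Model 2's history-based decomposition: a parse tree is
# scored as the product of its decisions' conditional probabilities.
# The decisions and numbers below are hypothetical, for the sentence
# "IBM bought Lotus" -- they are not Collins' actual parameter values.
decisions = [
    ("start",  ("S", "bought", "VBD"),              0.9),  # sentence head
    ("head",   ("VP", "S", "bought"),               0.8),  # head projection
    ("subcat", ("{NP-C}", "left", "VP", "bought"),  0.7),  # left complements
    ("subcat", ("{}", "right", "VP", "bought"),     0.7),  # right complements
    ("dep",    ("NP-C", "IBM", "left"),             0.6),  # left dependency
    ("dep",    ("NP", "Lotus", "right"),            0.6),  # right dependency
]

def tree_probability(decisions):
    """Multiply the conditional probability of each decision in sequence."""
    p = 1.0
    for _kind, _context, prob in decisions:
        p *= prob
    return p

print(round(tree_probability(decisions), 6))  # → 0.127008
```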

Although Collins describes his parser in terms of a history-based sequence of decisions, it can also be thought of as a complex tree transformation. In particular, the figure below gives an example showing how Collins' "Model 2" parser can be expressed as a transformation from the canonical Treebank-style encoding to a new encoding that introduces additional structure. Each node in this transformed tree corresponds to a decision variable in Collins' model. Under this transformed encoding, Collins' "Model 2" parser can be implemented as a PCFG. [1]
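The PCFG view can be sketched as follows: each node of the transformed tree becomes a nonterminal annotated with its head word and tag, and the tree's probability is the product of the probabilities of the rules applied at its internal nodes. The grammar, sentence, and probabilities below are invented for illustration and omit the extra structure a faithful encoding would need.

```python
# Hypothetical lexicalized nonterminals for "IBM bought Lotus":
# each is (label, head word, head tag).
S   = ("S",   "bought", "VBD")
NP1 = ("NP",  "IBM",    "NNP")
VP  = ("VP",  "bought", "VBD")
V   = ("VBD", "bought", "VBD")
NP2 = ("NP",  "Lotus",  "NNP")

# PCFG over the transformed encoding (illustrative probabilities).
rules = {
    S:  {(NP1, VP): 0.5},
    VP: {(V, NP2): 0.4},
}

# A derivation is (nonterminal, list-of-subtrees); leaves have no subtrees.
tree = (S, [(NP1, []), (VP, [(V, []), (NP2, [])])])

def tree_prob(tree):
    """Product of the probability of the rule applied at each internal node."""
    nt, kids = tree
    if not kids:                       # preterminal: rewrites to its word w.p. 1
        return 1.0
    rhs = tuple(k[0] for k in kids)    # children's nonterminals form the RHS
    p = rules[nt][rhs]
    for k in kids:
        p *= tree_prob(k)
    return p

print(tree_prob(tree))  # → 0.2
```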

Collins argues that two particularly important criteria for deciding
how to decompose a structured problem are discriminative power and
compactness. The *discriminative power* of a decomposition
reflects whether its local subproblems' parameters include enough
contextual information to accurately choose the correct decision.
Collins points out that simple PCFGs fail in this respect, because
they are insensitive to lexical and structural contextual information
that is necessary to make correct local decisions. The
*compactness* of a decomposition measures the number of free
parameters that must be estimated. The more parameters a model has,
the more training data will be required to accurately train those
parameters. Thus, given two models with equal discriminative power,
we should prefer the more compact model.
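The compactness criterion can be illustrated with a back-of-the-envelope parameter count. The grammar sizes below are hypothetical and not drawn from Collins; they only show how naively lexicalizing every nonterminal blows up the parameter space, which is why decomposing rules into smaller decisions (with backoff) matters.

```python
# Illustrative parameter counts for binary rewrite rules.
nonterminals = 30      # distinct phrase labels (hypothetical)
vocab = 10_000         # distinct head words (hypothetical)

# Unlexicalized PCFG: P(child1 child2 | parent) -- one parameter per
# (parent, child1, child2) triple.
pcfg_params = nonterminals * nonterminals ** 2

# Naively lexicalized grammar: every nonterminal is paired with a head
# word, so each of the three slots has nonterminals * vocab values.
lexicalized_params = (nonterminals * vocab) * (nonterminals * vocab) ** 2

print(pcfg_params)         # → 27000
print(lexicalized_params)  # → 27000000000000000
```

The second count is hopeless to estimate directly from any realistic treebank, so a lexicalized model must trade some of that discriminative power back for compactness.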

In order to ensure that a model has sufficient discriminative power,
Collins suggests that the notion of *locality* should be used to
determine what the dependencies should be between local
subproblems. In particular, the decomposition should preserve
structural connections between any variables that fall within each
other's domain of locality.

[1] Collins' model makes use of linearly interpolated backoff to reduce the adverse effect of data sparsity. In order to accurately implement Collins' parser, the PCFG would need to implement these backoff methods, along with a number of additional transformations that have been glossed over here.
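The backoff mentioned in the footnote can be sketched as follows: an estimate from the most specific conditioning context is mixed with estimates from progressively coarser contexts, so that sparse contexts still receive sensible probabilities. The weights and estimates below are illustrative, not Collins' estimated values.

```python
def interpolate(estimates, lambdas):
    """Linearly interpolated backoff over three levels of context.

    estimates: probabilities (e1, e2, e3) from increasingly coarse contexts.
    lambdas:   interpolation weights (l1, l2), each in [0, 1].
    Returns l1*e1 + (1 - l1) * (l2*e2 + (1 - l2)*e3).
    """
    e1, e2, e3 = estimates
    l1, l2 = lambdas
    return l1 * e1 + (1 - l1) * (l2 * e2 + (1 - l2) * e3)

# The most specific context was never seen in training (e1 = 0), but the
# backed-off estimate is still nonzero (hypothetical numbers).
p = interpolate((0.0, 0.1, 0.05), (0.6, 0.5))
print(round(p, 3))  # → 0.03
```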