Head-Driven Statistical Models for Natural Language Processing

Michael Collins 1999

In his dissertation, Collins builds on the idea that the structure of a parser's output should be decoupled from the probabilistic model used to generate it. In particular, Collins presents a history-based parser that decomposes parse trees into a sequence of "decisions" that preserve specific linguistically motivated lexical and non-lexical dependencies. In Collins' "Model 2" parser, there are four decision types:

  • Start. Choose the head-word for the sentence.
  • Head projection. Build the spine of a tree.
  • Subcategorization. Generate a phrase's complements and adjuncts.
  • Dependency. Choose the head word for a complement or an adjunct.
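Under this decomposition, the probability of a full parse is the product of the conditional probabilities of the individual decisions. The toy sketch below (with made-up probabilities and a simplified decision encoding, not Collins' actual parameterization) illustrates the idea for a fragment of "IBM bought Lotus":

```python
# A hedged, toy illustration of Collins-style decomposition: the probability
# of a tree is the product of the probabilities of the decisions that
# generate it. All probabilities here are invented for illustration.
decisions = [
    ("start",  "S(bought)",              0.010),  # choose the sentence's head word
    ("head",   "VP(bought) under S",     0.700),  # project the head spine
    ("subcat", "left frame {NP-C}",      0.600),  # generate a subcat frame
    ("dep",    "NP(IBM) fills left NP-C", 0.050), # choose a complement's head word
    ("stop",   "left STOP",              0.900),  # close the frame
]

def tree_probability(decs):
    """Multiply the conditional probabilities of a decision sequence."""
    p = 1.0
    for _kind, _description, prob in decs:
        p *= prob
    return p
```

In the real model each decision is conditioned on its history; here that conditioning is collapsed into the toy probability values.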

Although Collins describes his parser in terms of a history-based sequence of decisions, it can also be thought of as a complex tree transformation. In particular, the figure below gives an example showing how Collins' "Model 2" parser can be expressed as a transformation from the canonical Treebank-style encoding to a new encoding that introduces additional structure. Each node in this transformed tree corresponds to a decision variable in Collins' model. Under this transformed encoding, Collins' "Model 2" parser can be implemented as a PCFG. [1]
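Once trees are re-encoded so that each node corresponds to one decision, training reduces to reading productions off the transformed trees and estimating their probabilities by relative frequency, exactly as for a plain PCFG. The sketch below (using simplified labels rather than Collins' annotated node labels) shows such a maximum-likelihood PCFG estimator over nested-tuple trees:

```python
from collections import Counter

def rules(tree):
    """Yield (parent_label, child_labels) productions from a nested-tuple tree,
    where a tree is (label, child, child, ...) and leaves are plain strings."""
    label, children = tree[0], tree[1:]
    if children:
        yield (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))
        for c in children:
            if isinstance(c, tuple):
                yield from rules(c)

def mle_pcfg(trees):
    """Estimate rule probabilities by relative frequency (maximum likelihood)."""
    counts = Counter(r for t in trees for r in rules(t))
    totals = Counter()
    for (lhs, _rhs), n in counts.items():
        totals[lhs] += n
    return {(lhs, rhs): n / totals[lhs] for (lhs, rhs), n in counts.items()}
```

In Collins' transformed encoding, the node labels would carry the headwords, subcategorization frames, and other annotations shown in the figure, so each production probability corresponds to one decision probability.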

Collins argues that two particularly important criteria for deciding how to decompose a structured problem are discriminative power and compactness. The discriminative power of a decomposition reflects whether the parameters of its local subproblems capture enough contextual information to reliably choose the correct decision. Collins points out that simple PCFGs fail in this respect, because they are insensitive to the lexical and structural context that is necessary to make correct local decisions. The compactness of a decomposition measures the number of free parameters that must be estimated. The more parameters a model has, the more training data will be required to estimate them accurately. Thus, given two models with equal discriminative power, we should prefer the more compact one.
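A back-of-the-envelope calculation shows why compactness matters. The sizes below are assumptions chosen for illustration, not figures from the thesis; they compare a plain binary-rule PCFG against a naive lexicalization that annotates each rule with two headwords:

```python
N = 30        # number of nonterminal labels (assumed for illustration)
V = 10_000    # vocabulary size (assumed for illustration)

plain_pcfg = N ** 3            # one parameter per binary rule A -> B C
# Naive lexicalization: annotate each rule with the head word and the
# modifier's head word (the parent shares its head child's word), so the
# parameter count is multiplied by V squared.
lexicalized = N ** 3 * V ** 2
```

Decomposing each lexicalized rule into separate head, subcategorization, and dependency decisions, as Collins does, keeps the lexical sensitivity while sharply reducing the number of parameters any single decision must condition on.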

In order to ensure that a model has sufficient discriminative power, Collins suggests that the notion of locality should determine the dependencies between local subproblems. In particular, the decomposition should preserve structural connections between any variables that fall within each other's domain of locality.


Collins' "Model 2" Parser as Tree Transformation. This figure illustrates how Collins' parser can be modelled using the notion of encoding transformation. The structure (a) shows the canonical parse tree encoding for a simple example sentence. The structure (b) shows an encoding of the same parse tree that reflects Collins' choice of decomposition for the parsing problem. The elliptical node "S(bought)" corresponds to the start decision, and consists of a phrase label ("S") and a head word ("bought"). The square white nodes correspond to head projection decisions; each contains the phrase label, headword, and parent's phrase label for a single constituent. The shaded nodes correspond to subcategorization decisions; each contains a phrase label, a parent phrase label, a headword, a direction, a distance metric, and a set of subcategorized arguments. The black circle nodes represent STOP tokens for the subcategorization frames. The octagonal white nodes correspond to dependency decisions, and select head words for complements and adjuncts.

[1] Collins' model makes use of linearly interpolated backoff to reduce the adverse effect of data sparsity. In order to accurately implement Collins' parser, the PCFG would need to implement these backoff methods, along with a number of additional transformations that have been glossed over here.
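The interpolation itself can be sketched as a weighted sum of estimates at decreasing levels of context specificity, with weights summing to one. The lambda values in the usage example are placeholders; Collins derives them from counts in the training data.

```python
import math

def interpolate(estimates, lambdas):
    """Linearly interpolate estimates p_1..p_n (most to least specific
    context) with weights lambda_1..lambda_n that sum to one."""
    assert math.isclose(sum(lambdas), 1.0)
    return sum(l * p for l, p in zip(lambdas, estimates))

# Example: three estimates of the same conditional probability, backed off
# from full context to sparser context, with placeholder weights.
# interpolate([0.5, 0.2, 0.1], [0.6, 0.3, 0.1])  # ≈ 0.37
```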