Investigating Parser Performance on Discussion Forum Posts

Foster 2010

Jennifer Foster

Corpus Annotation

  • Manual tokenization & spelling correction
  • Parse trees from Bikel parser
  • difficult decisions documented
  • Two passes


Baseline parser: Berkeley parser (split/merge). 5th order grammar. POS tagging is done by the grammar.


  • untokenized w/ spelling errors: F=69.6
  • gold tokenization w/ spelling errors: F=72.4 (+2.8) - missing apostrophes (e.g. "didnt") cause issues
  • gold tokenization w/ spelling corrected: F=74.75 (+2.35) - misspelled function words cause significant problems, e.g. "whpo"
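
The notes do not say how F is computed. A minimal sketch, assuming the usual labeled bracketing F1 (harmonic mean of precision and recall over labeled constituent spans); the function name bracket_f1 and the toy spans are illustrative only and are not taken from the paper:

```python
from collections import Counter


def bracket_f1(gold, predicted):
    """Labeled bracketing F1 over (label, start, end) constituent spans.

    Assumption: this is the standard Parseval-style measure; the notes
    only report "F" without naming the exact metric.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a span is matched at most as often as it
    # occurs in both the gold tree and the predicted tree.
    matched = sum((gold_counts & pred_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy example: the parser gets one NP boundary wrong.
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
pred = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 3, 4)]
score = bracket_f1(gold, pred)
```

A single tokenization or spelling error can shift several span boundaries at once, which is one way the gaps between the three conditions above can arise.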

Issues for the parser

  • Subject ellipsis
  • adjective/adverb switch (eg use "bad" rather than "badly")
  • unseen acronyms (lol)
  • abbreviations (cos=because)
  • non-standard capitalization (eg all-caps, missing caps)
  • unknown idioms (eg "spot on.")
  • run-on sentences

Improving Performance

Transform test:

  • Fix caps, expand abbreviations, remove interjections (lol), etc.
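
The test-side transformation can be sketched as a token-level normalizer. This is a rough illustration, not the procedure used in the paper: the lookup tables are hypothetical and contain only the two examples mentioned in these notes ("cos" and "lol"), and the capitalization rule is a simple heuristic.

```python
# Hypothetical lookup tables; only "cos" and "lol" come from the notes.
ABBREVIATIONS = {"cos": "because"}
INTERJECTIONS = {"lol"}


def normalize(tokens):
    """Return a copy of tokens that looks more like edited newswire text."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in INTERJECTIONS:
            continue  # remove interjections such as "lol"
        if low in ABBREVIATIONS:
            tok = ABBREVIATIONS[low]  # expand abbreviations
        elif tok.isupper() and len(tok) > 1:
            # Heuristic: lowercase all-caps words. This also lowercases
            # genuine acronyms, which a real system would need to protect.
            tok = low
        out.append(tok)
    if out:
        # Restore a sentence-initial capital.
        out[0] = out[0][0].upper() + out[0][1:]
    return out


cleaned = normalize(["i", "LOVE", "it", "cos", "it", "works", "lol"])
```

Here cleaned is ["I", "love", "it", "because", "it", "works"]. The parser and grammar stay unchanged; only the input is moved closer to the training domain.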

Transform training set: retrain the parser after modifying portions of the training data:

  • proper nouns uncapitalized
  • -ly removed from adverbs
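
A minimal sketch of the two training-side modifications, assuming the training data is available as (word, tag) pairs with Penn Treebank tags (NNP/NNPS for proper nouns, RB for adverbs). The function name transform_sentence, the per-token rate parameter, and the naive suffix stripping are assumptions for illustration; the notes only say that "portions" of the training data were modified.

```python
import random


def transform_sentence(tagged, rng, rate=0.5):
    """Noise a tagged sentence so it resembles forum text.

    tagged: list of (word, tag) pairs; rng: a random.Random instance;
    rate: hypothetical probability of altering each eligible token.
    """
    out = []
    for word, tag in tagged:
        if tag in ("NNP", "NNPS") and rng.random() < rate:
            word = word.lower()  # proper noun loses its capital
        elif (tag == "RB" and word.endswith("ly") and len(word) > 4
              and rng.random() < rate):
            # Naive stripping: "badly" -> "bad", but "happily" -> "happi".
            word = word[:-2]
        out.append((word, tag))  # the gold tag is kept unchanged
    return out


rng = random.Random(0)
sentence = [("Mary", "NNP"), ("sings", "VBZ"), ("badly", "RB")]
noised = transform_sentence(sentence, rng, rate=1.0)
```

With rate=1.0 every eligible token is altered, giving [("mary", "NNP"), ("sings", "VBZ"), ("bad", "RB")]. Keeping the original tags means the retrained grammar sees the non-standard forms in their correct syntactic roles.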

Self-training/co-training experiments

Try using a different corpus for training (e.g. Brown, Switchboard)

Try parsing with multiple grammars