Investigating Parser Performance on Discussion Forum Posts
- Manual tokenization & spelling correction
- Parse trees from Bikel parser
- difficult decisions documented
- Two passes
Baseline parse: Berkeley parser (split/merge). 5th order grammar. POS tagging is done by the grammar.
- untokenized w/ spelling errors: F=69.6
- gold tokenization w/ spelling errors: F=72.4 (+2.8) - missing apostrophes (eg didnt) cause issues
- gold tokenization w/ spelling corrected: F=74.75 (+2.35) - misspelled function words cause significant problems, eg "whpo"
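The notes do not say how F is computed. Assuming it is the usual labeled-bracket F-score (harmonic mean of bracket precision and recall), a minimal sketch for one sentence could look like this. The function name and the toy spans are made up for illustration; real evaluation tools apply further conventions (eg punctuation handling) that are left out here.

```python
from collections import Counter

def bracket_f1(gold, predicted):
    """Labeled-bracket precision, recall and F1 for one sentence.

    Each argument is a list of (label, start, end) constituent spans.
    """
    # Multiset intersection counts each bracket at most as often
    # as it appears in both the gold and the predicted tree.
    matched = sum((Counter(gold) & Counter(predicted)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example (hypothetical spans): one of four brackets is mislabeled.
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
pred = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("ADVP", 2, 3)]
p, r, f = bracket_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

With three of four brackets matching on both sides, precision, recall and F all come out at 0.75.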
Issues for the parser
- Subject ellipsis
- adjective/adverb switch (eg use "bad" rather than "badly")
- unseen acronyms (lol)
- abbreviations (cos=because)
- non-standard capitalization (eg all-caps, missing caps)
- unknown idioms (eg "spot on.")
- run-on sentences
- Fix caps, expand abbreviations, remove interjections (lol), etc.
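The clean-up steps in the last bullet (fix caps, expand abbreviations, remove interjections) could be sketched roughly as below. The lookup tables are hypothetical and contain only the examples from these notes ("cos", "lol"); the rules are deliberately crude and are not the procedure actually used in the work.

```python
# Hypothetical lookup tables; the notes only give "cos" and "lol" as examples.
ABBREVIATIONS = {"cos": "because"}
INTERJECTIONS = {"lol"}

def normalize(tokens):
    """Apply the three clean-up steps to an already tokenized sentence."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in INTERJECTIONS:
            continue  # remove interjections
        if low in ABBREVIATIONS:
            tok = ABBREVIATIONS[low]  # expand abbreviations
        elif tok.isupper() and len(tok) > 1:
            tok = low  # all-caps word -> lower case (also hits real acronyms)
        out.append(tok)
    # Restore a missing sentence-initial capital.
    if out and out[0][:1].islower():
        out[0] = out[0][0].upper() + out[0][1:]
    return out

print(normalize(["i", "left", "cos", "it", "was", "RUBBISH", "lol"]))
```

Lower-casing every all-caps token also flattens genuine acronyms, so a real pipeline would need a whitelist or a tagger to tell the two apart.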
Transform dev set: retrain the parser with modifications to portions of the training data:
- proper nouns uncapitalized
- -ly removed from adverbs
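A rough sketch of the two training-data modifications above, applied to a portion of the sentences. It assumes sentences are lists of (word, tag) pairs with Penn Treebank tags (NNP/NNPS for proper nouns, RB for adverbs). The function names, the `fraction` parameter and the suffix rule are all illustrative assumptions, not details from the notes.

```python
import random

def decapitalize_proper_nouns(tagged):
    """Lower-case words tagged as proper nouns; tags are left unchanged."""
    return [(w.lower() if t in ("NNP", "NNPS") else w, t) for w, t in tagged]

def strip_ly(tagged):
    """Drop a final "ly" from adverbs, eg badly -> bad.

    Crude suffix rule: it leaves short words such as "only" alone but
    mangles others (eg happily -> happi).
    """
    out = []
    for w, t in tagged:
        if t == "RB" and len(w) > 4 and w.lower().endswith("ly"):
            w = w[:-2]
        out.append((w, t))
    return out

def transform_portion(sentences, transform, fraction, seed=0):
    """Apply `transform` to roughly `fraction` of the sentences."""
    rng = random.Random(seed)
    return [transform(s) if rng.random() < fraction else s for s in sentences]

sent = [("John", "NNP"), ("sings", "VBZ"), ("badly", "RB")]
print(decapitalize_proper_nouns(sent))
print(strip_ly(sent))
```

Keeping the original tags means the retrained grammar sees forms such as "john" as NNP and "bad" as RB, which is the kind of mismatch the forum data shows.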
Try using a different corpus for training (eg Brown, Switchboard)
Try parsing with multiple grammars