*
* Lecture notes by Edward Loper
*
* Course: CIS 530 (Intro to NLP)
* Professor: Steven Bird
* Institution: University of Pennsylvania
*
# http://www.cis.upenn.edu/~cis530
[10/09/00 01:48 PM]
> Chart Parsing
It's dynamic programming!
>> Always use the fundamental and initialization rules:
>>> Fundamental rule
# If a chart contains:
# i, j, X0: X1 \ldots \bullet Xi \ldots Xn
# j, k, Xi: Y1 \ldots Ym \bullet
# Then add:
# i, k, X0: X1 \ldots Xi \bullet \ldots Xn
i.e., if one arc wants to eat an Xi, and there's an Xi at the
appropriate place, then eat it.
Note that nothing is deleted from the chart
>>> Initialization rule (I_0)
Add edges for each word, according to a lexicon. For example, for the
word "the" at the beginning of a sentence, add:
# 0, 1, ART: the
>> Vary the rule invocation strategy
To do top-down or bottom-up, we just add new initialization and
recursion rules. I_{TD}, T. D. Rule, B. U. Rule\ldots
>>> Top-down
For top-down, run I_0, then run I_{TD}:
# For each rule:
# X \to X1 \ldots Xn
# Where X is a category that can span the chart (S), add:
# 0, 0, X: \bullet X1 \ldots Xn
i.e., Add a zero-length edge at the beginning of the sentence, that
wants to consume a S.
Then run the fundamental rule and the T. D. rule:
# When adding an edge:
# i, j, X \to \ldots \bullet Y \ldots
# then, for each rule Y \to Y1 \ldots Yn, add:
# j, j, Y \to \bullet Y1 \ldots Yn
i.e., if we have an edge where we're trying to consume a Y, add
zero-length edges that can output a Y.
>>> Bottom-up
First, run I_0. There is no special I_{BU}.
Then run the fundamental rule and the B. U. rule:
# When adding an edge:
# i, j, Y: Y1 \ldots Ym \bullet
# then for every rule X \to Y X2 \ldots Xn, add:
# i, i, X: \bullet Y X2 \ldots Xn
i.e., if we just found a Y, then look for rules that consume Y's, and
add arcs for them.
>> Hybrids
>>> Type-sensitive parsing
You can do bottom-up parsing of one type of word (e.g., verbs) and
top-down parsing of everything else. Make words type sensitive.
E.g., replace bottom up with:
# When adding an edge:
# i, j, V: Y1 \ldots Ym \bullet
# then for every rule VP \to V X2 \ldots Xn, add:
# i, i, VP: \bullet V X2 \ldots Xn
And top-down with:
# When adding an edge:
# i, j, X \to \ldots \bullet Y \ldots
# where Y \neq VP, for each rule Y \to Y1 \ldots Yn, add:
# j, j, Y \to \bullet Y1 \ldots Yn
(or something like that)
>> Completeness
If we change the rule strategy, can we be sure we'll get everything?
If there is a correct parse, are we guaranteed to get it?
> Improving Grammar Coverage
What about features? Add features to rules? E.g., allow something
like:
# X[\pm f]
and have rules like:
# X[\alpha f] \to \ldots Y[\alpha F] \ldots
> Partial Parsing
* chinks and chunks
* tag transitions (NN-VB closes an NP)
* cascade - sequence of strata
* use regexps directly on pos tags: NP = DT JJ* NN
[10/18/00 01:40 PM]
> Dealing With Words
>> Meaning
3 Types:
* referential (thing/event/etc)
* social (info about speaker)
* affective (info about speaker's attitudes/feelings)
>> Resources
>>> Wordnet
http://www.cogsci.princeton.edu/~wn/
has an API for C.
synsets related to hypernyms, hyponyms, etc.
meronym - part/whole (door is meronym of house).
holonym - whole/part (house is holonym of door).
Contains LDOCE, Collin's dictionary
>>> CELEX
http://www.kun.nl/celex/
pronunciation lexicon for english, dutch, german. gives pronunciation
(ipa) with stress, frequency, part of speech, etc.
>> Acquisition
How do we find the words to begin with?
child language acquisition: children need to find word boundries.
What techniques do they use? Use stress? ('puter for computer)
exploit distributional regularities (phonotactic constraints) of
words..
form word hypotheses, and minimize size(lexicon)+size(encoding of
sentences)
Build a trie, or letter tree, or prefix tree. looks like a huffman
encoding. (trie is from "retrieval")
http://hissa.nist.gov/dads/HTML/trie.html
>> Spelling Errors
>>> Soundex
A way to deal with mis-spellings of known words
1. retain the first letter of the name, and drop all occurances of a,
e, h, i, o, u, w, and y in other positions.
2. assign the following numbers to the remaining letters after the
first: bfpv\to1, cgjkqsxz\to2, dt\to3, mn\to5, r\to6
3. if two or more letters with the same code were adjacent in the
original name, omit all but the first.
4. convert to the form "letter, digit, digit, digit" by
stripping/padding (with zeros)
edward \to edrd \to e363
>>> Levenshtein Distance
A way to cluster similar words.. Gives a distance measure between any
two words.
Alignment: line them up to maximize matched letters:
# In--dustry
# Interest--
# in....st..
Trace:
# Industry
# -substitute D by T-
# -shift \ldots-
# \ldots
# Interest
count 1 for insertion/deletion, 2 for substitution:
# In--dustry
# Interest--
# 0011220011 \to Levenshtein distance = 8
Every string has k-nearest neighbors.
Levenshtein distance could be weighted.. some substitutions (eg. dt)
might be penalized less than others (eg. qr).
[10/23/00 01:39 PM]
> Statistical NLP
>> Motivation
Where might we want a stochastic grammar? i.e., a symbolic grammar
with probabilities assocaited with each rule?
* acquisition
* variation
* change
* adult monolingual speaker? must show evidence..
There are unusual word sequences. A broad coverage grammar must
assign structures -- we simply get too many structures for any decent
size sentence.. How do we represent which ones are better..?
Use a weighted grammar instead of parsing preferences.
>> Probability
Bayes theorem
# P(B|A)P(A)
# P(A|B) = --------------
# P(B)
>>> Example
# A = instance of construction
# B = program reports a hit
# P(A) = 0.001
# P(B|A) = 0.9
# P(B|~A) = 0.01
# P(B) \approx 0.0109
# P(A|B) = P(B|A)P(A)/P(B) \approx 0.0826
[10/25/00 01:27 PM]
Make best inference based on the available data and any prior
knowledge, and revise our position as new information comes to light.
Prior probabilities: P(hyp)
likelihood function: P(data|hyp) -- learn via experiments
Posterior probability: P(hyp|data) -- use bayes' theorem
Book: Sivia (1996) Data Analysis: A bayesian Tutorial (oxford
university press)
> Information Theory
>> The Noisy Channel Model
* Communicate messages over a channel
* Maximize throughput & accuracy in presence of noise
* Compression vs. accuracy
# w \to encoder -x\to channel -y\to decoder \to w'
>> Entropy
# H(p) = \sum p(x) log p(x)
Entropy can be thought of as the average length of a message needed to
transmit the outcome of some random variable.
Entropy of alphabet where P(p)=P(k)=P(u)=P(i)=1/8 and P(t)=P(a)=1/4:
# (+ (* 4 (/ 3.0 8)) (* 2 (/ 2.0 4))) = 2.5
>>> Relative Entropy
# D(p\|q) = \sum p(x) log(p(x)/q(x))
Consider the two probability distributions:
# p t k a u i
# q 1/8 1/4 1/8 1/4 1/8 1/8
# p 1/8 1/4 1/8 1/8 1/8 1/4
# D(p\|q) = (-\sum p(x)log q(x)) - (-\sum p(x)log p(x))
# = \sum p(x) log(p(x)/q(x))
In our example:
# D(p\|q) = 1/8 log(1/2) + 1/4 log(2)
# = -1/8 + 1/4 = 1/8
So the relative entropy D(p\|q) = 1/8. This is the extra message
length needed to transmit, on average.
Inefficiency of assuming that the distribution is q(x), when it's
really p(x).
>>> Cross Entropy
>>> Perplexity
perplexity = 2^entropy
> Markov Models (Ch. 9 Manning)
# p(xi | xi-1, xi-2, \ldots, x1, x0)
Unless we have a LOT of data, this is pretty sparse.
Sparse data problem.
So approximate:
# p(xi | xi-1, xi-2, \ldots, x1, x0)
# \approx p(xi) ; monogram model
# \approx p(xi | xi-1) ; bigram model
# \approx p(xi | xi-1, xi-2) ; trigram model
Bigram model can be trivially converted into a finite state model. A
markov model (i.e., FSM with probabilities on transitions).
Try using backoff: use higher order models where possible, lower order
models when necessary.
Assign probabilities to:
* nodes (chance of starting there)
* arcs: chance of transitioning to a state
* labels on arc: chance of taking each letter if you take that
transition.
# [0.1] ---\to [0.8]
# \uparrow | a: 0.4
# |0.7 1.0| b: 0.5
# | 0.3 \downarrow c: 0.1
# [0.1] ---\to [0.0]
[11/01/00 02:34 PM]
# a b c d e \epsilon \to b c d e f
# split \epsilon among extra chars (f)
Defining a Markov Model:
# S: {state} set of states
# K: {letter} output alphabet
# \Pi: S\to p Initial state probabilities
# A: S\times S\to p State transition probabilities
# B: S\times S\times K\to p Output letter probabilities
>> Finding P(observation)
Use a Trellis. (Forwards algorithm)
# t=0 t=1 t=2
# s1 0.8 0.126 0.17433
# s2 0.3 0.574 0.20993
P(ab)=0.3836
(+ (* .8 .3 .5) (* .2 .3 .1))
(+ (* .8 .7 .9) (* .2 .7 .5))
(+
(+ (* .126 .3 .5) (* .574 .3 .9))
(+ (* .574 .7 .5) (* .126 .7 .1)))
Draw Trellis with arrows/probabilities?
#
# S1 .8 ---\to.32 ---\to
# \backslash\_\_\slash \backslash\_\_\slash
# \slash \backslash \slash \backslash
# S2 .2 ---\to.13 ---\to
>> Find most likely path (Viterbi algorithm)
Simply find most likely partial path? Keep back pointer to its most
likely preceeding state..
Partial path probability:
# \delta_j(t+1) = max_{1\leq i\leq n} \delta(t)a_{ij}b_{ijo_t}
Note that a_{ij}b_{ijo_t} is probability of transitioning from i to j
with output o_t..
\delta keeps track of the probability of the path ending at the given
point..
Backpointers:
# \psi_j(t+1) = argmax_{1\leq i\leq N} \delta(t)a_{ij}b_{ijo_t}
(argmax means "give me the i for which the expression is maximized")
>> What paramaters A, B, \Pi maximize P for given observations?
(Expectation Maximization algorithm / Baum-Welsh algorithm)
Iteratively adjust A, B, \Pi to make training data more likely. Throw
a lot of training data at it, keep track of how many times we take
different arcs, adjust probabilities accordingly.. Hill-climbing.
> Collocations
We could simply look for frequent bigrams, but we get a lot of bigrams
like "of the" and "has been".. We could use a stop list, but the real
problem is that we want to deal with is whether they occur together
more often then chance: take into account the frequencies of the
individual words.
Consider the following graph:
# | B W |
# -+--------+----
# M| 5 5 | 10
# F| 15 15 | 30
# -+--------+----
# | 20 20 |
The probabilities in the squares are exactly what we expect. So
\chi^2=0.
Formula for expectation:
# row i total col j total
# E(i, j) = ------------ \times ----------- \times total
# total total
# E(i, j) = P(i)P(j) \times total
If we observe:
# | B W |
# -+--------+----
# M| 4 6 | 10
# F| 16 14 | 30
# -+--------+----
# | 20 20 |
Then the probabilities no longer match our expectations.
Find:
# (Aij - Eij)^2
# -------------
# Eij
gives:
# | B W
# -+----------
# M| 1/5 1/5
# F| 1/15 1/15
\chi^2 is simply the sum of these numbers..
[11/06/00 01:51 PM]
> Hand Crafted System Notes
>> Training & Testing
- sophistication of model
- representativeness of training data
- overfitting
- details of parameterization
>> Tradeoffs
- simplicity vs coverage (threshold effects)
>> Brittleness
- small changes in earlier parts of regexp can strongly affect later
parts.
- overlapping effects: chaotic to maintain
>> Not scalable
>> Easier to relate to theory
> Building Language Models
Take data generated by an unknown PDF, and make inferences about that
PDF.
1. divide data into equivlanace classes: sparse data problem
2. find estimators for equivelance classes
3. combine multiple estimators
Choice of classificatory features: we're basically dividing data into
"bins," where we assume that everything in the bin functions the
same.. e.g., when tagging, put all occurances of the same word in the
same bin for a 0th order tagger, or all occurances of the same word
with the same prior tag in the same bin for a 1st order tagger.
Strict n-gram modelsl have sparse data problems.. use probablistic
backoff? Use an HMM.
[11/09/00 01:35 PM]
> LDC
LDC maintains:
* data
* programs
* standards and best practices
Members have perpetual rights to each corpus released in the year in
which they join.
~8 full time and 30 part time transcribers.
LDC functions to:
* distribute/publish data
* create corpera
* research (best practices)
>> Switchboard Corpus
* telephone recordings
* 2430 conversations @ 6min each
* 3 million words, >500 speakers
* transcribed and word-aligned
>> ACE -- automatic content extraction
* Identify nominal entries in a news story
* Classify according to type
* Co-index mentions of a single entity within the story.
>> TDT
* newswire, broadcast radio, broadcast tv
[11/28/00 01:33 PM]
> Unisys: Natural Language Understanding
(Bharathi Palle)
- IVR: Interactive voice response
- ASR: Automatic speech recognizer
- TTS: Text to speech engines
- NL understanding engines
Types of speech applications:
- command and control
- dictation
- dialogue-driven (mixed initiative)
>> ASR
- feature extraction
- phoneme recognition (based on pre-defined acoustic model)
- phoneme-based HMMs/Viterbi search on features for words..
- use grammars to reduce ambiguity/search space
>>> ASR features
- hardware vs. software
- speaker dependant vs independant
- continuous vs isolated vs digit/spelling
- grammar formats
- language models (incl. land-line vs cell-phone)
- vocab size
- n-best
- programming interface
>> NLI
- normalize different inputs to a simple grammar, e.g. normalize "I
want info" to "GET_INFORMATION"
> Darpa Communicator (Lockheed Martin)
Apply research projects to DOD-type engineering applications.
Middle-man between researchers and govt..
MIT: Gallaxy II? TINA?
staged approach
> CL across govt, industry, & academia
Birth of "Language Engineering"
Plug-and-play
> Journals
- Computational Linguistics
- Natural Language Understanding
- Studies in NLP
[12/05/00 01:33 PM]
> Exam Topics
- Materials from the table in the course homepage
- lecture notes
- readings
- assignments
- (no python)
- (not Brent & Cartwright, articles in parens)
- Questions involving computation
- e.g.:
- Compute relative etropy of x wrt y
- Apply Baye's theorem
- Demonstrate Viterbi algorithm for a given HMM & output
- Use cosine measure, given doc. stats.
- Compute avg. precision score for ranked retrieval set
- Questions involving prose
- e.g.:
- Why is NLP hard?
- What is the sparse data problem & hos is it addressed in
language modeling?
- What are collocations, and why is bigram freq. a poor way to
find them?
- Sentence X contains a structural ambiguity -- draw tree
diagrams and discuss.
- Suppose you want to solve problem X -- how would you integrate
components we've talked about to solve the problem?
- Describe a lexical resource and discuss different ways it
could be used in NLP.
- How can linguistics benefit from statistical NLP?
- Questions Between
> Humanitarian applications of NLP
- ubiquitous computing (?)
- second language learning
- alternative input/HCI
- language documentation/preservation
- 52% languages spoken by <10,000 people
- 28% languages spoken by <1,000 people
- need to documet/preserve languages