*
* Lecture notes by Edward Loper
*
* Course: CIS 630 (Machine Learning Seminar)
* Professor: Fernando Pereira
* Institution: University of Pennsylvania
*
> Logistics
Office hours: Wed before class
>> Project
write a summary/something that will stand the test of time;
or write a good implementation of something; or design a testing
framework for a domain; etc.
> Models for Machine Learning
>> Learning Tasks
 Classification (documents, configs)
 Segmentation/tagging/extraction
 Parsing
 Inducing representations (unsupervised)
First few classes: document classification.
>> Questions
 Generative or discriminative?
 Handling small sample sizes: the sparse data problem
 sequences: local or global methods?
 Does unsupervised learning help?
>> Generative Models
 Estimate p(x,y)
Easy to train; robust; probabilistic \to gives you a way to think about
it.
>> Discriminative Models
minimize \sum_i [[f(x_i) \neq y_i]]
i.e., try to build a function f predicting y given x, and minimize the
number of training errors.
Estimate p(y|x)
 Focus modeling resources on the instance-to-label mapping.
 Avoid restrictive (probabilistic) assumptions. No need for an
explicit model of the domain. In particular, no statistical
independence assumptions.
 Optimize what you care about
 Higher accuracy
>> Global Models
(typically EEs)
 Train to minimize labeling loss
\Theta = argmin_\theta \sum_i Loss(x[i], y[i] | \theta)
 Computing the best labeling:
argmin_y Loss(x,y | \Theta)
 Efficient minimization requires:
 A "common currency" for local labeling decisions: how to
decide about tradeoffs? Use probability, so we can compare
different things.
 Dynamic programming algorithm to combine local decisions
(Viterbi)
 principled
 can compose models
 efficient optimal decoding (usually)
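The dynamic-programming combination of local decisions can be sketched as a
standard Viterbi decoder (a generic sketch, not code from the course; the toy
HMM used below is invented):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state sequence for obs, via dynamic programming
    in log space (probabilities as the 'common currency')."""
    # best[t][s]: log-prob of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s]))
                 for p in states),
                key=lambda ps: ps[1])
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # follow backpointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```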
>> Local Models
(typically machine learning)
 Train to minimize per-symbol loss in context
\Theta = argmin_\theta \sum_i \sum_j Loss(y[i][j] | x[i], y[i][k]; \theta, k \neq j)
 Wider range of models
 more efficient training
 heuristic decoding is like pruning
[09/12/01 04:33 PM]
> Generative vs. Discriminative
Generative: generates instance-label pairs
 process structure
 process parameters: constrain/define the nondeterminism
How do you deduce the structure?
How do you estimate the parameters (from training data)?
>> model structure
 decomposes generation of instances into elementary steps. we don't
want to generate entire documents, they all have very low
probabilities. some model mapping btwn documents and smaller
steps.
 define dependencies between steps
 parameterize dependencies
Generating multiple features with an HMM: the problem is that the
features are not conditionally independent
>> Independence/intractability
 trees are good: each node has a single immediate ancestor, so joint
probability can be computed in linear time.
 but that forces features to be conditionally independent, given
the class.
 unrealistic that they're independent. e.g., "san" and "francisco"
>> Discriminative models
 p(y|x;\theta)
 binary classification: define discriminant:
y = sign h(x;\theta)
 train \theta to maximize p(training data), or minimize p(error)
[09/17/01 04:29 PM]
>> Classification Tasks
 Document classification: interested in ranking
 ranking function is important to final outcome..
 Tagging
 Syntactic decisions (e.g., attachment)
>> Document Models
# t = a term.
# C_t(d) = term frequency of t in d.
# |d| = total words in the document
# d = document
# D = document collection
>>> Binary Vector
Define a binary feature vector. Each term feature is either on or off
for a document. (present or absent)
# f_t(d) = t \in d
>>> Frequency Vector
Define x_t using TF*IDF\ldots
# x_t(d) = F_t(C_t(d)/|d|)
F_t is a TF*IDF weighting function:
# F_t(f) = log(f+1) log(1 + |D| / |{d \in D : t \in d}|)
Use logs to squash F_t, because of burstiness.. first occurrence is
meaningful, subsequent occurrences less so.
Very sparse representation. Use equiv. representations?
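A concrete sketch of the weighting above (one common variant; the exact
formula used in the lecture may differ slightly):

```python
import math

def tfidf(term, doc, docs):
    """Log-squashed TF*IDF: log(f+1) dampens burstiness, and the IDF
    factor rewards document-specific terms."""
    tf = doc.count(term)                  # C_t(d)
    df = sum(term in d for d in docs)     # |{d in D : t in d}|
    if df == 0:
        return 0.0
    return math.log(tf + 1) * math.log(1 + len(docs) / df)
```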
>>> Language Model
N-gram model
# p(d|c) = p(|d| \mid c) \prod_{i=1}^{|d|} p(d_i | d_1, d_2, \ldots, d_{i-1}, c)
# p(d_i | d_1, d_2, \ldots, d_{i-1}, c) \approx p(d_i | d_{i-n}, \ldots, d_{i-1}, c)
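A minimal class-conditional bigram model along these lines, with add-one
smoothing standing in for whatever smoothing a real system would use:

```python
import math
from collections import Counter

def train_bigram(docs):
    """Train a bigram model (n=2) on the documents of one class;
    returns a function computing log p(d|c) under that model."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for doc in docs:
        toks = ['<s>'] + doc
        vocab.update(toks)
        for prev, cur in zip(toks, toks[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    V = len(vocab)
    def logprob(doc):
        toks = ['<s>'] + doc
        return sum(math.log((bigrams[(p, c)] + 1) / (unigrams[p] + V))
                   for p, c in zip(toks, toks[1:]))
    return logprob
```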
>> Term Weighting and Feature Selection
We want the most informative features. Feature selection:
 remove low, unreliable counts
 mutual information
 information gain
 etc
TF*IDF tries to solve the same problem: adjust term weight by how
document-specific it is.
>> Kinds of Classifiers
>>> Generative
 Binary naive Bayes
 Multinomial naive Bayes (unigram)
 General classconditional language model (Ngram)
>>> Discriminative
 Binary features: exponential, boosting, winnow +
embedding into real vector space
 Real vectors: Rocchio, linear discriminant, SVM, ANNs
Real vector techniques are more general than binary features...
>> Learning Linear Classifiers
>>> Rocchio
Average vector representations for positive and negative classes.
# a = \alpha(\sum(x \in c) x_t)/|c|
# b = \beta(\sum(x \notin c) x_t)/(|D|-|c|)
# w_t = max(0, a - b)
On average, the negative examples overwhelm the positive ones in b.. so
it's resistant to using a subclass of the positive examples..
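A sketch of the Rocchio profile above; documents are sparse {term: weight}
dicts, and the alpha/beta defaults are illustrative, not from the lecture:

```python
def rocchio_weights(pos, neg, alpha=16, beta=4):
    """Per-term Rocchio weight w_t = max(0, a - b), where a and b are
    scaled averages over positive and negative documents."""
    terms = set()
    for doc in list(pos) + list(neg):
        terms.update(doc)
    w = {}
    for t in terms:
        a = alpha * sum(d.get(t, 0.0) for d in pos) / len(pos)
        b = beta * sum(d.get(t, 0.0) for d in neg) / len(neg)
        w[t] = max(0.0, a - b)    # negative evidence only clips, never flips
    return w
```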
>>> Widrow-Hoff
Iterative approach. Estimate the gradient, and use it to update our
weights. \eta is our learning rate. Move the model in the direction
of making the actual label agree with the predicted label.
# w_{i+1} = w_i - 2\eta(w_i \cdot x_i - y_i)x_i
y_i = actual label
w_i \cdot x_i = predicted label
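The update rule as code (a sketch; the eta value and epoch count are
arbitrary choices):

```python
def widrow_hoff(X, Y, eta=0.1, epochs=50):
    """LMS / Widrow-Hoff: w <- w - 2*eta*(w.x - y)*x for each example."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            pred = sum(wi * xi for wi, xi in zip(w, x))   # w . x
            w = [wi - 2 * eta * (pred - y) * xi for wi, xi in zip(w, x)]
    return w
```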
>>> (Balanced) winnow
Faster approach to iterative estimation of classifier weights.
Competing positive error and negative error.
Favors sparse representations: terms go to zero quickly, because it's
multiplicative rather than additive.
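A balanced-winnow sketch showing the multiplicative update (details such as
the promotion rate eta and the mistake-driven loop are assumptions):

```python
def winnow(X, Y, eta=2.0, epochs=10):
    """Balanced winnow: keep positive and negative weight vectors,
    and update multiplicatively on mistakes."""
    n = len(X[0])
    wp, wn = [1.0] * n, [1.0] * n
    for _ in range(epochs):
        for x, y in zip(X, Y):
            score = sum((p - q) * xi for p, q, xi in zip(wp, wn, x))
            pred = 1 if score >= 0 else -1
            if pred != y:
                for i, xi in enumerate(x):
                    if xi:                      # promote/demote active terms
                        wp[i] *= eta ** (y * xi)
                        wn[i] *= eta ** (-y * xi)
    return wp, wn
```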
> Naive Bayes (Anne)
>> Bayes Nets
Encodes dependence & independence relationships
Sparse representation of the entire PDF, given that there aren't too
many dependencies.
>> Using Bayes Nets for Classification
Simply compute P(C|X). But if X is high-dimensional, this is very
difficult to find. E.g., if X is a binary vector of whether each word
occurred (in document classification).
Use Bayes rule (including independence) to calculate backwards..
# P(c|x) = P(c)P(x|c)/P(x)
Naive Bayes: assume conditional independence between multiple
occurrences of a word (between multiple features).
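A minimal multinomial naive Bayes with add-one smoothing, applying Bayes
rule and the conditional-independence assumption (the toy data in the usage
is invented):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial naive Bayes: P(c|d) proportional to P(c) * prod P(w|c),
    treating word occurrences as independent given the class."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    vocab = {w for doc in docs for w in doc}
    def classify(doc):
        def score(c):   # log P(c) + sum_w log P(w|c), add-one smoothed
            total = sum(counts[c].values())
            return math.log(prior[c]) + sum(
                math.log((counts[c][w] + 1) / (total + len(vocab)))
                for w in doc)
        return max(classes, key=score)
    return classify
```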
Room for improvement:
 Can we use dependence information to improve effectiveness of
Naive Bayes classifiers?
 Modifications of feature sets?
 Better text representations?
>> McCallum & Nigam
Naive Bayes: just use presence/absence
Multinomial: use multiple occurrences..
Compare multivariate/multinomial..
Results:
 multivariate Bernoulli handles large vocab poorly
 multinomial event model more appropriate for classification with
large, overlapping vocabs
>> Sahami (sp?)
We want something between naive bayes and accounting for all
dependencies. Find mutual information between class and features.
Add features one at a time. Connect each new node to k of the nodes
that you've already added. (Pick the k with the highest mutual
information). Try k values of 0..3, and use a threshold: have > (?)
>>> Flat Classification
 one routine examines documents, classifies them
 large number of features (1000s)
 computationally expensive
 overfitting (& sparse data problems)
>>> Hierarchical Classification
Multi-tier classifier
1. select features (given the data)
2. supervised learning creates the classifier for each tier
Reduces both total number of features, and the number of features used
locally.
Hierarchical classification helps. :)
[09/19/01 04:37 PM]
> Documents vs. Vectors
 Many documents have the same binary/freq vector
 Document multiplicity must be handled correctly
 Multiplicity is not recoverable
 document probability
# p(d|c) = \prod_i p(d_i \mid c)
Do you mean the probability of a document given a class, or the
probability of a count vector given a class? If the latter, we must add
multinomial coefficients. (Add factorials)
# p(r|c) = P(L|c) L! \prod_i p(t_i|c)^{r_i}/r_i!
 Use Bayes rule for classification. When we cancel things, the
multinomial coefficients cancel.
[09/19/01 04:47 PM]
> Maximum Entropy Modeling
(Eugen Buehler)
>> Entropy and Perplexity
 Entropy:
# H(p) = -\sum p(x) lg p(x)
 Perplexity:
# 2^H(p)
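Both quantities in a few lines of code (base-2 logs, as in the definition):

```python
import math

def entropy(p):
    """H(p) = -sum p(x) lg p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def perplexity(p):
    """Perplexity = 2**H(p): the 'effective number of choices'."""
    return 2 ** entropy(p)
```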
>> ME
Given what we know, find the probability distribution that
maximizes entropy.
What we *do* know will reduce the entropy of the system
Choose the p dist that matches what we know, without assuming anything
we don't (which is done by maximizing entropy)
Assume a set of features:
# f_i: X \to {0,1}
Constrain the expectation of the feature function under the probability
model p:
# \sum_x p(x) f_i(x) = \sum_x \bar{p}(x) f_i(x) = (1/T) \sum_j f_i(x_j)
A unique solution exists, with exponential form:
# p(x) = (1/Z) exp(\sum_i \lambda_i f_i(x))
Z = normalization, \lambda_i = parameters
This is just the MLE for our training data (though it's not easy to
prove; see information geometry)
Conditional ME:
# p(y|x) = (1/Z(x)) exp(\sum_i \lambda_i f_i(x,y))
Another take on the MLE thing.. these are equivalent:
1. assuming we're exponential, how close can we get to the
constraints?
2. assuming we have a distribution with these constraints, how close
can we get to the uniform distribution?
>> Solving for lambdas
 generalized iterative scaling (GIS)
 improved iterative scaling (IIS)
no closed-form solution, so use hill-climbing techniques. The
hill-climbing technique you use can add constraints, like binary
features, features must sum to a constant, etc.
>> Building ME models
To build an ME model:
 phrase the problem as a prob dist
 design a set of relevant features (!)
>> Text Classification: Nigam et al.
 objective: Find p(c|d) (class given document)
 use one feature type: a weighted word count
# f_{w,c'}(d,c) = N(d,w)/N(d) if c = c', otherwise 0
 feature selection?
 one approach: include all features, let the model work it out
 but no feature selection is bad: it can result in overrating
very rare features. If xyz appears in 1 document in a corpus
of 100, ME will say that P(xyz)=1%. (this is basically a case of
overfitting)
 so if there's no feature selection, you need smoothing. Assume a
Gaussian distribution, centered on zero (no effect). Maximize
posterior probability, not P(training data). we can use
held-out data to estimate the variance of the Gaussian
 results:
 compared to multinomial naive bayes
 better on 2/3 tests. (not particularly impressive)
 no feature selection; could include more features for ME, that
are not available for naive bayes
>> PP Attachment: Ratnaparkhi, et al.
"I saw [a man in the park] [with a telescope]"
Reduce to {V=saw, N1=man, P=with, N2=telescope}, try to predict
whether N2 should be attached to N1 or V
Result of 0 if you attach to N1, 1 if you attach to V
Features have a value of 1 for noun attachment, 0 for verb
attachment. But that's ok. P(noun) = 1-P(verb).
ME doesn't assume independence of features
(but GIS, IIS converge faster the more independent they are)
Feature space = compositions of binary questions:
 about identity of tuple members
 about class of tuple members
Feature selection
 select best feature based on an estimated increase in
log-likelihood
 train new model
 add a special set of candidate features, related to the new feature
 repeat
Binary outcome conditional ME is equivalent to logistic regression.
So they should have compared to stepwise logistic regression (this
tests feature selection). stepwise logistic regression is basically
logistic regression with feature selection
>> Berger et al. and translation
 build ME as supplements to a French-English MT system
 try to find p(y|c), between languages, for a given word\ldots e.g.,
should we translate "in" as "dans" or "en" etc..
 feature sets: test for a given word (this is basically the a
priori probabilities.. e.g., in \to en 25% of the time). Also,
check for the immediately following word, immediately preceding word,
word x is in the 3 preceding words, word x is in the 3 following
words.
>>> Feature Selection
 A set of candidate features F
 An empirical distribution p
 A set of active features S (initially empty)
 The current model, p[s] (initially uniform, since S is empty)
 For all candidate features, find the parameters using IIS,
then compute gain in likelihood of training data.
 select that feature
 when a new feature does not improve performance on held-out data,
we're done
Problem: IIS is slow, so this training method is slooowww.. :)
>>> Estimating likelihood gain
Instead of calculating exact likelihood gain, estimate it:
 during IIS, keep all parameters equal to the original model, solve
only for the new parameters (i.e., assume independence)
 this makes it computationally feasible
[09/24/01 04:31 PM]
> Maximum Entropy Review
>> Conditional Maxent Model
 multiclass
 can use diff features for different classes
> Duality
 Maximize conditional log likelihood, given model form
 Maximize conditional entropy, subject to the constraints
> Relationship to (binary) logistic discrimination
If we reduce this to the binary case, then we have a logistic
regression problem. So maxent is a generalization of logistic
discrimination
> Relationship to Linear Discrimination
 Decision rule
# sign(log(p(+1|x)/p(-1|x))) =
# sign \sum[k] \lambda[k]g[k](x)
 Bias term: parameter for "always on" feature; allows the
discrimination to not go through the origin.
 Question: relationship to other trainers for linear discriminant
functions.
> Solution Techniques
>> Generalized Iterative Scaling
 parameter updates
 additive updates
 initial values? use zero. The more dependent the features, the
longer it takes to converge. If we start at zero, we will
converge eventually; if we start somewhere random, we might go in
circles if features are linearly dependent.
 requires that features add up to a constant independent of
instance or label: use a "slack feature"
>> Improved Iterative Scaling
 multiplicative updates
 for binary features reduces to solving a polynomial with positive
coefficients.
 Reduces to GIS if feature sum is constant
>> Another approach:
 use standard convex optimization techniques
 conjugate gradient, etc.
 converges faster?
> Gaussian Prior
 If we have a Gaussian prior, we can tweak IIS to update according
to variances.. (?)
> Representation
 fixed-size vs variable-size instances.
 multi-valued features
[09/24/01 05:07 PM]
> AdaBoost and Variants
Andrew
 Boosting: take several "weak" predictors and combine them to make
one "strong" predictor.
 "Weak" means only slightly better than random. We can use
stronger "weak" predictors, but we don't need to..
>> Weak learner
Consider a weak learner h:
# h : x \to {0,1}
>> Motivation
When we train a classifier, some training samples are "harder" than
others. One approach: take hard ones, duplicate them, and train
(places emphasis of learner on the hard observations).
Boosting is like this:
# 1. Train classifier h
# 2. Make copies of hard observations
# 3. Go to 1
At the end, combine all of these somehow.
>> Initial Observation weights
Initially, use uniform distribution:
# i = iteration
# m = number of observations
#
# Distribution D[1](i) = 1/m
Boosting loop
# For T iterations:
# Generate classifier h[i]
# Choose reweight term a[t]
# Calculate
# Update
>> Error bound on test data
 VC-dimension is a measure of the complexity of a hypothesis
space.
 We can put an upper bound on the probability of
misclassification.
Boosting seems to be resistant to overfitting. :)
>> Multiclass/MultiLabel
multiclass: ternary decision, etc.
multilabel: each observation can have a variable number of classes.
E.g., a document might have multiple document categories.
For multiclass, we can have one binary classifier for each class, and
put them back together afterwards.
Two views:
 we are concentrating on the decision boundary. This is a good
thing, cuz we get better classification.
 we are concentrating on outliers, and mangling our model to
accommodate them.
For labeling: if you get too much label noise, then the algorithms
start overfitting horribly.
[09/26/01 04:29 PM]
> Review of Boosting
Training instances:
# x[i] is training instance
# y[i] is label: {-1,1}
# (x[1],y[1]), \ldots, (x[m],y[m])
Start with uniform distribution:
# D1[i] = 1/m
For t = 1, \ldots, T:
 train weak learner using Dt
 get weak hypothesis h[t]: maps instances\to labels
 choose \alpha[t] (real)
 update the distribution:
# D[t+1][i] = D[t][i] e^(-\alpha[t] y[i] h[t](x[i])) / Z[t]
Where y[i] \in {-1,1} and h[t](x[i]) \in {-1,1}
# H(x) = sign(\sum[t] \alpha[t]h[t](x))
# \alpha[t] = 0.5 ln( (1-\epsilon[t])/\epsilon[t] )
We can bound our error by:
# \prod[t] Z[t]
# \epsilon < P[margin(x,y) \leq \theta] + \Theta(sqrt(d/(m\theta^2)))
(\Theta is order)
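The boosting loop above, sketched with threshold stumps as the weak learners
(the stump family and the round count T are assumptions):

```python
import math

def adaboost(X, Y, stumps, T=10):
    """AdaBoost: each round, pick the weak hypothesis (a stump mapping
    x -> {-1,+1}) with lowest weighted error, then reweight examples."""
    m = len(X)
    D = [1.0 / m] * m                         # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        errs = [sum(d for d, x, y in zip(D, X, Y) if h(x) != y)
                for h in stumps]
        eps, h = min(zip(errs, stumps), key=lambda pair: pair[0])
        if eps == 0:                          # perfect weak hypothesis
            ensemble.append((1.0, h))
            break
        if eps >= 0.5:                        # no better than random: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # reweight: misclassified examples gain mass
        D = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(D, X, Y)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):                                 # H(x) = sign(sum alpha_t h_t(x))
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```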
> SVM
[josh]
Look for a linear separating hyperplane. There are infinitely many such
planes. Which one should we use? We can write each hyperplane as a
linear combination of vectors (plus a const).
# f(x, \alpha) = (w[\alpha] \cdot x) + b
If we just pay attention to the minimum margin, we don't really care
about the margin of the points we classify well anyway.
support vector = one of the vectors that we're using to define our
hyperplane.. the distance from all of the support vectors to the
hyperplane is 1..
We can expand SVMs into additional dimensions, using a mapping
function. if we pick our mapping function carefully, then we can
avoid a lot of computation. For example, project (x1,x2) into three
dimensions:
# \Phi(x1,x2) = (x1^2, \sqrt{2} x1 x2, x2^2)
Then
# \Phi(u) \cdot \Phi(v) = (u \cdot v)^2
"Kernel" combines projecting & combining. So it behaves like inner
product, but it's acting via a higher dimension
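A quick check of that identity, with the standard explicit 3-dimensional
feature map for the quadratic kernel:

```python
import math

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-d inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_kernel(u, v):
    """phi(u) . phi(v), computed without ever forming phi."""
    return dot(u, v) ** 2
```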
[10/01/01 04:31 PM]
> SVM (continued)
Features are only referred to indirectly, via the support vectors.
This makes the machine less dependent on the number of features.
SVMs tend to yield high accuracy.
VC Dimension \to the fewer the support vectors, the smaller the VC
dimension. Prediction: accuracy will be higher if VC dimension is
smaller.
In SVM, the kernel allows the mechanism to access features that may
not be available elsewhere..
[10/01/01 04:43 PM]
> Solving LargeMargin Problems
>> Linear Classification
 Linear discriminant function
# h(x) = w \cdot x + b = \sum w[k]x[k] + b
>> Margin
 Instance margin: \gamma[i] = y[i](w*x[i] + b)
Either positive or negative
 Normalized (geometric) margin (positive/negative)
 Training set margin \gamma = min(geometric margins)
 Assume functional margin is fixed to one.
>> Why maximize the margin?
 \exists c s.t. for any data distribution D with support in a ball of
radius R and any training sample S of size N drawn from D..
>> Convex Optimization
 Constrained optimization problem

>> URLs
 www.kernelmachines.org
 www.supportvectors.net
[10/10/01 04:32 PM]
> Learning Theory\ldots
>> Statistical Learning Theory
Form: If the problem is in a given complexity class, then with high
probability, we can bound our error by some function of the number
of training examples.
But that doesn't tell us about how hard it is to do computationally:
finding the class with very low error may be intractable.
Statistical Learning Theory tells you what's possible, not what's
computationally feasible.
>> Definition of PAC
PAC = Probably Approximately Correct
Incorporates what's possible with what's computationally feasible.
# C: class of concepts
# concept \equiv X \to {0,1}
World chooses a concept for us, and a distribution over the data:
 c \in C
 D \subset X \times {0,1}
We could also define it such that there is a noise distribution that
corrupts labels.
# h \in C is a hypothesis
Then
# P(error(h) \leq \epsilon) \geq 1-\delta
where we pick \epsilon and \delta
\exists algorithm to find h, that is polynomial in (1/\epsilon)(1/\delta)
Most results from PAC are negative: we cannot do it.
book.. Kearns & Vazirani: An Introduction to Computational Learning Theory
> Using Unlabeled Data
 Labeling is expensive: manual
 Unlabeled instances are easy to find: web pages etc
 Unlabeled data is useful
 Joint pdfs of unlabeled data
 merge 2 views of 1 example
>> Basic approaches
 cotraining
 exploit 2 views
 combination of EM and NB classifier
 exploit joint PDF of unlabeled data (joint btwn features)
>> CoTraining
 task: find faculty member pages
 two training sets for labeled pages:
 text pointing to the document
 text inside the document
 labeled examples are expensive
 unlabeled pages are easy to get
 reduce necessary labeled data by using feedback btwn 2 views
Bootstrapping:
 train weak predictors A and B from training data
 use weak predictor A to find new training data for B; predictors
for B to find new training data for A
 repeat
Compatibility assumption:
 All labels on examples with nonzero probability under distribution
D are consistent with some target function f_i \in C_i, i=1,2,\ldots
 For any example x=(x_1, x_2) observed with label L:
f_1(x_1) = f_2(x_2) = L = f(x)
 D assigns probability zero if f_1(x_1) != f_2(x_2)
 In this case, (f_1,f_2) is compatible with D
 (C_1,C_2) is of high complexity, while compatible target concepts
might be much smaller.
Compatible concept = a concept with no cross-edges.
>>> Bipartite graph
We have 2 types of lines connecting the sides of the graph: labeled
instances and unlabeled instances. Propagate from labeled instances
to unlabeled ones. 2 issues:
 what if you can propagate from + to -? (contradiction)
 what if you can't propagate to an edge? (no label)
>>> Application: WSD
within a document, it is very likely that all instances of a given
word have the same sense. (well, kinda. verb/noun meanings of the
same word? etc.)
>>> PAC Analysis: Rote Learning
 Assume X_1=X_2=N, C_1=C_2=2^N, all partitions consistent with
D are possible.
 Output "I don't know" when you can't derive a label from
training/consistency.
 O((log N)/a) unlabeled examples are sufficient
A more robust approach: minimize an objective function that includes
the errors of each learner on its own training data, plus the
disagreement between the learners on unlabeled training data.
>> Text Classification using EM & unlabeled data
 Unlabeled data provides information about the joint PDF over words
 "homework" tends to belong to the positive class L
 Use this fact to estimate the classification of
unlabeled documents, and get a new positive class
L'
 L' gives us "lecture" (cascading effect)
Technique:
1. Train classifier with labeled documents
2. Use classifier to assign probabilistically-weighted class labels to
each unlabeled document by finding expectation of the missing
labels. (E)
3. Train a new classifier using the documents (M)
4. Repeat 2/3 (E/M) until convergence
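The four steps above as a sketch (multinomial naive Bayes with add-one MAP
estimates; the toy documents in the usage are invented, and a fixed iteration
count stands in for a real convergence check):

```python
import math

def em_nb(labeled, unlabeled, classes, vocab, iters=10):
    """EM around a naive Bayes classifier. labeled: (word-list, class)
    pairs; unlabeled: word-lists. E-step assigns probabilistic labels to
    unlabeled docs; M-step re-estimates word probabilities from all docs."""
    # soft class memberships: fixed for labeled docs, uniform to start
    soft = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    soft += [{c: 1.0 / len(classes) for c in classes} for _ in unlabeled]
    docs = [d for d, _ in labeled] + list(unlabeled)
    for _ in range(iters):
        # M-step: add-one (MAP) estimates, weighted by class membership
        pw = {c: {w: 1.0 for w in vocab} for c in classes}
        pc = {c: 1.0 for c in classes}
        for doc, s in zip(docs, soft):
            for c in classes:
                pc[c] += s[c]
                for w in doc:
                    pw[c][w] += s[c]
        for c in classes:
            total = sum(pw[c].values())
            pw[c] = {w: n / total for w, n in pw[c].items()}
        # E-step: recompute memberships of the unlabeled docs only
        for i in range(len(labeled), len(docs)):
            logp = {c: math.log(pc[c]) +
                       sum(math.log(pw[c][w]) for w in docs[i])
                    for c in classes}
            mx = max(logp.values())
            unnorm = {c: math.exp(lp - mx) for c, lp in logp.items()}
            z = sum(unnorm.values())
            soft[i] = {c: u / z for c, u in unnorm.items()}
    return pw, soft
```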
>> Generative Model
2 assumptions:
 document is produced by a mixture model
 one-to-one correspondence between mixture components and classes.
[10/15/01 04:26 PM]
EM Loop:
Expectation: use current classifier to estimate component membership
of each unlabeled document
Maximization: re-estimate the classifier, given the component
membership of each document. Use maximum a posteriori
probability estimation to find argmax_\theta P(D|\theta)P(\theta)
Helps more if we don't have enough labeled docs
>> Augmented EM
 Mixture components are not in correspondence with class labels.
 Give different weight to the unlabeled data
 Multiple mixture components per class
[10/15/01 04:47 PM]
> Expectation Maximization
web: Convexity, Maximum Likelihood and All That (Adam Berger)
>> Motivation
 Hidden (latent) variable models
# z = unobserved variables
# Creates a larger class of models to fit our data
# p(y,x,z | \Lambda)
Work with the marginal distribution:
# p(y,x | \Lambda) = \sum_z p(y,x,z | \Lambda)
topics as a hidden variable:
# class generates \to topic generates \to words
Examples:
 Mixture models
 Class-based models
 HMMs
generalize: E/M
>> Maximizing Likelihood
 Data loglikelihood
 D = {(x_1,y_1), \ldots, (x_n,y_n)}
 L(D|\Lambda) = \sum_i log p(x_i,y_i|\Lambda)
Find parameters that maximize (log)likelihood
Use a regularizing term to keep \Lambda closer to a prior distribution?
>> Convenient Lower Bounds
 Convex function
 Jensen's inequality
# f(\sum p(x)x) \leq \sum p(x)f(x)
# i.e., f(E(x)) \leq E(f(x))
(where f is convex, p is a pdf)
 Find a lower bound function that touches the function we want (at
p). Maximize that function to get p_max, and then repeat with p=p_max
 Better than gradient ascent, since we don't need to worry about
step size: since it's tangent, we're going up. since it's a lower
bound, we're still below. since it's p_max, the corresponding point is
higher than p. (no danger of having too large of a step size, like you
have with gradient ascent)
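A numeric sanity check of Jensen's inequality with the convex f(x) = x^2
(the pdf and points are invented):

```python
# Jensen's inequality: for convex f and a pdf p, f(E[x]) <= E[f(x)].
p = [0.2, 0.3, 0.5]                                  # a pdf
xs = [1.0, 2.0, 4.0]
f = lambda x: x * x                                  # a convex function

E_x = sum(pi * xi for pi, xi in zip(p, xs))          # E[x]
lhs = f(E_x)                                         # f(E[x])
rhs = sum(pi * f(xi) for pi, xi in zip(p, xs))       # E[f(x)]
```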
>> Auxiliary Function
 Find a convenient nonnegative function that lower-bounds the
likelihood increase.
 L(D|\Lambda') - L(D|\Lambda) \geq Q(\Lambda',\Lambda) \geq 0
 maximize lower bound
 \Lambda_{i+1} = argmax_{\Lambda'} Q(\Lambda',\Lambda)
# p(z|y) = p(y,z)/p(y)
so:
# p(z|y,x,\Lambda) = p(y,z,x|\Lambda)/p(y,x|\Lambda)
Start with log(\sum_i(\ldots\Lambda_i\ldots))
Convert to \sum(log(\ldots\Lambda_i\ldots))
now we can maximize for each \Lambda_i (since the \sum components are
independent: the derivative of a sum is the sum of the derivatives). So
maximize each log(\ldots\Lambda_i\ldots) independently.
>> Algorithm
\Lambda_0 \gets carefully chosen starting point
repeat to log-likelihood convergence:
 E step: compute Q(\Lambda',\Lambda_i)
 M step: \Lambda_{i+1} \gets argmax_{\Lambda'} Q(\Lambda',\Lambda_i)
>> Comments
 Likelihood keeps increasing but:
 can get stuck in a local maximum (or saddle point); doesn't
usually occur in practice.
 can oscillate between different local maxima with the
same loglikelihood
 If maximizing the aux function is too hard: find any \Lambda that
increases likelihood: generalized EM (GEM)
 Sum over hidden variable values can be exponential if we're not
careful.
>> Mixture Model
 base distributions: p_c(y)
 mixture coeffs: \lambda s
 p(c,y|\Lambda) = \lambda_c p_c(y)
 auxiliary function
 \sum_y \bar{p}(y) \sum_c p(c|y,\Lambda) log(p(y,c|\Lambda')/p(y,c|\Lambda))
to do soon:
 prepare cis630 lecture
 write problems for cis530 exam: tagging + mylecs
 fix line tokenizer, repl with '{\textbackslash}n{\textbackslash}n' tokenizer (re)
 fix tutorial, pset
 pick fscore cutoff
conventions:
 repr = standard repr; str = verbose repr (can be multiline)
pp = pretty print (usu. multiline  takes right/left args)
 exception use?
 type checking
 equality/ordering comparisons
 immutable \leftrightarrow hashable
[10/22/01 04:41 PM]
> Projects
>> Java Implementation
Do a Java implementation of some of these techniques that is:
 extensible
 easily modifiable
 etc.
 since we're using a higher level language, we can make things
simpler.
 more emphasis on speed than my project
 nearest neighbors, svm, winnow. NB
>> Text Classification & Nigam
Jean
 Given a set of classified documents, build a statistical
model p(c|d), the probability given a document that it belongs
to a class c
 Nigam et al. use one feature type: frequency of a word. These
add up to 1, which is very convenient.
 Results in as many as 57k features (no feature selection)
 No feature selection tends to create overfitting
 Address this in hindsight by saying that parameters should have a
Gaussian distribution.
 Maximize the posterior P rather than the P of training data
 Wider set of features: phrase counts
 N(p,d)/N_p(d)
 features sum to one.
 define phrases in complementary fashion? "computer science"
doesn't count as "computer" or "science" alone.
>> A Comparison of Text Classification Algorithms
Survey. Not much implementation.
Corpus: Hungarian newswire articles
 9k news articles, 9 channels
 keywords: 13.8k keywords, 33k occurrences
 task: assign a channel or keyword to new articles
>> Template Relations Task
 Task for MUC7
 TRs express domain-independent relationships between entities
 TR uses LOCATION\_OF, EMPLOYEE\_OF, PRODUCT\_OF.
 *Nance*, who is a paid consultant of *ABC News* \ldots
 Answer key contains entities for all organizations, persons,
artifacts that enter into these relations
 Training data: 500kb, 1k entities, 1k relations
 Most relations are local (e.g., appositive)
 Best results: 74% precision & recall
 Project
 Incorporate syntactic features (shallow parsing, or
XTAG supertags)
 Use discriminative classifier
>> Using Machine Learning in Anaphora Resolution
NaRe & Cassandra
 NLP system must provide "interpretation" for NP.
 Pronouns
 Use classifier: they are or are not coreferential
 If you get 0 for both or 1 for both, fail.
 But we really want ranking: competition between antecedents.
 Try to recast maxent as a ranking method.
 Rank using likelihoods
 Experimental results: less than spectacular
 Use Collins discriminative reranking ("Discriminative Reranking for
NLP" (2000))
>> Modelling author communities
Papers with text & citations. We know what year each paper is from.
General problem: see how different intellectual communities evolve
over time.
There's a bunch of hyperlink analysis etc to cluster points that you
can call communities..
People in the same community use similar language..
[10/25/01 04:37 PM]
>> Benchmark Comparison of the Aspect Model and Mixtures of Naive Bayes
! Andrew Schein
 with EM, use totally unlabeled data, see what the model gives..
 goal: model the probability distribution of a person reading
a document:
# P(p,d)
Find the probability that the person has read the document.
 Use these probabilities to recommend documents to read
 2 parameterizations. basically like using 2 distributions:
 mixture of naive bayes
 aspect model
 aspect model ("latent variable model")
 observation = (person, document)
 person associates with multiple classes
 assume that each observation is generated by a single class,
but one person has multiple classes (mixture model)
 mixture of naive bayes
 person belongs to a single class
 c.f. autoclass
 dataset:
 movielens: which people watched which movies
 ~1k people, each rating ~20 movies
 ~2k movies
[10/25/01 04:50 PM]
> Sequence Modeling
 Assign a labeling to a sequence
 story segmentation
 POS tagging
 shallow parsing
 named entities
 global models
 train to minimize overall labeling loss
 local models
 train to minimize per-symbol loss in context
 for each symbol, find best label given a hypothesized context.
 generative vs discriminative
> Information Extraction with HMMs and Shrinkage
 IE: automatic extraction of subsequences of text (e.g., extract
location or time of a meeting)
 apply shrinkage to HMMs
 Task:
 given a model & parameters, figure out sequence of states
 use viterbi
 use HMMs with topology set by hand. there are target states
(generate text we want to extract) and background states. (only
one target state, the rest are background states)
 shrinkage combines estimates with a weighted average and
learns the estimates with EM.
 shrinkage hierarchy configurations:
 none
 uniform: all distributions are shrunk towards uniform
 global: all target states & nontarget states are shrunk
toward a common parent
 hierarchical: some states are shrunk towards different
states
 local estimates calculated from ratios of counts
 find improved estimate for P(w|s_j)..
 estimating weights (use EM)
 initialize uniformly
 find degree to which each node predicts words
 derive improved weights
> Named Entity Recognition with HMMs
 Task: identify names, locations, etc.
 Labels: entities, times, numerics
 start with handbuilt network, model both names & locations
 find the most likely sequence of classes..viterbi
 2 level model.. high level HMM model, with states that have
bigram models inside them
 words are ordered pairs <word, f>, f = features: twoDigitNum,
fourDigitNum, otherNum, allCaps, capPeriod, firstWord, etc.
These allow us to deal with unseen data
 Results..
[10/29/01 05:33 PM]
> Maximum Entropy Markov Models
> and Conditional Random Fields
 Task: Extract question/answer pairs from a FAQ
 Task: Mining the web for research papers
 Information extraction with HMMs.
 P(s|s')
 P(o|s)
 Problems with HMMs:
 Want richer feature representation
 But can't have multiple overlapping features
 Naive Bayes doesn't work well
 would prefer conditional, not generative, model
Transform:
# Transitional HMM \to Maximum Entropy Markov Model
# P(ss') \to P(so,s')
# P(os)
Think of it as haing:
# P_{s'}(so) = P(so, s')
Each state contains a "next-state classifier" black box that, given
the next observation, will produce a PDF over next states.
This is a conditional PDF. We can't recover P(o|s).. We *must* start
with an observation, and only then can we predict probabilities.
Conditional model doesn't know the absolute distribution of outputs.
State transition probabilities based on overlapping features
Feature depends on observation and state:
# f_{b,s}(o_t, s_t) = 1 if b(o_t) is true and s_t = s
Exponential form:
# P(s|o,s') = 1/Z(o,s') exp(\sum_{b,s} \lambda_{b,s} f_{b,s}(o,s))
Note: we have a separate PDF for each s'. Thus, the notation:
# P_{s'}(s|o)
Do maxent training on each of these PDFs.
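Decoding with these per-state conditional distributions works like Viterbi, except each step multiplies in P(s|o,s') rather than P(s|s')P(o|s). A minimal sketch, where the probability tables (and the separate start-state table) are invented for illustration:

```python
# Viterbi decoding for a MEMM-style model: each previous state s' carries
# its own conditional distribution P(s | o, s').  Tables are toy data.

def viterbi(obs, states, p_start, p_trans):
    """p_start[s][o] = P(s|o, start); p_trans[sp][s][o] = P(s|o, sp)."""
    V = [{s: p_start[s].get(obs[0], 0.0) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best previous state for reaching s on observation o.
            bp = max(states, key=lambda sp: V[-1][sp] * p_trans[sp][s].get(o, 0.0))
            col[s] = V[-1][bp] * p_trans[bp][s].get(o, 0.0)
            ptr[s] = bp
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    s = max(states, key=lambda st: V[-1][st])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return list(reversed(path))
```

Note that every column is already normalized per previous state, which is exactly what gives rise to the label bias problem discussed below.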
Models tested:
 ME-stateless: classify each line independently with maxent
 Token-HMM: standard HMM generating tokens
 Feature-HMM: convert lines to sequences of features, then
generate them independently. I.e., naive bayes HMM with
overlapping line features
 MEMM: maximum entropy markov model
Smoothing? (e.g., for zero probability transitions)
>> Variation 2:
 Observations in states instead of transitions
 n^2 contexts (for n states): increased sparseness
 Do P(s|s',o) = P(s|s') * maxent..
>> Summary
 New probabilistic sequence model based on maxent
 arbitrary overlapping features
 conditional model
 positive results
>> Label Bias Problem in Conditional Sequence Models
Example:
Example (two paths sharing start state 0 and final state 5):
# 0\to1 r
# 1\to2 i
# 2\to5 b: rib
# 0\to3 r
# 3\to4 o
# 4\to5 b: rob
P(path|observations)
# P(1,2|ro) = P(1|r) P(2|o,1)
#           = P(1|r) * 1
#           = P(1|r) P(2|i,1)
#           = P(1,2|ri)
# P(2|o,1) = P(2,o,1)/P(o,1) = 0/0
Because 1\to2 is a forced choice.. So P(2|*,1)=1 for any *, since 2 is a
forced choice from state 1.
 Biases towards states with fewer outgoing transitions (esp
deterministic states)
 Per-state normalization does not allow the required property:
# score(1,2|ro) << score(1,2|ri)
Determinization:
 not always possible
 statespace explosion
Fully-connected models:
 lack prior structural knowledge
Their solution: conditional random fields
Suppose there is a graphical structure for Y.
# G = (V,E)
# Y = (Y_{1}, Y_{2}, ..., Y_{|V|})
Define:
# p(Y|X)
# X = input observations
Probability of a node depends on the entire input and the element
that points to it.
 With an HMM, we can only encode history of the input with expanded
states.
 With CRFs, a feature can depend on the entire input, so it can
encode something about the input history much more easily
 Try using conjugate gradient instead
[10/31/01 05:34 PM]
> Combining Models to Improve Tagging Performance
! Andy
>> Boosting Applied to Tagging and PP Attachment
! Abney, Schapire, Singer
>>> Boosting
 Train a series of weak learners h_t(x_i)
 At each iteration t, reweight training examples to emphasize the
hard examples.
 After training all T learners, build a final classifier:
H(x) = sign(\sum_t \alpha_t h_t(x))
 h_t are given weight according to their performance (\alpha_t)
 n.b. 2 weightings: one over h_t, the other over training examples
 Updating weights of observations:
D_{t+1}(i) = D_t(i) exp(-\alpha_t y_i h_t(x_i))/Z_t
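The loop above can be sketched end to end. This is a generic AdaBoost toy, not the paper's tagger: the weak learners are threshold "stumps" and the data is made up:

```python
# Minimal AdaBoost sketch matching the update above: pick the lowest-error
# weak learner, weight it by alpha_t, and reweight examples to emphasize
# mistakes.  Stumps and data are illustrative.
import math

def adaboost(X, y, stumps, T):
    """X: list of floats, y in {-1,+1}, stumps: candidate h(x) functions."""
    n = len(X)
    D = [1.0 / n] * n
    ensemble = []                       # pairs (alpha_t, h_t)
    for _ in range(T):
        # Weak learner = stump with the lowest weighted training error.
        h = min(stumps, key=lambda h: sum(D[i] for i in range(n) if h(X[i]) != y[i]))
        err = sum(D[i] for i in range(n) if h(X[i]) != y[i])
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):                           # final classifier: sign of weighted vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```

Note the two weightings the notes mention: alpha over the learners, D over the examples.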
>>> Continuous-Valued Learners
 predict probability, not just presence/absence
>>> Weak Learners
 Predicates: attribute = value (a=v)
 PreviousWord = the
 Boosting selects those predicates that produce better
classification accuracy.
>>> Predicate \to Classifier
 Define a predicate \phi on instance x: x \to {0,1}
 p_b is the prediction made when \phi(x) = b
 h(x) = p_{\phi(x)}
 p_b = 1/2 ln (w^b_{+1})/(w^b_{-1})
>>> Multi-label boosting
 Sometimes, we want more than 1 tag as output!
 Use AdaBoost.MH
 Find p_b for each class independently.
>>> Features:
 Lexical attributes
 Contextual attributes
 Morphological attributes
>>> PP attachment
# I warned [the president of pecedilis]
# I warned [the president] [of pecedilis]
>> Improving Accuracy in Word Class Tagging through the
>> Combination of Machine Learning Systems
! van Halteren, Daelemans, Zavrel
 gang method: average, voting, etc.
 arbiter method: use a learner to decide which classifier to follow
Why does combining help?
 models may have similar accuracy, but they may make different
errors.
Ensemble = the combined method
Arcing methods:
 bagging: sample with replacement to build N classifiers, then
combine them.
 boosting
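The bagging variant above can be sketched as resample-and-vote. The learner used here (a majority-class baseline) is a stand-in of my own to keep the example self-contained; the structure is what matters:

```python
# Bagging sketch: train N copies of a learner on bootstrap resamples of
# the training set, then combine predictions by majority vote.
import random

def majority_class_learner(sample):
    """Toy learner: always predict the most common label in its sample."""
    labels = [label for _, label in sample]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def bagging(train, learn, n_models, rng):
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(train) examples with replacement.
        resample = [rng.choice(train) for _ in train]
        models.append(learn(resample))
    def predict(x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)   # majority vote
    return predict
```

Because each model sees a different resample, their errors decorrelate, which is exactly the "similar accuracy, different errors" point above.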
[11/05/01 04:34 PM]
> Project Schedule
 Proposed topic by Oct 10
 Fiveminute proposal Oct 17
 5-min project reviews Nov 7th (this wed)
 15 min project presentations Nov 26th and 28th
 final deliverables dec 14th
> Class Schedule
 Information bottleneck: monday before thanksgiving
 !!check this!! 19th?
> A General Finite-State Formalism
Generalize regexps to weighted rational transductions:
 reversible, composable input-output patterns
 weighted alternatives
 target for learning algorithms
Sequence models: HMMs, sequence maxent, etc.
 Structure
 Parameter setting
 Learning structure?
>> Weights
Weight semiring: generalize the notion of multiplicity (as in
multisets).
Multiplicity: how many different ways can we recognize a string?
Might include P, might not. Weighting is not necessarily a
probability.
 Sum: compute the weight of an object from the weights of its
possible derivations. Associative, commutative.
 Product: compute the weight of a derivation from the weights of
its steps. Associative, distributes over sum.
 0: 0+x=x; 0*x=0
 1: 1*x=x
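Two standard instances make the semiring axioms concrete: the probability semiring (+, *) and the tropical semiring (min, +) used for shortest paths. A small sketch (the dict-based encoding is my own convenience, not a standard API):

```python
# Weight semirings: sum combines alternative derivations, product chains
# the steps of one derivation.  Encoded here as plain dicts of operations.

PROB     = dict(plus=lambda a, b: a + b, times=lambda a, b: a * b,
                zero=0.0, one=1.0)
TROPICAL = dict(plus=min, times=lambda a, b: a + b,
                zero=float('inf'), one=0.0)

def score_paths(paths, K):
    """Weight of an object = semiring-sum over its derivations of the
    semiring-product of each derivation's step weights."""
    total = K['zero']
    for steps in paths:
        w = K['one']
        for step in steps:
            w = K['times'](w, step)
        total = K['plus'](total, w)
    return total
```

The same traversal yields a total probability under PROB and a shortest-path cost under TROPICAL, which is why the semiring abstraction pays off: one algorithm, many weightings.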
>> Regular Transductions vs. Regular Expressions
# Regexp vs. Rational Transduction:
# meaning: set of strings  vs.  function from pairs of strings to weights
# element: {a}  vs.  \lsemantics a:b/w\rsemantics(u,v)  (a to b, cost w)
# sequence: \lsemantics ST\rsemantics = \lsemantics S\rsemantics\lsemantics T\rsemantics  vs.  \lsemantics ST\rsemantics(t,w) = \sum_{rs=t, uv=w} \lsemantics S\rsemantics(r,u) * \lsemantics T\rsemantics(s,v)
# alternation: \lsemantics S|T\rsemantics = \lsemantics S\rsemantics \cup \lsemantics T\rsemantics  vs.  \lsemantics S+T\rsemantics = \lsemantics S\rsemantics + \lsemantics T\rsemantics
# closure: \lsemantics S*\rsemantics  vs.  \lsemantics S*\rsemantics = \sum_k \lsemantics S\rsemantics^{k}
# composition: (none)  vs.  \lsemantics S\circ T\rsemantics(u,w) = \sum_v \lsemantics S\rsemantics(u,v) * \lsemantics T\rsemantics(v,w)
>> Composition of Weighted Transducers
 Composition rule:
#  s \longrightarrow s' (a:b/u)    t \longrightarrow t' (b:c/v)
# ------------------------------------------------
#  (s,t) \longrightarrow (s',t') (a:c/(u*v))
 Lazy algorithm with optional memoization
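The composition rule can be sketched as arc-pair matching over flat arc lists (real implementations are lazy over states, as noted; this eager list version is just for illustration):

```python
# Sketch of weighted transducer composition: an s->s' arc a:b/u composes
# with a t->t' arc b:c/v into a (s,t)->(s',t') arc a:c/(u*v).
# Arcs are tuples (src, dst, input, output, weight); weights multiply.

def compose(arcs1, arcs2):
    result = []
    for (s, s2, a, b, u) in arcs1:
        for (t, t2, b2, c, v) in arcs2:
            if b == b2:                       # middle labels must match
                result.append(((s, t), (s2, t2), a, c, u * v))
    return result
```

States of the composed machine are pairs, exactly as in the rule above; a lazy version would only expand pairs reachable from the start.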
>> Learning
 Compile n-gram stats, HMMs, etc. into this form
 Compile decision trees into transducers
 Compile transformation-based taggers into transducers
 Direct automata learning by state merging
>>> Trainable edit distance
Make weighted transducers to model edit errors.. Train an edit
distance learner..
>>> Determinization
 it's not always possible to determinize a weighted transducer
 Instead of having sets of states, have sets of state/output
pairs.
DAWG = directed acyclic word graph = minimized form of a trie
(retrieval tree).
Start with DAWG, and then merge states.
>> K-Reversibility
 A k-reversible automaton = deterministic, and the reversed version of
the automaton is deterministic with lookahead k.
 Means that, if you look back k steps, then you know where you must
have come from.
[11/07/01 04:58 PM]
> Comments for my presentation
 Should feature values be more general?
 Should feature objects have IDs, and FeatureList return IDs?
 Better use of numpy? (behindthescenes stuff)
 Abstract the notion of a feature value list (instead of just a
list?)
 Extraction whee
Instance \to FeatureList \to FeatureValueList
 what is a "FeatureValueList"? Sequence? Map? We want to be able
to iterate over it..
 Should factory separate train/get\_classifier (with train applying
to a single text)?
>> Basic Classes/Interfaces
 Feature: apply() id() [**]
 FeatureList: detect(), +, len()
 FeatureValueList:
 iterate over (id,val)
 request val for an id?
 LabeledType
 ClassifierI
 ClassifierFactoryI
 FeatureSelectorI
 LabeledFeatureValue
 pdf1: P(LabeledFeatureValue|Label)
 samples = ?? Maybe LabeledFeatureValueList ??
 pdf2: P(Label)
 LabeledFeatureValueProbDist
 samples = LabeledFeatureValueList
 event1 = LabelEvent
 event2 = FeatureValueEvent
 Uses NBProbDist?
 NBProbDist:
 events
 P(inst) = \prod P(event)
 Have a different PDF for each event?
 Apply smoothing on each PDF..?
 Does smoothing apply to prob dists or freq dists??
 Notion of a random variable?
>>> Random thoughts..
 Terminology:
 Feature vs FeatureValue
 FeatureExtractor vs Feature
 FeatureExtractor vs FeatureValue
 FeatureExtractorList (??)
>>> Features
Features have the following aspects:
 Feature Extractor
 Feature Value
 Feature ID
How do they relate? Well, FeatureExtractors produce FeatureValues.
Also, each feature has a unique integer identifier. Integer because
that makes it much easier to do things with arrays.
FeatureExtractorList: LabeledText \to FeatureValueList
FeatureExtractorList[FeatureID] \to FeatureExtractor
FeatureValueList[FeatureID] \to FeatureValue
FeatureExtractorList.apply(LabeledText) \to FeatureValueList
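The mappings above can be sketched directly. The class names follow the notes' own design discussion, but the implementation details (list index as feature ID, dict-backed sparse value list) are hypothetical choices of mine, not a settled API:

```python
# Sketch of the feature interfaces discussed above.  A FeatureExtractorList
# maps a labeled text to a sparse FeatureValueList keyed by integer IDs.

class FeatureValueList:
    def __init__(self, assignments):
        self._assignments = dict(assignments)
    def __getitem__(self, fid):
        return self._assignments.get(fid, 0)   # the default is always zero
    def assignments(self):
        return self._assignments.items()       # iterate over (id, val)

class FeatureExtractorList:
    def __init__(self, extractors):
        self._extractors = list(extractors)    # list index serves as feature ID
    def __getitem__(self, fid):
        return self._extractors[fid]
    def apply(self, labeled_text):
        values = {fid: ext(labeled_text)
                  for fid, ext in enumerate(self._extractors)}
        # Sparse: store only non-default (non-zero) values.
        return FeatureValueList({f: v for f, v in values.items() if v})
```

Integer IDs make the array-backed variant (ArrayFeatureValueList below) straightforward, which is the rationale the notes give.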
>>> Classes
 FeatureExtractor (=class?)
 FeatureValue (=any?)
 FeatureExtractorListI (sparse)
 SimpleFeatureExtractorList
 FeatureValueListI (sparse)
 SimpleFeatureValueList
 ArrayFeatureValueList
What does a FeatureValue contain, other than just the value? Is there
a reason to use a real class/interface, rather than just a value?
You need to be able to iterate through feature value lists.. Have
an items() member or some such? Or assignments()? I could even
define a new class:
 FeatureAssignment = \langle FeatureID, FeatureValue\rangle
And have something like:
for fa in feature\_value\_list.assignments():
The alternative is:
for (id,val) in feature\_values.assignments():
The default is *always* zero.
#               +----------------------+
#               | FeatureExtractorList |
# LabeledText -->       extract       --> FeatureValueList
#               +----------------------+
[11/12/01 04:56 PM]
> Probabilistic Latent Semantic Indexing
! Thomas Hofmann
(PLSI)
Domain: documents d with words w
Problem: model P(d, w)
Simple solution: MLE
Want: semantically similar words to be similar
Solution: dimensionality reduction
Observation: (w, d)
Associate latent class var z with each (w,d)
Generative:
 select d with P(d)
 select z with P(z|d)
 select w with P(w|z)
>> Aspect Model
Independence assumptions:
 (d,w) pairs are generated independently (bag of words)
 Conditional independence: P(w|z,d) = P(w|z)
 P(w|d) is a convex combination of factors/aspects P(w|z)
Since |Z| << |D|, the z layer acts as a "bottleneck" reducing the
space..
Each document has a single mixture of z's.
>>> Training
Maximize the log likelihood of the data
Use EM
> Latent Dirichlet Allocation
! David Blei, Andrew Ng, Michael Jordan
[11/19/01 04:39 PM]
> Information Bottleneck Method
 From "information" to "relevant information"
 what is the information content vs. what is the relevant
information content.
 what is the relevant information content?
 ill-posed question: depends on what we want to know.
 exact text: traditional information theory
 what happened?
 style
 author
 political biases
 etc.
 We want information *about* something
 goals:
 quantify "information about"
 lossy compression of information sources, preserving the
information that we care about
>> Formalization
 observed variable X
 variable of interest Y
 how much information does X have about Y?
 I(X;Y)
Goal:
 summarize X into X~, preserving information about Y.
 Probabilistic summarization rule P(X~|X)
>> Assumptions
 Summary does not carry info about Y that's not already in X
 Therefore, we have a markov chain, so the following is valid:
 P(x~|y) = \sum_x p(x~|x)P(x|y)
 Fix a given compression rate
 Maximize I(X~;Y)
>> Variational Principle
 Use a lagrange multiplier
L[p(x~|x), T] = I(X~;Y) - T*I(X~;X)
 Summaries are exhaustive: \sum_{x~} p(x~|x)=1
 T=0: no compression
 T=\infty: sketchy summary
 T = \delta I(X~;Y)/\delta I(X~;X)
 T = 1/\beta
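To make the two terms of the objective concrete, here is a small helper computing mutual information from a joint distribution table (illustrative, not from the paper; the bottleneck trades I(X~;X), compression, against I(X~;Y), preserved relevance):

```python
# Mutual information I(X;Y) in bits from a joint distribution table,
# joint[x][y] = P(x,y).
import math

def mutual_information(joint):
    px = {x: sum(row.values()) for x, row in joint.items()}
    py = {}
    for row in joint.values():
        for y, p in row.items():
            py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for x, row in joint.items():
        for y, p in row.items():
            if p > 0:
                # p(x,y) log [ p(x,y) / (p(x) p(y)) ]
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi
```

Independent variables give 0 bits; a perfectly correlated binary pair gives 1 bit, the two extremes the T=0 and T=infinity limits above move between.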
> Schedule
>> classes
W 21st: ? (I may be gone)
MW 26th and 28th: ?
first week of december: fernando gone
>>> content
Unknown. :) Maybe more latent variables, or something..
>> project
talking project details with fernando: this week or next.
Final report due: 5pm on Thursday 13th.
Friday 14th and Monday 17th = presentations
[11/26/01 04:33 PM]
> EM-Based Clustering for NLP
# 1. Donna read the book
# 2. # Donna read the truck
# 3. # The book read Donna
all are syntactically well-formed; (2) and (3) are semantically anomalous
Find verbargument clusters
 hand-coded lexicon: features (+readable)
 some hidden set of classes
 use EM to find classes
# P(v,n) = \sum_{c\in C} p(c,v,n) = \sum_{c\in C} p(v|c)p(n|c)p(c)
Equivalent to a probabilistic grammar, with rules:
# S \to N_iV_i
# N_i \to n_j
# V_i \to v_k
Use the inside-outside algorithm. 2-word "sentences", so we can do
this in reasonable time.
Since we're doing separate P(v|c) and P(n|c), we can generalize to new
noun-verb combinations.
> A Winnow-Based Approach to
> Context-Sensitive Spelling Correction
>> Intro
 high dimensional feature space
 target concept only depends on a few features
>> Context-Sensitive Spelling Correction
 Problem: spelling errors that result in a real but unintended word
(homophone, typographic, grammatical, cross-word boundaries)
 Approach: WSD
 Confusion set: set of words that might replace each other
 e.g., {hear, here}
Features:
 Context words (e.g., "cloudy" within \pm10 words)
 captures semantics, topic, etc
 Collocations (pattern of contiguous words and/or POS tags)
 e.g., "___ to VERB" ({weather, whether})
 captures local syntax
>> Bayesian Approach
Baseline for comparison. Naive bayes except:
 no independence assumption: detect strong dependencies, try to
remove redundant ones. This tries to produce a (relatively)
independent model.
 Use smoothing (not just MLE)
>> Winnow Approach
\cong 10^{15} items:
 low-level predicates: encode aspects of the current state of the
world (i.e., features)
 high-level concepts: learned as functions of the lower-level
predicates by a "cloud" or ensemble of classifiers (i.e., confusion
sets)
Each confusion set learns its own classifier
Each classifier decides whether a particular word W_i in the confusion
set belongs in the target sentence. I.e., decide whether a given word
"works" in a given context.
>>> Training (1)
Create connections between clouds and features
We have:
 set of active features
 correct confusion set
\to positive example for W_c
\to negative example for W_i i\neq c
Training algorithm:
 Add connection with weight of 0.1 for each new active feature
(for positive examples only, not negative)
 For each old feature:
  if negative feature, demote weight (multiply by \beta, 0.5<\beta<0.9).
  if positive feature, promote weight (multiply by \alpha=1.5).
Problem: not symmetric; if we see a new feature near the end of the
training data, it doesn't get affected by demotions for negative
occurrences..
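The promotion/demotion loop above can be sketched for a single classifier in a cloud. This is a generic mistake-driven Winnow variant of my own, not the paper's exact algorithm; the parameter values follow the notes and the examples are made up:

```python
# Winnow-style update sketch: on a false negative, promote the weights of
# active features (multiply by alpha); on a false positive, demote them
# (multiply by beta).  New active features start at weight 0.1.

def winnow_train(examples, alpha=1.5, beta=0.7, init=0.1, threshold=1.0):
    """examples: list of (active_feature_set, label in {0,1})."""
    w = {}
    for feats, label in examples:
        for f in feats:                     # connect new active features
            w.setdefault(f, init)
        score = sum(w[f] for f in feats)
        pred = 1 if score >= threshold else 0
        if pred == 0 and label == 1:        # false negative: promote
            for f in feats:
                w[f] *= alpha
        elif pred == 1 and label == 0:      # false positive: demote
            for f in feats:
                w[f] *= beta
    return w
```

Because updates are multiplicative and mistake-driven, weights of reliably predictive features grow geometrically until the classifier clears the threshold, then stop changing.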
>>> Weighted Majority
Several parallel classifier clouds decide whether W_i from the
confusion set belongs in the sentence. Each classifier is given a
weight \gamma based on its prediction accuracy.
# C_j is a classifier (\beta = 0.5 \ldots 0.9)
# m_j = number of mistakes made by C_j
# \gamma = 1.0 and decreases with # of examples seen
# prediction: \sum_j \gamma^{m_j} C_j / \sum_j \gamma^{m_j}
Use highest activation level to select an outcome
>> Results
>> Conclusions
[11/28/01 04:30 PM]
> Verb Clustering & Ambiguity Resolution
! Alexandrin Popescul
>> Clustering Verbs Semantically According to Alternations
! Sabine Schulte im Walde
Cluster verbs into semantic classes based on syntactic info and
semantic info for the nouns associated with the verbs
"[Verbs can be semantically classified according to their syntactic
alternation behavior concerning subcat frames and selectional
preferences for args within frames]"
Yay for Levinesque alternations!
>>> Alternation Behavior
 Syntactic subcat frames
 Semantic WordNet classes
subcat frames: the way that verbs combine with args to form VPs.
(focus on objects?)
Refine subcat frames with noun semantic classes: what semantic classes
of nouns can they take?
Use WN synsets & hypernyms to group noun phrases
 selectional preferences
corpus: British National Corpus (5.5M sentences)
 frames that appear at least 2k times (88 frames)
 restrict potential WN classes to 23 nodes
Task:
 cluster 153 manually chosen verbs
 226 senses, 30 handtagged classes
 use Levin's classification for evaluation
>>> Clustering
 agglomerative
 latent class analysis
Input:
 joint freqs of verbs & subcat frames
 frame slot values for nouns
# t = subcat frame
# v = verb
# C = noun class
# P(t|v)
# P(t,C|v)
(doesn't use a coherent probabilistic model)
Use agglomerative clustering of P(t|v)
 start with singleton clusters
 join clusters using something like KL divergence
 restrict cluster size to 4 or less
 recluster large clusters
 when do we stop?
 expensive
>> Using a Probabilistic Class-Based Lexicon for Lexical
>> Ambiguity Resolution
! Detlef Prescher et al
> \ldots
>> Problem Description: IPS
Inference Problem:
 combine outcomes of several different classifiers in a way that
provides a coherent inference that satisfies some constraints.
>>> IPS
IPS = identifying phrase structure
 Instance of inference problem
Problem:
 input string O = o_1, \ldots, o_n
 phrase = a substring of consecutive symbols
 goal = identify the phrase in a stream
Learn classifiers that can recognize the local signals which are
indicative of the existence of a phrase:
 IO model: a symbol is "inside" or "outside" a phrase (variant =
IOB, B = begin a new phrase)
 OC model: a symbol "opens" or "closes" a phrase
We're trying to merge independent classifiers, which makes OC work
better. With IO, there's no state, so classifiers can interfere in
annoying ways. OC allows us to capture some notion of state.
Combine output of the classifiers. Respect constraints:
 phrases can't overlap
 probabilistic constraints on order of phrases, lengths, etc.
>> General Approaches
>>> Approach 1: Markov Modeling
 probabilistic framework that extends HMMs in two ways:
 simple HMM
 projection-based HMM
Train HMM with supervised learning
Incorporating constraints:
 constrain the state transition probability (e.g., set transition
probabilities to 0 when they are disallowed)
Local signal classifiers:
 NB
 SNoW
 Simple HMM
Incorporate local signal classifiers into a single HMM framework.
>>> Approach 2: Constraint Satisfaction with Classifiers
CSCL for IPS
 optimization problem
 encode phrases as variables s.t.
# V = E = {e_i | e_i is a possible phrase}
 f = \bigwedge_{e_i overlaps e_j}(\lnot e_i\lor\lnot e_j), where e_i=1 (0) iff e_i is (not)
a phrase
 cost function c: E\to\R
Approach:
 use graphical model, find the shortest path.
Two issues:
 find \tau
 polynomial time (graphical method)
 use weights, and find shortest path
 determine cost function c?
  natural definition: c(e) = 1 - P(o)P(c)
 use this instead: P(o)P(c)
>> Results
Corpus = WSJ in Penn Treebank
Compare CSCL, HMM, PHMM. Each uses all 3 classifiers
CSCL outperforms PHMM outperforms HMM.
[12/14/01 11:07 AM]
> Document Modeling with Latent Class Models
! Alexandrin A Popescul
Data set:
 documents from citeseer
 "text" and "learning", plus citations
 remove stop words
 porter stemmer
 keep 3k most frequent words (\geq15 tokens)
Automatically cluster documents..
Model:
\sum_z P(z)P(d|z)P(w|z)
5 latent classes
Hard clusters.
 Assign each document to z_d = argmax_z P(z|d)
 Clusters vary in size..