*
* Lecture notes by Edward Loper
*
* Course: CIS 620 (Advanced Topics in AI)
* Professors: Michael Kearns and Lawrence Saul
* Institution: University of Pennsylvania
*
# http://www.cis.upenn.edu/~mkearns/teaching/cis620/cis620.html
[01/09/02 09:31 AM]
> Markov Decision Process
- State space S
- Actions a, b, \ldots from each state
- Transition probabilities P(\cdot|s,a), P(s'|s,a)
- (local) Rewards R(s,a) \in [0,1]
- Discounted reward: r_0 + \gamma r_1 + \gamma^2r_2 + \ldots
(c.f. finite horizon)
- Policy \Pi: states \to actions
- Planning: given MDP, compute (near) optimal policy
We are assuming full observability, i.e., we always know what state
we're currently in.
(Pset 1: available later today/tonight.)
>> Value Functions
- \Pi = Policy
- Define value function V^\Pi(s):
V^\Pi(s) = expected discounted return starting from s and using \Pi.
- V^\Pi(s) = R(s,\Pi(s)) + \gamma\sum_{s'}P(s'|s,\Pi(s))V^\Pi(s')
- Now, define V^*(s) to be the optimal expected discounted return from
  s. I.e., V^*(s) = max_\Pi{V^\Pi(s)}
>> The planning problem
- \exists \Pi^* s.t. V^{\Pi^*}(s) = V^*(s)
  Argument: if a better choice existed for one visit to state s, the
  same choice would be better on every visit to s; so a single
  stationary policy achieves the max in every state.
Planning problem: find \Pi^* (= optimal plan)
- You know the transition probabilities
- You know the rewards
- You want to find the policy
Bellman optimality:
- V^*(s) = max_a{ R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^*(s') }
- Payoff has 2 components: immediate, and discounted.
- \Pi^*(s) = argmax_a{ R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^*(s') }
- Greedy(V^*)
>> Finding the optimal policy
Try to find a value function that makes the Bellman optimality equation true:
- V^*(s) = max_a { R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^*(s') }
>>> Value Iteration
Iterate v, an estimate of V^*(s):
- v(s) \gets max_a { R(s,a) + \gamma \sum_{s'}P(s'|s,a)v(s') }
- If v converges, we're done.
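The iteration above can be sketched in a few lines. A minimal example, assuming a made-up 2-state, 2-action MDP (the transition table and rewards below are invented for illustration):

```python
GAMMA = 0.9
S = [0, 1]
A = [0, 1]
# P[s][a] = list of (next_state, probability); R[s][a] = reward in [0,1]
P = {0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.6), (0, 0.4)]}}
R = {0: {0: 0.1, 1: 0.0}, 1: {0: 0.5, 1: 1.0}}

def backup(v, s, a):
    # R(s,a) + gamma * sum_{s'} P(s'|s,a) v(s')
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a])

def value_iteration(eps=1e-8):
    v = {s: 0.0 for s in S}
    while True:
        v_new = {s: max(backup(v, s, a) for a in A) for s in S}
        if max(abs(v_new[s] - v[s]) for s in S) < eps:  # contraction => converges
            return v_new
        v = v_new

v_star = value_iteration()
pi_star = {s: max(A, key=lambda a: backup(v_star, s, a)) for s in S}  # greedy(V*)
```

greedy(V^*) just reads the maximizing action back out of the same backup.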
v does converge; why?
- Define error \Delta_t = max_s{ |v_t(s) - V^*(s)| }
- \Delta_{t+1} = max_s{ |v_{t+1}(s) - V^*(s)| }
- \Delta_{t+1} = max_s{ |max_a { R(s,a) + \gamma \sum_{s'}P(s'|s,a)v_{t}(s') } -
max_a { R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^*(s') }| }
Note that:
- |max_a{f(a)} - max_a{g(a)}| \leq max_a{|f(a) - g(a)|}
So:
- \Delta_{t+1} \leq max_s{ max_a{ |R(s,a) + \gamma \sum_{s'}P(s'|s,a)v_{t}(s') -
                  R(s,a) - \gamma \sum_{s'}P(s'|s,a)V^*(s')| } }
- \Delta_{t+1} \leq max_s{ max_a{ \gamma |\sum_{s'}P(s'|s,a)(v_{t}(s')-V^*(s'))| } }
For every s', |v_{t}(s')-V^*(s')| \leq \Delta_t.
Since we're averaging over a probability distribution, we can say:
- \Delta_{t+1} \leq \gamma max_{s,a,s'}{ |v_t(s') - V^*(s')| }
So:
- \Delta_{t+1} \leq \gamma\Delta_t
So if we make t\geq log(1/\epsilon)/log(1/\gamma), then we converge to within \epsilon.
Our optimal policy \pi = greedy(v)
If we use a value function that's close to optimal, then is the
resulting policy greedy(v) close to optimal?
n.b., v and V^\pi are not necessarily the same..
The discounting effectively gives the problem a finite horizon, and
that is really what lets us bound how close we will be: our
estimation error is limited.
Time: O(tn^2) for t iterations, n states.
>> Policy Evaluation
Given \Pi, we want to compute V^\Pi(s).
- V^\Pi(s) = R(s,\Pi(s)) + \gamma \sum_{s'}P(s'|s,\Pi(s))V^\Pi(s')
This is basically just n linear equations, in n variables (variables
are v_s=V^\Pi(s))
v_s = R(s,\Pi(s)) + \gamma \sum_{s'}P(s'|s,\Pi(s))v_{s'}
Solving n linear equations: O(n^3) time
But it's exact (not approximate): we can exactly evaluate a given
particular policy in n^3 time..
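This exact evaluation is one linear solve. A minimal sketch, assuming an invented 3-state chain for P(s'|s,\Pi(s)) and R(s,\Pi(s)):

```python
import numpy as np

gamma = 0.9
# row s gives P(s'|s, pi(s)); values below are made up for illustration
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])
R_pi = np.array([0.0, 0.2, 1.0])    # R(s, pi(s))

# v = R_pi + gamma P_pi v  <=>  (I - gamma P_pi) v = R_pi : n equations, n unknowns
v = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
assert np.allclose(v, R_pi + gamma * P_pi @ v)   # Bellman consistency, exact
```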
>> Notation..
- V^\Pi(s) = expected return
- Q^\Pi(s,a) = R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^\Pi(s')
I.e., if we ignore policy \Pi for one step, and do a instead; and
from then on, we use policy \Pi, then what expected discounted
return do we get?
>> Policy Iteration
- Iterate over a policy \pi_t
- \pi_{t+1}(s) = argmax_a { Q^{\pi_t}(s,a) }
- \pi_{t+1}(s) = argmax_a { R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^{\pi_t}(s') }
How does this differ from value iteration? First, there's no
approximation. I.e., we compute Q values directly.
Convergence
- whenever argmax_a{Q^{\pi_t}(s,a)}\neq\pi_t(s), we change our policy
- whenever this happens, it's a good thing (i.e., increases
our value function)
- \exists finite number of policies
- we change whenever we're suboptimal
  - i.e., policy iteration will always converge to the optimal
    solution in a finite number of iterations.
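The loop above can be sketched as exact evaluation plus greedy improvement. The 2-state, 2-action MDP below is made up for illustration:

```python
import numpy as np

gamma = 0.9
nS, nA = 2, 2
# made-up MDP: P[s,a,s'] transition probabilities, R[s,a] rewards
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.4, 0.6]
R = np.array([[0.1, 0.0],
              [0.5, 1.0]])

def evaluate(pi):
    # exact policy evaluation: solve (I - gamma P_pi) v = R_pi
    P_pi = P[np.arange(nS), pi]
    R_pi = R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

pi = np.zeros(nS, dtype=int)
while True:
    v = evaluate(pi)
    Q = R + gamma * P @ v            # Q(s,a) computed exactly, no approximation
    pi_new = Q.argmax(axis=1)        # greedy improvement
    if np.array_equal(pi_new, pi):   # no state is suboptimal: done
        break
    pi = pi_new
```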
>>> Notation\ldots
- P(s'|s,\pi,t) = probability that we're in state s' after t
  steps, starting from s and following policy \pi.
- In particular:
  - P(s'|s,\pi,1) = P(s'|s,\pi(s))
  - P(s'|s,\pi,0) = {1 if s'=s; 0 otherwise}
Policy iteration & value iteration are relatable in a precise way
[01/14/02 09:32 AM]
> Probability Theory
We want to know: if we've been to state s, and action a has led to
some distribution of results.. then what does probability theory tell
us about what we expect?
>> Characterizing a random variable
>>> Moments
A random variable x has moments:
# E[x^n] = \int p(x)x^ndx
We can write this very compactly by referring to the generating
function. Generating function:
# f(k) \triangleq E[e^{kx}]
#      = E[\sum((k^nx^n)/n!)]
#      = \sum((k^n/n!)E[x^n])
We can "generate" moments from the generating function by
differentiating:
# d^n/dk^n[f(k)]_{k=0} = E[x^n] = \int p(x)x^ndx
This can be significantly easier than computing the integrals directly.
>>> Example: Bernoulli distribution
# x \in {0,1}
#        { 1-\mu if x=0
# p(x) = {
#        { \mu   if x=1
# f(k) = E[e^{kx}]
#      = (1-\mu) + \mu e^k
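A quick numerical check of this generating function (mu = 0.3 is an arbitrary choice): differentiating f at k=0 should recover E[x] = mu.

```python
import math

# f(k) = E[e^{kx}] = (1 - mu) + mu * e^k for a Bernoulli(mu) variable
mu = 0.3
f = lambda k: (1 - mu) + mu * math.exp(k)

# differentiate at k=0 (central difference) to "generate" the first moment
h = 1e-5
d1 = (f(h) - f(-h)) / (2 * h)
assert abs(d1 - mu) < 1e-8      # E[x] = mu
```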
>>> Example: Gaussian
# p(x) = 1/sqrt(2\pi\sigma^2) * e^{-(x-\mu)^2/2\sigma^2}
# \mu = E[x]
# \sigma^2 = E[x^2] - E[x]^2
# f(k) = E[e^{kx}] = e^{\mu k + \sigma^2k^2/2}
>> KL Distance
A measure of distance between two probability distributions. (KL
divergence?)
Continuous:
# KL(p,q) \triangleq \int dx p(x)ln(p(x)/q(x))
Discrete:
# KL(p,q) \triangleq \sum p(x)ln(p(x)/q(x))
KL divergence is always non-negative.
Using the following lower bound on ln:
# ln(z) \geq 1-1/z
we can show that KL divergence is non-negative:
# KL(p,q) \geq \sum p(x)(1 - q(x)/p(x))
# = \sum (p(x)-q(x))
# = \sum p(x) - \sum q(x)
# = 0
KL divergence is asymmetric:
# KL(p,q) \neq KL(q,p)
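A minimal sketch of the discrete case, on two made-up three-outcome distributions, checking both non-negativity and asymmetry:

```python
import math

def kl(p, q):
    # discrete KL(p,q) = sum_x p(x) ln(p(x)/q(x)); terms with p(x)=0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
assert kl(p, q) >= 0 and kl(q, p) >= 0   # always non-negative
assert kl(p, p) == 0                     # zero when the distributions agree
```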
Define the step function:
#             { 1 if z\geq0
# \theta(z) = {
#             { 0 if z<0
Upper bound:
# \theta(z) \leq e^{kz} for all k>0
"Independent and identically distributed" = IID
Let x_1, \ldots, x_n be IID with mean \mu=E[x]
>> Central Limit Theorem (CLT)
- Large deviation theorem characterizes the tails of the
distributions.
- What about the center of the distribution?
Let x_1, x_2, \ldots, x_n be IID, with mean \mu=E[x] and variance
\sigma^2=E[x^2]-E[x]^2, where the variance is bounded (\sigma^2<\infty). Let:
# y \triangleq 1/sqrt(n) \sum(x_i-\mu)
Then the distribution of this sum converges to a gaussian in the limit
that n becomes very large:
# lim_{n\to\infty} p(y) = 1/sqrt(2\pi\sigma^2) e^{-y^2/2\sigma^2}
Note: convergence is rapid for the center of the distribution, but
actually fairly slow for the tails.
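A small empirical check of the statement, assuming Uniform[0,1] samples (so \sigma^2 = 1/12): the centered sums, scaled by 1/sqrt(n), should have mean near 0 and variance near \sigma^2.

```python
import random, statistics

random.seed(0)
n, trials = 200, 2000
mu = 0.5                             # mean of Uniform[0,1]
ys = []
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    ys.append(sum(x - mu for x in xs) / n ** 0.5)  # y = (1/sqrt(n)) sum (x_i - mu)

# y should be approximately Gaussian with variance sigma^2 = 1/12
assert abs(statistics.mean(ys)) < 0.05
assert abs(statistics.variance(ys) - 1 / 12) < 0.02
```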
[01/16/02 09:32 AM]
>> Rollout
Value iteration and policy iteration are special cases of rollout:
# \hat{V}_{t+1}(s) \gets R(s, \hat\pi_t(s)) +
#   \gamma   \sum_{s'} P(s'|s, \hat\pi_t, 1)R(s', \hat\pi_t(s')) +
#   \gamma^2 \sum_{s'} P(s'|s, \hat\pi_t, 2)R(s', \hat\pi_t(s')) +
#   \ldots +
#   \gamma^k \sum_{s'} P(s'|s, \hat\pi_t, k)R(s', \hat\pi_t(s'))
> The learning problem
What if we only have the ability to sample the environment?
- not given the P(\cdot|s,a) or R(s,a)
\to find them by sampling. how to sample?
# M = generic name for a Markov decision process (MDP)
# M = \langle S, A, P(\cdot|s,a), R(s,a)\rangle
All of these assume full observability (we always know what state
we're in) Also, we know S and A.
>> Generative models (aka simulators)
# [ ] \to S' distributed as P(\cdot|s,a)
# (s,a) \to [ G_M ]
# [ ] \to R = R(s,a)
Strong model of information access: we can sample any place in the
markov model. Queries don't interact (we don't have to worry about
the "risk" of taking an irreversible action).
Simple strategy: just sample each (s,a) enough to estimate the
parameters.
Surprisingly, this is often available
- e.g., in a simulation, we can re-run the same situation
repeatedly. (e.g., computer backgammon: we can always play one
move from any board configuration, and see what happens).
Reasonable goal for the generative model: find the optimal policy
>>> Policy evaluation with generative models
- One approach: sample extensively, then use planning approach
- Better approaches?
Given policy \pi, learn V^\pi(s)
# V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'}P(s'|s,\pi(s))V^\pi(s')
# V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'}P(s'|s,\pi,1)R(s',\pi(s'))
#            + \gamma^2 \sum_{s''}P(s''|s,\pi,2)V^\pi(s'')
This isn't too useful for planning (we can just use the simple
equation); but for learning, we have easier access to these
equations.
Define \tau to be an infinite sequence of \langle s,r\rangle values (a trajectory
through the markov model) produced by the policy:
# \tau = \langle s_0, r_0\rangle \to \langle s_1, r_1\rangle \to \langle s_2,r_2\rangle \to \ldots
Monte carlo algorithm: sample s/r\ldots
TD(k) = Temporal Difference
- k = horizon
- \alpha = learning rate
Alternative algorithm to find \hat{V}^\pi (estimate of V^\pi):
# \hat{V}^\pi_{t+1}(s_0) \gets (1-\alpha)\hat{V}^\pi_t(s_0) +
#   \alpha [r_0 + \gamma r_1 + \ldots + \gamma^{k-1}r_{k-1} + \gamma^k \hat{V}^\pi_t(s_k)]
Usually, we just set \alpha=1/t
- this transforms things into a running average
"bias-variance" tradeoff for k:
- if k is large, we get error from increased variance (sampling
  larger groups of random variables)
- if k is small, \hat{V}^\pi gives us error from bias
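A minimal sketch of the TD(k) update, assuming a made-up two-state chain under a fixed policy; \alpha = 1/t gives the running-average behavior mentioned above:

```python
import random

random.seed(1)
gamma, k = 0.9, 3

def step(s):
    # made-up policy-induced chain: returns (reward, next state)
    if s == 0:
        return 0.0, (1 if random.random() < 0.5 else 0)
    return 1.0, 0

V = {0: 0.0, 1: 0.0}
for t in range(1, 5001):
    alpha = 1.0 / t                      # running-average learning rate
    s0 = random.choice([0, 1])
    s, target = s0, 0.0
    for j in range(k):                   # sum k sampled rewards...
        r, s = step(s)
        target += gamma ** j * r
    target += gamma ** k * V[s]          # ...then bootstrap with current estimate
    V[s0] = (1 - alpha) * V[s0] + alpha * target
```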
>> Episodic learning (reset-to-start)
# reset \_\_\_\_\_\_\_\_\_\_\_
# \searrow\swarrow \nwarrow
# S\searrow [ ] \to S' distributed as P(\cdot|s,a)
# a \to [ G_M ]
# [ ] \to R = R(s,a)
Designated start state.
We can always reset to the start state; but we can't reset to just any
state..
Some parts of the state space may be arbitrarily difficult to get to.
(We may not care about those areas; they don't affect optimal actions
from other states)
(This can be generalized to a distribution of start states)
Now we have to worry about consequences, but we don't have an
exploration/exploitation tradeoff.
>> Full learning (No-reset model)
No reset at all.
# \_\_\_\_\_\_\_\_\_\_\_
# \swarrow \nwarrow
# S\searrow [ ] \to S' distributed as P(\cdot|s,a)
# [ G_M ]
# a \to [ ] \to R = R(s,a)
Exploration/exploitation tradeoff
What is reasonable to expect from the learning algorithm?
[01/23/02 09:31 AM]
> TD(k) and Policy Evaluation
>> Policy Evaluation
- Policy evaluation-planning
\to linear equations in V^\pi, so we can just solve for it.
- Phased TD(k) algorithm
\to in the case of planning, we might as well stick with k=1;
but for run-time learning, we may want to use phased TD(k) with
k>1.. (assume generative model: we can reset to any state)
# \hat{V}^\pi_{t+1}(s) \gets 1/n \sum_{i=1}^n [r_0^i + \gamma r_1^i + \ldots + \gamma^{k-1}r_{k-1}^i +
#   \gamma^k\hat{V}^\pi_t(s^i_k)]
Here, the superscript i indexes the n sampled trajectories (r_0^i, etc.).
Bias-variance tradeoff:
- if we trust V more, use smaller k: minimize error from variance of
markov process.
- if we trust V less, use larger k: minimize error from bias of V
As long as k<\infty, we'll never actually converge..
Q: why not just use k=\infty, do it once, and ignore V? Because we might
have a good a priori estimate of V (e.g., if we're doing policy
iteration, then presumably V doesn't change that much between
iterations).
# V^\pi(s) = E[r_0 + \gamma r_1 + \ldots + \gamma^kV^\pi(s_k)]
# V^\pi(s) = E[r_0] + \gamma E[r_1] + \ldots + \gamma^kE[V^\pi(s_k)]
The samples we are considering are r_t^i:
# r_0^1 + \gamma r_1^1 + \ldots + \gamma^kV^\pi(s_k^1)
# r_0^2 + \gamma r_1^2 + \ldots + \gamma^kV^\pi(s_k^2)
Basically, we're averaging each of these columns. One potential
problem: the columns don't represent independent random variables: if
we reached an unlikely set of states at t=1, then we probably reached
an unlikely set of states at t=2.
We will show uniform convergence: despite the dependencies, we'll
get a good estimate.
We'll consider a single column.. In particular, the set of rewards
r^i we get on a given time step in trial i of walking the mdp.
>>> Lemma 1
Let r^1, r^2, r^3, \ldots, r^n \in [0,1] iid random variables drawn according
to some distribution. E[r^i] = \mu. Then:
# P[|1/n \sum[r^i]-\mu| \geq \epsilon] \leq e^{-\epsilon^2n/3}
(this is a "large deviation bound")
I.e., the probability that the sampled mean is off from the true mean
by more than \epsilon is at most e^{-\epsilon^2n/3}.
So we want n on the order of C/\epsilon^2.
>>> Lemma 2
But the thing we really want to bound is:
# P[|1/n \sum[r_j^i] - E[r_j]| \geq \epsilon for any j]
How do we do this? (Especially since they're not independent.) Well,
to be conservative, we can use the "union bound":
# P(a\lor b) \leq P(a)+P(b)
So now we've bounded the probability of each.. so we can immediately
say:
# P[|1/n \sum[r_j^i] - E[r_j]| \geq \epsilon for any j] \leq ke^{-\epsilon^2n/3}
# \leq e^{log(k)-\epsilon^2n/3}
Entropy vs energy interpretation: interpret "\epsilon^2n/3" as energy term,
"log(k)" as entropy term..
Introduce a confidence parameter \delta (upper bound on "failure
probability"):
# \delta \geq ke^{-\epsilon^2n/3}
# \epsilon = sqrt(3log(k/\delta)/n)
So with probability \geq (1-\delta) over phase t data, all the reward averages
are simultaneously within this \epsilon of their expectation.
>>> Error bound on policy evaluation
Define:
# \Delta_t \triangleq max_s\{|\hat{V}^\pi_t(s) - V^\pi(s)|\}
Then with probability \geq 1-\delta:
# \Delta_{t+1} \leq ((1-\gamma^k)/(1-\gamma)) * \epsilon + \gamma^k\Delta_t
# \Delta_{t+1} \leq ((1-\gamma^k)/(1-\gamma)) * sqrt(3*log(k/\delta)/n) + \gamma^k\Delta_t
This comes from:
# (1-\gamma^k)/(1-\gamma) = 1 + \gamma + \gamma^2 + \ldots + \gamma^{k-1}
The "variance" part is:
# ((1-\gamma^k)/(1-\gamma)) * sqrt(3*log(k/\delta)/n)
The "bias" part is:
# \gamma^k\Delta_t
When k is very large:
# \Delta_{t+1} \leq sqrt(3*log(k/\delta)/n)/(1-\gamma)
When k=1:
# \Delta_{t+1} \leq sqrt(3*log(1/\delta)/n) + \gamma\Delta_t
We can then use the union bound over iteration phases, to limit our
overall error. Then we can solve \Delta_t..
# \Delta_t \leq \epsilon(1-\gamma^{kt})/(1-\gamma) + \gamma^{kt}\Delta_0
# \Delta_t \leq \epsilon(1-\gamma^{kt})/(1-\gamma) + \gamma^{kt}/(1-\gamma)
As t\to\infty:
# \Delta_t \leq \epsilon/(1-\gamma)
# \Delta_t \leq (1/(1-\gamma))*sqrt(3*log(k/\delta)/n)
So our asymptotic error is nonzero. The residual asymptotic error is:
# \Delta_t \leq (1/(1-\gamma))*sqrt(3*log(k/\delta)/n)
So smaller k gives smaller residual error.
But the convergence is really controlled by the \gamma^{kt} term, so the
larger k is, the faster we'll converge.
So decrease k as we go?
The algorithm "TD(\lambda)" has a continuous parameter \lambda (instead of k) that
basically allows you to take a weighted average of the results of
using k=1, k=2, \ldots. The extremes \lambda=0 and \lambda=1 correspond to k=1 and k=\infty.
> Learning Optimal Policies
>> Q-Learning
Value iteration:
# Q^*(s,a) = R(s,a) + \gamma \sum_{s'}P(s'|s,a)V^*(s')
# = R(s,a) + \gamma \sum_{s'}P(s'|s,a)max_b{Q^*(s',b)}
Use this formulation to produce a learning algorithm, Q-learning: if
we experience the transition [s --a/r--> s'] (s to s' with action a and
reward r):
# \hat{Q}_{t+1}(s,a) \gets (1-\alpha)\hat{Q}_t(s,a) +
#   \alpha[r + \gamma max_b \hat{Q}_t(s',b)]
How much data do we need for this estimate to converge to Q^*? (as a
function of the number of states in the environment)
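A minimal sketch of the Q-learning update on a made-up deterministic 2-state, 2-action MDP, using generative-model access (every (s,a) is queried in each phase) and \alpha = 1/n(s,a) as a running average:

```python
gamma = 0.9
# made-up deterministic MDP: (s,a) -> (reward, next state)
T = {(0, 0): (0.1, 0), (0, 1): (0.0, 1),
     (1, 0): (0.5, 0), (1, 1): (1.0, 1)}
Q = {sa: 0.0 for sa in T}
n = {sa: 0 for sa in T}

for _ in range(2000):
    for sa in T:                     # generative model: query every (s,a)
        n[sa] += 1
        alpha = 1.0 / n[sa]          # alpha = 1/t makes this a running average
        r, s2 = T[sa]
        Q[sa] = (1 - alpha) * Q[sa] + alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]))

greedy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
```

(With decaying \alpha, convergence toward Q^* is quite slow; the greedy ordering of actions settles much sooner.)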
[02/04/02 09:32 AM]
> Overview
Where we're going..
- We need compact representations to encode a complex world
- MDPs aren't compact enough -- exponential blowup of independent
  inputs.
- how can we discover structure in the world?
- clusters (e.g., all states such that f(s) is true)
- embedding (compression?)
  - factorization: what is independent?
- use probabilistic models to learn structure
  - bayes nets
    - aka probabilistic graphical models
- provide a unified framework for encoding structure
- introduce examples over the next few weeks; then show the
general framework that they fit in.
this week:
- predictions based on conditional probabilities
- conditional probabilities (CPs) for (few) discrete random
variables can be tabulated
- consider continuous random variables (we can't tabulate).
  conditional probabilities must be parameterized. natural
  parametric forms?
- simplest examples: linear & logistic regression
outline for this week:
- linear programming
- matlab
- logistic regression
- gradient ascent
- newton's method
> Linear Regression
Predictions based on continuous variables.
Goal: predict a continuous output y\in\setR from one or more inputs
X=(x_1,x_2,\ldots,x_d)\in\setR^d.
# [y] output
# \nearrow \nearrow \uparrow \nwarrow
# [x_1] [x_2] [x_3] \ldots [x_n] inputs
Model this input/output by a noisy linear map. I.e., our model of
predicting y is:
- y = \sum_i w_ix_i + noise
Abbreviate sum as inner product:
- y = W\cdot X
>> Noise
Model noise by a conditional gaussian distribution.
- P(y|X) = 1/sqrt(2\pi) e^{-1/2(y-W\cdot X)^2}
- E(y|X) = W\cdot X
Can we use this model to learn to predict the future from the past?
If examples are iid from the joint distribution p(X,y), then what is
the probability of the data we saw under this model?
- P(y_1,y_2,\ldots,y_n|X_1,X_2,\ldots,X_n)
Since we're assuming examples are iid, we can reduce this to:
- P(y_1|X_1)P(y_2|X_2)\ldots P(y_n|X_n)
Learning problem: estimate the parameters of the distribution (=W)
Maximize the likelihood:
# argmax[W] P(y_1,y_2,\ldots,y_n|X_1,X_2,\ldots,X_n)
to make it easier, maximize the log likelihood:
# argmax[W] \sum_n log P(y_n|X_n)
# = argmax[W] -\sum_n[1/2 log(2\pi) + 1/2(y_n-W\cdot X_n)^2]
# = argmax[W] -\sum_n 1/2(y_n-W\cdot X_n)^2
# = argmax[W] -\sum_n(y_n-W\cdot X_n)^2
So it's the same as minimizing the sum of squared error
>> How do we maximize a function of several variables?
To find a maximum, use the gradient, and set gradient=0.
L(W) = log likelihood
# L(W) = -\sum_n[1/2 log(2\pi) + 1/2(y_n-W\cdot X_n)^2]
# \partial L/\partial w_\alpha = \sum_n(y_n-W\cdot X_n)x_{\alpha n}
So for all \alpha, we should have:
# \partial L/\partial w_\alpha = 0
This yields a tractable set of linear equations, that we can
explicitly solve.
# \partial L/\partial w_\alpha = \sum_n(y_nx_{\alpha n}-W\cdot X_nx_{\alpha n}) = 0
# \sum_n y_nx_{\alpha n} - \sum_n W\cdot X_nx_{\alpha n} = 0
# \sum_n y_nx_{\alpha n} = \sum_n W\cdot X_nx_{\alpha n}
# \sum_n y_nx_{\alpha n} = \sum_n(\sum_\beta w_\beta x_{\beta n})x_{\alpha n}
# \sum_n y_nx_{\alpha n} = \sum_\beta(\sum_n x_{\alpha n}x_{\beta n})w_\beta
d equations, d unknowns. use matrix inversion to solve.
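A sketch of that linear solve on synthetic data: the generating weights W_true and data sizes are invented for illustration, and the noise is unit-variance as in the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 2000
W_true = np.array([1.0, -2.0, 0.5])      # made-up generating weights
X = rng.normal(size=(n, d))
y = X @ W_true + rng.normal(size=n)      # unit-variance noise, as in the model

A = X.T @ X                              # sum_n x_n x_n^T
b = X.T @ y                              # sum_n y_n x_n
W = np.linalg.solve(A, b)                # d equations, d unknowns
assert np.allclose(W, W_true, atol=0.15) # recovers the generating weights
```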
>>> outer product
define outer product of d-dimensional vectors:
# [UV^T]_{\alpha\beta} = u_\alpha v_\beta
outer product is a dxd square matrix, where element (i,j) is u_iv_j.
>>> getting back to our gradient ascent\ldots
Vector notation for solution:
# [\sum_n x_nx_n^T]W = \sum_m y_mx_m
# W = [\sum_n x_nx_n^T]^{-1}(\sum_m y_mx_m)
Set:
# A = [\sum_n x_nx_n^T]^{-1}
# B = \sum_m y_mx_m
so W = AB.
When do we have trouble inverting the matrix?
- if we have more inputs than examples
- if the examples all lie in a lower-dimensional subspace
Options when we can't invert:
- minimum norm: min{|w|^2} such that \partial L/\partial W = 0
- weight decay: max{L(W)-C|W|^2}
> Logistic Regression
How can we predict a binary output y\in{0,1} from
X=(x_1,x_2,\ldots,x_d)\in\setR^d?
# [y] output
# \nearrow \nearrow \uparrow \nwarrow
# [x_1] [x_2] [x_3] \ldots [x_n] inputs
Model input-output by a conditional Bernoulli distribution:
# P(y=1|X) = \sigma(W\cdot X)
Where we define \sigma(\cdot) as:
# \sigma(z) = 1/(1+e^{-z})
# plot 1/(1+exp(-x));
Properties of \sigma(\cdot):
- \sigma(z)\to1 as z\to\infty
- \sigma(z)\to0 as z\to-\infty
- \sigma(-z) = 1-\sigma(z)
- \partial/\partial z \sigma(z) = \sigma(z)\sigma(-z)
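These properties are easy to check numerically (z = 1.7 is an arbitrary test point):

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7
assert abs(sigma(-z) - (1 - sigma(z))) < 1e-12       # sigma(-z) = 1 - sigma(z)
h = 1e-6
num = (sigma(z + h) - sigma(z - h)) / (2 * h)        # numerical derivative
assert abs(num - sigma(z) * sigma(-z)) < 1e-9        # sigma' = sigma(z)sigma(-z)
assert sigma(50) > 1 - 1e-9 and sigma(-50) < 1e-9    # limits 1 and 0
```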
>> Learning from examples
Assume iid examples y_n\in{0,1}
# P(y_1, y_2, \ldots, y_n|X_1,X_2,\ldots,X_n) = \prod p(y_n|X_n)
Estimate parameters W by maximizing the log likelihood
# argmax[W] \sum_n log P(y_n|X_n)
# = argmax[W] \sum_n[ y_n log[\sigma(W\cdot X_n)] + (1-y_n)log[1-\sigma(W\cdot X_n)] ]
# = argmax[W] \sum_n[ y_n log[\sigma(W\cdot X_n)] + (1-y_n)log[\sigma(-W\cdot X_n)] ]
This is a nonlinear function, whose maximum can't be computed in
closed form. So we can't just set the gradient to zero and solve.
Use iterative algorithms.
[02/06/02 09:39 AM]
Consider: continuous state sequence. problem: predict our next
state.
Gradient ascent:
# W \gets W + \eta \partial L/\partial W
Newton's method:
# W \gets W - [\partial^2L/\partial W\partial W^T]^{-1} \partial L/\partial W
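A minimal sketch of the gradient-ascent option for logistic regression, on made-up data from a noisy linear threshold; the gradient of the log likelihood is dL/dW = \sum_n (y_n - \sigma(W\cdot X_n)) X_n:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d = 400, 2
X = rng.normal(size=(n, d))
# made-up labeling rule: noisy linear threshold
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=n) > 0).astype(float)

W = np.zeros(d)
eta = 0.1 / n
for _ in range(2000):
    # batch gradient ascent: W <- W + eta * sum_n (y_n - sigma(W.x_n)) x_n
    W += eta * X.T @ (y - sigma(X @ W))

acc = ((sigma(X @ W) > 0.5) == (y == 1)).mean()
assert acc > 0.9
```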
[02/11/02 09:31 AM]
> pset 1
pset1 outside moore 554. \mu=30/50
> Review
2 types of optimization problems:
- linear
- nonlinear
* convex: has more structure than generic nonlinear; but more
complex than linear.
Types of solutions:
- exact (closed form)
- iterative
Types of convergence:
- finite (polynomial) time
- asymptotic
* monotonic: each iteration improves the answer
> Convex Sets and Functions
>> Introduce convex sets & functions in 1d.
>>> Convex Set
- A set \Omega \subseteq \setR is convex if \forall x,y\in\Omega:
- (1-t)x + ty \in \Omega for t\in[0,1]
- i.e., a set is convex if we can pick 2 points, and any point
between them is in the set.
>>> Convex Function
- For any line we can draw between 2 points on the function, the
function lies below that line.
- A function f(x) is convex on \Omega if for all points x,y\in\Omega:
- f((1-t)x + ty) \leq (1-t)f(x) + tf(y) for t\in[0,1]
- Note:
- (1-t)f(x) + tf(y) for t\in[0,1] is the line between 2 points.
- f((1-t)x + ty) is the function value between 2 points.
Some convex functions:
- x^2
- e^x
f(x) is concave if (-f(x)) is convex. Some concave functions:
- log(x)
- sqrt(x)
>>> Local Minima are Global minima
Theorem: If f(x) is convex on \Omega, then:
- Any local minimum is a global minimum.
- (of course, any global minimum is a local minimum.)
Proof by contradiction:
- let x^* be a local minimum of f
- assume \exists y s.t. f(y) < f(x^*)
- then for t\in(0,1]: f((1-t)x^*+ty) \leq (1-t)f(x^*)+tf(y) < f(x^*)
- so arbitrarily close to x^* (small t) there are points with smaller
  value, contradicting that x^* is a local minimum.
>>> Convex Functions and Tangents
Theorem: if f(x) is once differentiable, then it is convex on \Omega iff:
- f(y) \geq f(x) + f'(x)(y-x)
Intuition: f(x)+f'(x)(y-x) is the tangent. So what we're saying is
that the tangent of any point lies below or at the function at every
point..
Proof (\Rightarrow):
- for t\in[0,1]: f((1-t)x+ty) \leq (1-t)f(x)+tf(y)
- so f((1-t)x+ty)-f(x) \leq t[f(y)-f(x)]
- so (1/t)[f((1-t)x+ty)-f(x)] \leq f(y)-f(x)
- lhs as t\to0 is the definition of derivative. so take lim t\to0
- f'(x)(y-x) \leq f(y)-f(x)
Proof (\Leftarrow):
- Let x_1,x_2\in\Omega. Let x=(1-t)x_1+tx_2. So x\in[x_1,x_2].
- By assumption: (Eq. A) f(x_1) \geq f(x) + (x_1-x)f'(x)
- By assumption: (Eq. B) f(x_2) \geq f(x) + (x_2-x)f'(x)
- Take linear combination of (Eq. A) and (Eq. B):
- (1-t)f(x_1) + tf(x_2) \geq [(1-t)+t]f(x) + [(1-t)(x_1-x)+t(x_2-x)]f'(x)
- (1-t)f(x_1) + tf(x_2) \geq f(x) + [(1-t)x_1+tx_2 - (1-t)x-tx]f'(x)
- By our definition of x:
- (1-t)f(x_1) + tf(x_2) \geq f(x)
- (1-t)f(x_1) + tf(x_2) \geq f((1-t)x_1+tx_2)
>>> Convexity By Nonnegative Second Derivative
Theorem: if f(x) is twice differentiable, then it is convex on \Omega iff:
- f''(x) \geq 0 \forall x\in\Omega
Proof intuition:
- Use a taylor expansion
- Take the following to be true, on faith: \forall x,y \exists z\in[x,y] s.t.
- f(y) = f(x) + f'(x)(y-x) + (1/2)f''(z)(y-x)^2
- Then f''(z)\geq0 implies convexity..
- f(y) \geq f(x) + f'(x)(y-x)
Examples:
- d^2/dx^2 e^x = e^x \geq 0
- d^2/dx^2 x^2 = 2 > 0
- d^2/dx^2 -log x = 1/x^2 \geq 0
- d^2/dz^2 -ln \sigma(z) = \sigma(z)\sigma(-z) \geq 0
>>> Jensen's inequality
Theorem: let f(x) be a convex function of a random variable x\in\setR.
- E[f(x)] = \int dx p(x)f(x)
or:
- E[f(x)] = \sum_i p(x_i)f(x_i)
Note: x can be defined over discrete set, even though f must be
defined on a continuous set.
Then:
- E[f(x)] \geq f(E[x])
Proof (assuming f(x) is once differentiable):
- E[f(x)] \geq E[f(z)+(x-z)f'(z)] \forall z
- E[f(x)] \geq f(z) + (E[x]-z)f'(z)
- Take z = E[x] (z\in\Omega)
- E[f(x)] \geq f(E[x])
Examples:
- E[x^2] \geq E[x]^2
- E[e^{kx}] \geq e^{kE[x]}
- E[log(x)] \leq log E[x]
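The first example can be checked directly on an arbitrary (made-up) discrete distribution:

```python
p = [0.2, 0.5, 0.3]     # an arbitrary discrete distribution
x = [1.0, 2.0, 4.0]

Ex = sum(pi * xi for pi, xi in zip(p, x))
Efx = sum(pi * xi ** 2 for pi, xi in zip(p, x))   # E[f(x)] for convex f(x) = x^2
assert Efx >= Ex ** 2                             # Jensen: E[x^2] >= E[x]^2
```

The gap Efx - Ex^2 is exactly the variance of x.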
>> Convex Duality
Motivation: if f(x) is convex, and we choose some slope \lambda for a line,
then where will it intercept f(x) such that it touches but is below..
I.e., find the "intercept function" -g(\lambda) such that:
- f(x) \geq \lambda x-g(\lambda)
Ideally, we'd like a tight bound.. :)
A convex function f(x) can be represented in terms of a "conjugate"
(or "dual") function g(\lambda):
- f(x) = max_\lambda{\lambda x-g(\lambda)}
The dual function (which is also convex) is given by:
- g(\lambda) = max_x{\lambda x-f(x)}
These are "Legendre transformations"
Example:
- f(x) = e^x
- g(\lambda) = \lambda log\lambda-\lambda
Which means that:
- e^x \geq \lambda x + \lambda - \lambda log\lambda
Example:
- f(x) = log(x)
- g(\lambda) = 1 - log(1/\lambda) = 1 + log(\lambda)
Which means that:
- f(x) \leq \lambda x - 1 - log\lambda
>> Higher Dimensions
>>> Convex Sets
- If you pick any 2 points, then everything between them must be
included.
>>> Conditions
#   d=1                      d>1
#   ---------------------------------------------
#   x\in\setR                X\in\setR^d
#   f'(x)                    \partial f/\partial X (= \nabla f(X))
#   f''(x)\geq0              (see below)
#   duals: f(x)\geq\lambda x-g(\lambda)   f(X)\geq\Lambda\cdot X-g(\Lambda)
below:
- For all V \in \setR^d:
  - V^T [\partial^2f/\partial X\partial X^T] V \geq 0
[02/18/02 09:28 AM]
> Review... Monotonic Convergence
>> Auxiliary function
Given a (log) likelihood function L (concave) that we wish to
maximize, define an auxiliary function Q(w,w_t), continuous in w and
w_t, s.t.:
- L(w) = Q(w,w)
- L(w) \geq Q(w,w_t)
Pick Q such that it's easy to maximize. Then by maximizing Q around
w_t, we can find a new w that's better than the old one.
# w_{t+1} = argmax_w Q(w,w_t)
>> Logistic Regression
# L(w) = L_+(w) - L_-(w)
# L_+(w) = \sum_n(w\cdot x_n) y_n
# L_-(w) = \sum_nlog[1+exp(w\cdot x_n)]
>> Multiplicative Updates:
# exp(w_\alpha) \gets exp(w_\alpha) [(\partial L_+/\partial w_\alpha)/(\partial L_-/\partial w_\alpha)]^{1/\eta}
Where:
# \eta = max_n \sum_\alpha x_{\alpha n}
(Assumes x_{\alpha n}\geq0)
i.e., the learning rate is set by looking at the "densest" input
vector. E.g., for images (where brighter pixels have higher values),
the densest image sets the learning rate.
This update rule converges monotonically.
>> Auxiliary Function
So what auxiliary function do we want to use?
# Q(w,w_t) = L(w_t) + \sum_{\alpha n} x_{\alpha n}\{y_n(w_\alpha-w_{\alpha t}) -
#   (1/\eta)\sigma_n^{(t)}[exp(\eta(w_\alpha-w_{\alpha t}))-1]\}
Where:
# \sigma_n^{(t)} \equiv 1/(1+exp(-w_t\cdot x_n))
Setting the gradient to zero:
# \partial/\partial w_\alpha Q(w,w_t) = 0
# \sum_n x_{\alpha n} [y_n-\sigma_n^{(t)} exp(\eta(w_\alpha-w_{\alpha t}))] = 0
# exp(\eta(w_\alpha-w_{\alpha t}))(\sum_n\sigma_n^{(t)}x_{\alpha n}) = \sum_nx_{\alpha n}y_n
We're only computing the gradient (not the hessian), so it's cheaper
than other methods (in high dimensionality).
>> Constraints
- algorithm is tailored to sparse nonnegative input
  - nonnegativity is required: otherwise derivatives give complex
    numbers, etc.
- sparsity is desirable: the densest vector limits our learning
rate.
- but: for negative data, we can just adjust..
> State Aggregation
Conditional probabilities for large discrete state spaces must be
parameterized. I.e., P(s'|s) is just too big to write out in
tabular form.
Aggregate states into clusters, and work on the cluster level.
- Assume we have a discrete markov process
- markov process has a countable state space S.
  - markov process has probabilistic dynamics P(s'|s)
- What if |S| is very large?
- Difficult to elicit P(s'|s) from experts
- Difficult to learn from examples (many parameters)
- Why build a model at all?
- Transfer: if we slightly change the task, models are very
useful.
Transform original state space S into an aggregate state space.
- Imagine that states s\in S can be mapped to clusters c\in C and vice
versa:
- P(c|s) = probability that state s is mapped to cluster c
- P(s'|c) = prob that a state s in cluster c is followed by state
s'.
So we could use an alternate notation like:
- P(c_t|s_t) = prob that state s_t is mapped to cluster c_t
- P(s_{t+1}|c_t) = prob that state s_t is followed by s_{t+1}, given
that state s_t maps to cluster c_t.
(but we won't :) )
State transitions s\to c\to s' are mediated by cluster assignments. Thus:
- P(s'|s) = \sum_cP(c|s)P(s'|c)
So we're clustering, but allowing probabilistic fuzziness.
We are using a "matrix factorization" to provide a compact
representation:
- P(s'|s) is a |S|\times|S| matrix (call it A)
- P(c|s) is a |S|\times|C| matrix (call it B)
- P(s'|c) is a |C|\times|S| matrix (call it C)
so we're modelling A as:
- A = B \times C
We're adding some complexity cost: instead of table lookup, we have
matrix multiplication (an inner product for a single entry).
Represent this model as a graph:
# S \to C \to S'
(Each link is characterized by a matrix, but we explicitly do *not* have a
link from S to S')
This decomposition also gives us transitions between clusters..
- P(c_{t+1}|c_t) = \sum_sP(c_{t+1}|s)P(s|c_t)
Inferring cluster labels:
# P(c|s,s') = P(s',c|s)/P(s'|s)
#
# P(s'|c)P(c|s)
# = ---------------------
# \sum_{c'}P(s'|c')P(c'|s)
So given P(c|s) and P(s'|c), we can compute:
- P(s'|s) (inner product)
- P(c|s,s') (bayes rule)
>> Learning
so how do we learn?
>>> Learning from "complete" data
Suppose we have fully labeled trajectories of the form:
- s_1\to c_1\to s_2\to c_2\to s_3\to\cdots\to s_T
What's the probability of this trajectory under our model?
# P(\{s_t,c_t\}_{t=1\ldots T}) = \prod_t P(s_{t+1}|c_t)P(c_t|s_t)
How many times does P(s_{t+1}=s|c_t=c) appear? Define counts:
- N(s\to c) = \# of times state s is followed by cluster c
- N(c\to s) = \# of times cluster c is followed by state s
Use these counts to rewrite P({s_t,c_t}):
# P({s_t,c_t}) = \prod_s\prod_c P(s|c)^{N(c\to s)}P(c|s)^{N(s\to c)}
We want to maximize the log likelihood of the data (L_C=log likelihood
for complete data):
# L_C = \sum_s\sum_c [log(P(s|c))N(c\to s)+log(P(c|s))N(s\to c)]
Then we just do learning by maximizing this log likelihood. In
particular, we want to maximize L_C subject to the constraints:
# \sum_cP(c|s) = 1
# \sum_sP(s|c) = 1
>>> Solution to the complete data form:
Maximum likelihood estimates:
# P(c|s) = N(s\to c)/\sum_{c'}N(s\to c')
# P(s'|c) = N(c\to s')/\sum_{s''}N(c\to s'')
>>> Learning from "incomplete" data
Suppose we just have partially labeled trajectories of the form:
- s_1\to s_2\to s_3\to\cdots\to s_T
I.e., we have a hidden cluster space..
(this seems very similar to HMMs?)
What's the probability of this trajectory under our model?
# P(\{s_t\}_{t=1\ldots T}) = \prod_t P(s_{t+1}|s_t)
# = \prod_t \sum_{c_t} P(s_{t+1}|c_t)P(c_t|s_t)
[02/20/02 09:38 AM]
> Learning from Incomplete Data
We have "partially" labeled trajectories
- s_1\to s_2\to s_3\to\cdots\to s_T
What's the probability of this trajectory under our model?
# P(\{s_t\}_{t=1\ldots T}) = \prod_t P(s_{t+1}|s_t)
# = \prod_t \sum_{c_t} P(s_{t+1}|c_t)P(c_t|s_t)
Let N(s\to s') count the number of transitions from s to s' in the data.
# P(s_1,s_2,\ldots,s_T) = \prod_{s,s'} [\sum_c P(s'|c)P(c|s)]^{N(s\to s')}
Log likelihood (H for "hidden"):
# L_H(s_1,s_2,\ldots,s_T) = \sum_{s,s'} N(s\to s')log\sum_c[P(s'|c)P(c|s)]
Then we just do learning by maximizing this log likelihood. In
particular, we want to maximize L_H subject to the constraints:
# \sum_cP(c|s) = 1
# \sum_sP(s|c) = 1
We can't solve this in closed form. So use an iterative solution.
Use the EM algorithm
>> EM algorithm
Intuitive motivation:
- in the complete-data situation, to maximize the likelihood in the
observed case, we normalized observed counts N(s\to c) and N(c\to s')
- to maximize log likelihood for the hidden-variable setting:
- compute inferred counts of clusters (E-step):
- Nhat_k(s\to c) = \sum_{s'}N(s\to s')P_k(c|s,s')
- Nhat_k(c\to s') = \sum_{s}N(s\to s')P_k(c|s,s')
- where P_k is given by Bayes' rule
- use Nhat_k to find MLE for P_{k+1}
- P_{k+1}(c|s) = Nhat_k(s\to c)/\sum_{c'}Nhat_k(s\to c')
- P_{k+1}(s|c) = Nhat_k(c\to s)/\sum_{s'}Nhat_k(c\to s')
These updates give monotonic convergence..
These are multiplicative updates. Write the gradients as:
# \partial L_H/\partial P(c|s) = \alpha_{cs}
# \partial L_H/\partial P(s'|c) = \beta_{s'c}
The M-step updates can then also be written as:
#                                  \alpha_{cs}
# P_{k+1}(c|s) = P_k(c|s) ---------------------------
#                          \sum_{c'}\alpha_{c's}P_k(c'|s)
#                                   \beta_{s'c}
# P_{k+1}(s'|c) = P_k(s'|c) ----------------------------
#                            \sum_{s''}\beta_{s''c}P_k(s''|c)
>> Proof of Monotonic Convergence
(I got bored)
[02/25/02 09:31 AM]
> Vector Quantization
Assign an integer label to each vector (basically, we're doing
automatic clustering). Choose k centroids. Assign each data point to
the nearest centroid; re-estimate the centroids; repeat. For nicely
clustered data, this can give useful clusters.
K-means algorithm for data {X_n} (each X_n is a vector)
1. Choose \mu_i at random for i=1,2,\ldots,k (each \mu_i is a vector)
2. Assign y_n=argmin_i |\mu_i-x_n|^2
3. Set \mu_i to the mean of {x_n : y_n=i}
4. Go to step 2 (until convergence)
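A minimal runnable sketch of the four steps above (my own illustration; how to handle a cluster that goes empty is an assumption the notes don't specify, so here an empty cluster just keeps its old centroid):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """K-means on data X (N x d). Returns centroids mu (k x d) and labels y."""
    rng = np.random.default_rng(seed)
    # 1. Choose the initial mu_i at random (here: k distinct data points).
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 2. Assign y_n = argmin_i |mu_i - x_n|^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        y = d2.argmin(axis=1)
        # 3. Set mu_i to the mean of {x_n : y_n = i}; empty clusters keep
        #    their previous centroid (an assumption, see above).
        new_mu = np.array([X[y == i].mean(axis=0) if np.any(y == i) else mu[i]
                           for i in range(k)])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, y
```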
Questions:
- what type of data is algorithm suited to?
- what exactly is being optimized?
- why does it converge?
- what if there are missing features?
>> Clustering
Goal:
- divide data into clusters
- assign new data to clusters
>> Mixture Models
Vector quantization is a member of a probabilistic family of models
called "mixture models."
>>> Goal
- model the distribution p(x) from which iid training examples are
drawn. iid examples are {x_n}.
- Generative model (intuition)
1. Roll a k-sided (non-fair) die, with probabilities \rho_i (i=1..k),
\sum_i\rho_i = 1
2. Based on the outcome of the die, sample x from one of k
"component distributions."
3. Identify each component distribution with a cluster.
The overall pdf is a weighted sum of the individual distributions. If
each component distribution is relatively tight, and sparse, then we
can find them..
>>> Hidden variable
- In addition to x (observed variable), introduce a discrete hidden
  (unobserved, latent) variable z \in {1,\ldots,k}.
- z's distribution is given by \rho_z (die probabilities).
Define a joint pdf P(x,z) = P(x|z)P(z)
So P(x) = sum_z P(x|z)P(z)
"mixture model" terminology: we're "mixing" k different distributions.
>>> Graphical model
# z \to x
>>> Component distributions
x is continuous, so we need a parametric form. A useful choice is the
multivariate gaussian.
- basically, a gaussian over multiple dimensions.
# P(x|z=i) = 1/sqrt((2\pi)^d|\Sigma_i|) exp\{-1/2 (x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\}
Each component (cluster) i has its own mean and covariance:
# \mu_i = expected value of x (d-vector)
#       = E[x|z=i]
# \Sigma_i = covariance matrix (d-by-d matrix)
#       = E[(x-\mu_i)(x-\mu_i)^T|z=i]
note! \Sigma is covariance, not sum (=\sum).
Covariance matrix \Sigma_i must be positive definite (basically the
matrix equivalent of saying that a real number must be positive).
Other than that, it can have any value.
# P(x|z=i) = 1/sqrt((2\pi)^d|\Sigma_i|) exp\{-1/2 (x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)\}
# |\Sigma_i| = determinant
# \Sigma_i^{-1} = matrix inverse..
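As a sketch (my own helper, not from the lecture), the density above can be evaluated directly:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate gaussian density:
    P(x) = 1/sqrt((2 pi)^d |Sigma|) exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # solve() avoids forming the explicit inverse: Sigma^{-1} (x-mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / norm
```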
Benefits:
- modelling assumptions are explicit
- know what data will be well-suited to the algorithm
- inferences of many different types are supported by the model.
- e.g., clustering/vector quantization.
- restoration of "missing" features
- novelty/outlier detection
- learning -- viewed as maximum likelihood estimation
>> Learning from labeled examples
Suppose we have labeled examples {x_n, z_n} where z_n \in {1\ldots k}
Define indicator variables:
# { 1 if z_n = i
# Z_{in} = {
# { 0 otherwise
The likelihood of the iid data takes the following form:
# P({x_n,z_n}) = \prod_{n=1..N}P(x_n|z_n)P(z_n)
# = \prod_n\prod_i \{P(z_n=i)P(x_n|i)\}^{Z_{in}}
# = \prod_n\prod_i \{\rho_iP(x_n|i)\}^{Z_{in}}
(Fill in P(x_n|z=i) from big ugly equation above :) )
Work with log likelihood (for fully-observed case) L_O:
# L_O = \sum_{in} Z_{in}log(\rho_i) + Z_{in}log(P(x_n|i))
# L_O = \sum_{in} Z_{in}[log(\rho_i) - 1/2(x_n-\mu_i)^T\Sigma_i^{-1}(x_n-\mu_i)
# -dlog(2\pi)/2 - log(|\Sigma_i|)/2]
Learning is maximizing L_O with respect to {\rho, \mu, \Sigma} subject to the
constraints:
- \sum_i\rho_i = 1
- \Sigma_i covariances are positive definite
>>> Solution
- Let N_i = \sum_nZ_{in} be the number of examples labeled by the
  ith cluster.
- \rho_i = N_i/N
- \mu_i = 1/N_i \sum_nZ_{in}x_n
- \Sigma_i = 1/N_i \sum_n Z_{in}(x_n-\mu_i)(x_n-\mu_i)^T
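The closed-form solution above, as a sketch (the function name and array shapes are my own choices):

```python
import numpy as np

def mle_labeled(X, z, k):
    """Closed-form MLE for a gaussian mixture from labeled data {(x_n, z_n)}:
    rho_i = N_i/N, mu_i = per-class mean, Sigma_i = per-class covariance."""
    N, d = X.shape
    rho = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for i in range(k):
        Xi = X[z == i]                 # examples labeled by cluster i
        Ni = len(Xi)
        rho[i] = Ni / N                # rho_i = N_i / N
        mu[i] = Xi.mean(axis=0)        # mu_i = (1/N_i) sum_n Z_in x_n
        diff = Xi - mu[i]
        Sigma[i] = diff.T @ diff / Ni  # Sigma_i = (1/N_i) sum_n Z_in (x_n-mu_i)(x_n-mu_i)^T
    return rho, mu, Sigma
```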
>> Learning from unlabeled examples
Suppose we have unlabeled examples x_n.
# P(x_1,\ldots,x_N) = \prod_n\sum_zP(x_n|z)P(z)
Work with log likelihood (for hidden-variable case) L_H:
# L_H = \sum_n log [\sum_zP(x_n|z)P(z)]
# L_H = \sum_n log [\sum_iP(x_n|z=i)\rho_i]
(Again, substitute in P(x_n|z=i) from above)
Learning is maximizing L_H with respect to {\rho, \mu, \Sigma} subject to the
constraints:
- \sum_i\rho_i = 1
- \Sigma_i covariances are positive definite
Can't solve in closed form.
>>> Solution: use EM
> EM
Intuitive motivation.. 2 steps:
1. E-step:
- to maximize L_O, we just used counts N_i of the data in each
cluster.
- But we don't have direct access to these counts, since the
training samples are unlabeled.
- We can make guesses, though: try to find a value for N_i that
will make L_H higher than our current estimate, even if it
doesn't give the right value.
- Use Bayes rule:
- P(z|x_n) = P(x_n|z)P(z)/p(x_n)
- P(z|x_n) = P(x_n|z)P(z)/(\sum_zP(x_n|z)P(z))
- P(z=i|x_n) = P(x_n|z=i)\rho_i/(\sum_iP(x_n|z=i)\rho_i)
- Define a surrogate for the real labels:
- \gamma_{in} \equiv P(z=i|x_n)
- Fill this in with our definitions for P(x_n|z=i)
- Use the surrogate in place of the indicator variable Z_{in}.
- Use \gamma to calculate Nhat.
2. M-step
- Use Nhat and the equations from the fully-observed case to
update the parameters:
- \rho_i = Nhat_i/N (where Nhat_i = \sum_n\gamma_{in}, and \sum_iNhat_i = N)
- \mu_i = 1/Nhat_i \sum_n\gamma_{in}x_n
- \Sigma_i = 1/Nhat_i \sum_n \gamma_{in}(x_n-\mu_i)(x_n-\mu_i)^T
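Putting the E-step and M-step together for the gaussian mixture, a sketch (my own illustration; the array shapes and function name are assumptions):

```python
import numpy as np

def em_iteration(X, rho, mu, Sigma):
    """One EM iteration for a gaussian mixture on unlabeled data X (N x d).
    E-step: gamma_in = P(z=i|x_n); M-step: re-estimate rho, mu, Sigma using
    gamma in place of the indicators Z_in."""
    N, d = X.shape
    k = len(rho)
    # E-step: gamma_in proportional to rho_i P(x_n|z=i)  (Bayes' rule).
    gamma = np.zeros((k, N))
    for i in range(k):
        diff = X - mu[i]
        quad = np.einsum('nd,nd->n', diff @ np.linalg.inv(Sigma[i]), diff)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[i]))
        gamma[i] = rho[i] * np.exp(-0.5 * quad) / norm
    gamma /= gamma.sum(axis=0, keepdims=True)
    # M-step: inferred counts Nhat_i = sum_n gamma_in, then the
    # fully-observed formulas with gamma replacing Z.
    Nhat = gamma.sum(axis=1)
    rho = Nhat / N
    mu = (gamma @ X) / Nhat[:, None]
    Sigma = np.stack([
        ((gamma[i][:, None] * (X - mu[i])).T @ (X - mu[i])) / Nhat[i]
        for i in range(k)])
    return rho, mu, Sigma
```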
>> Proof of Monotonic Convergence
Let:
# H = hidden variable
# V = visible variable
# \theta = current parameters
# \theta' = updated parameters
For the mixture model with gaussians:
- H = {z}
- V = {x}
- \theta = {\rho,\mu,\Sigma}
But we'll prove convergence for the general case.
# L(\theta') = \sum_n log P(V_n|\theta')
# L(\theta') = \sum_n log P(H_n,V_n|\theta')/P(H_n|V_n,\theta')
This is true for any value of H_n (the H_n-dependence cancels). So we
can average over any distribution on H_n: it is always true that
# \sum_x p(x)Z = Z
if Z doesn't depend on x and p is a distribution. In our case,
Z = log P(V_n|\theta'), and we average over the posterior P(H_n|V_n,\theta).
# L(\theta') = \sum_n \sum_{Hn} P(H_n|V_n,\theta)[log P(H_n,V_n|\theta')/P(H_n|V_n,\theta')]
# L(\theta') = \sum_n \sum_{Hn} P(H_n|V_n,\theta)[log P(H_n,V_n|\theta')/P(H_n|V_n,\theta) +
#              log(P(H_n|V_n,\theta)/P(H_n|V_n,\theta'))]
We can interpret:
# \sum_{Hn} P(H_n|V_n,\theta) log(P(H_n|V_n,\theta)/P(H_n|V_n,\theta'))
as the KL divergence between the distributions parameterized by \theta and \theta'.
KL divergence is always non-negative: use it to generate an inequality.
# L(\theta') \geq \sum_n \sum_{Hn} P(H_n|V_n,\theta) log[P(H_n,V_n|\theta')/P(H_n|V_n,\theta)]
# Q(\theta',\theta) \equiv \sum_n \sum_{Hn} P(H_n|V_n,\theta) log[P(H_n,V_n|\theta')/P(H_n|V_n,\theta)]
- L(\theta') = Q(\theta',\theta')
- L(\theta') \geq Q(\theta',\theta)
It follows that L(\theta) improves monotonically under:
- \theta' = argmax_{\theta'} Q(\theta',\theta)
[02/27/02 09:32 AM]
> Mixture Model
We need to solve for the parameters:
- \rho_i = P(z=i)
- \mu_i, \Sigma_i = multivariate gaussian parameters
# H = hidden variable
# V = visible variable
# \theta = current parameters
# \theta' = updated parameters
Find an auxiliary function Q of the log likelihood:
# Q(\theta',\theta) \equiv \sum_n \sum_{Hn} P(H_n|V_n,\theta) log[P(H_n,V_n|\theta')/P(H_n|V_n,\theta)]
Use iterative scaling:
# \theta' = argmax_{\theta''} Q(\theta'', \theta)
>> EM
>>> E step
Compute the term in Q(\theta',\theta) that depends on \theta'
# Q(\theta',\theta) \equiv \sum_n \sum_{Hn} P(H_n|V_n,\theta) log[P(H_n,V_n|\theta')] + \ldots
(We don't care about the second term, since it doesn't depend on \theta'.
So, it won't affect our argmax!)
We can rewrite this as a posterior expectation of the log joint
distribution:
# Q(\theta',\theta) = \sum_n E[log P(H_n,V_n|\theta') | V_n,\theta]
(Thus the name E-step for "expectation")
For the mixture model:
# \prod_n P(x_n,z_n) = \prod_n\prod_i \{\rho_iP(x_n|i)\}^{Z_{in}}
where:
# { 1 if z_n = i
# Z_{in} = {
# { 0 otherwise
Note that in this context, Z_{in} are random variables (since the
labels z_n are random variables). But they have very simple expected
values:
# E[Z_{in}] = P(z_n=i)
# E[Z_{in}|x_n] = P(z_n=i|x_n) = \gamma_{in}
# \sum_n E[log P(H_n,V_n|\theta') | V_n,\theta]
#  = \sum_n E[\sum_i Z_{in}(log \rho_i' - 1/2(x_n-\mu_i')^T\Sigma_i'^{-1}(x_n-\mu_i')
#             - 1/2 log((2\pi)^d|\Sigma_i'|)) | x_n,\theta]
#  = \sum_{ni} \gamma_{in}(log \rho_i' - 1/2(x_n-\mu_i')^T\Sigma_i'^{-1}(x_n-\mu_i')
#             - 1/2 log((2\pi)^d|\Sigma_i'|))
# \gamma_{in} = P(z_n=i|x_n) = P(x_n,z_n=i)/P(x_n)
(compute with \theta, not \theta')
> Union Model
Consider a binary event z \in {0,1} that is triggered by one or more
precursor events y_i \in {0,1} (n.b.: strong notion of causality):
# z = \bigvee_i y_i (logical OR)
This is one possible model of "sensor fusion". Sensors are noisy.
Examples
- Security: an alarm is sounded when you detect one or more
intrusions. But there might be false alarms, wind, etc.
- Vision: an object is spotted if it appears in any subregion of the
viewing area.
- Speech: to detect if someone is speaking, break signal into freqs,
and see if we detect speech at any freq.
- Email: separate detectors for the to:, cc:, attchmnt:, and
  subject: fields.
[03/04/02 09:33 AM]
# [z [y_1 x_1 x_2 ... x_k] [y_2 x_1 x_2 ... x_k] ... [y_n x_1 x_2 ... x_k] ]
>> Sensor model
# P(y_i=1|x_i) = \sigma(w_i\cdot x_i)
If any sensor fires (y_i=1), then the output (z) is 1.
# P(z=1|y) = 1-\prod_i(1-y_i)
Take probability that sensors didn't fire; take product of that to get
probability that no sensors fired; one minus that is the probability
that any sensor fired..
# P(z=1|{x_i}) = 1-\prod_i\sigma(-w_i\cdot x_i)
>> Inference
Given that we know something about z, what inferences can we make
about the y's and x's? (x's fixed)
If z didn't fire, then we know that all y_i's didn't fire.
# E[y_i|z=0] = P(y_i=1|z=0) = 0 for all i
If we learn that z fired, then it can only increase our confidence
that each of the y_i's fired:
# E[y_i|z=1,{x_j}] = P(y_i=1|z=1,{x})
# \geq P(y_i=1|{x})
# = E[y_i|x_i]
Proof:
# P(y_i=1|z=1,{x}) = P(y_i=1, z=1|{x})/P(z=1|{x})
Because of our model, z is independent of the x's, given the y's.
# P(y_i=1|z=1,{x}) = P(y_i=1|{x})P(z=1|y_i=1)/P(z=1|{x})
We know that P(z=1|y_i=1)=1 so:
# P(y_i=1|z=1,{x}) = P(y_i=1|{x})/P(z=1|{x})
# P(y_i=1|z=1,{x}) = \sigma(w_i\cdot x_i)/[1-\prod_i\sigma(-w_i\cdot x_i)]
Denominator is always less than one:
# P(y_i=1|z=1,{x}) \geq \sigma(w_i\cdot x_i) = P(y_i=1|{x})
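A numeric sketch of this inference (my own helper names; `sigmoid` is the \sigma above). It computes each sensor's posterior P(y_i=1|z=1,{x}) from its prior \sigma(w_i\cdot x_i), dividing by P(z=1|{x}) exactly as in the proof:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def posterior_given_fire(ws, xs):
    """Sensor posteriors P(y_i=1 | z=1, {x}) for the union model:
    P(y_i=1|x_i) = sigma(w_i.x_i),  P(z=1|{x}) = 1 - prod_i sigma(-w_i.x_i)."""
    priors = np.array([sigmoid(w @ x) for w, x in zip(ws, xs)])
    p_fire = 1.0 - np.prod(1.0 - priors)   # uses sigma(-t) = 1 - sigma(t)
    return priors / p_fire
```

Since P(z=1|{x}) is at least as large as each prior, the posteriors stay in [0,1] and dominate the priors, matching the inequality just proved.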
>> Learning
Can we use a global learning algorithm, without combinatorial
explosion of training time? Use EM..
>>> Incomplete Data scenario
Assume we have a set of partially labelled examples {\langle \{x_{in}\},z_n\rangle}
(n indexes over examples). E.g., for visual id, we are told which
images contain the target, but not where in the image.
[Footnote: If the x_{in} are actually different instances of the same
object, then this type of scenario is called "multiple instance
learning." In reality, this is just a special case of sensor
fusion..?]
Maximize log likelihood:
# L = \sum_n log P(z_n|x_{1n}, x_{2n}, \ldots, x_{kn})
Break into positive and negative examples:
# L = \sum_n z_n log P(z_n=1|x_{1n}, x_{2n}, \ldots, x_{kn}) +
# (1-z_n) log P(z_n=0|x_{1n}, x_{2n}, \ldots, x_{kn})
# L = \sum_n z_n log[1-\prod_i\sigma(-w_i\cdot x_{in})] + (1-z_n) log[\prod_i\sigma(-w_i\cdot x_{in})]
# L = \sum_n z_n log[1-\prod_i\sigma(-w_i\cdot x_{in})] + (1-z_n) \sum_i log \sigma(-w_i\cdot x_{in})
Highly nonlinear: use an iterative algorithm.
>> EM
- EM is an iterative algorithm for optimizing the (global) log
likelihood.
- But, at each iteration, it decouples different sensors.
>>> E step
# \theta = {w_i}
# \theta' = {w_i'}
# \sum_n E[log P(y,z|{x_{in}},\theta') | z_n,\theta]
# \gamma_{in} = E[y_i|z_n,{x_{jn}},\theta] =
#             P(y_i=1|z_n,{x_{jn}},\theta) =
#             z_n \sigma(w_i\cdot x_{in})/(1-\prod_j\sigma(-w_j\cdot x_{jn}))
> Markov Models
- Random variable s_t \in {1,2,\ldots,n} = state at time t.
- Markov assumption about dynamics ("k-th order markov model"):
- P(s_t|s_1, s_2, \ldots, s_{t-1}) = P(s_t|s_{t-k}, \ldots, s_{t-1})
Strengths..
- simple to compute P(sequence)
- simple to estimate model parameters from data
Weaknesses
- to model kth order correlations, we need to store n^k parameters
- in the real world, we don't have direct access to state variable
- "partially observable worlds"
> Hidden Markov Models
# P(s,o) = P(s_1)\prod P(s_t|s_{t-1})\prod P(o_t|s_t)
Graphical model:
# [S_1 O_1 [S_2 O_2 [S_3 O_3 [... [S_t O_t]]]]]
Example: speech recognizer for the word "cat"
- state = phoneme
- observation = acoustic signal (here, for the word "cat")
- states = [silence, c, a, t, silence]
>> Parameters
Initial state probabilities:
# \pi_i = P(s_1=i)
Transition probabilities:
# a_{ij} = P(s_{t+1}=j|s_t=i)
Emission probabilities:
# b_{ik} = b_i(k) = P(o_t=k|s_t=i)
>> Computing the likelihood
Use decoder lattices and the forward, backward, and Viterbi algorithms, etc..
[I've done this enough times that I don't want to write notes for it
again.. See manning & schutze for a good explanation.]
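For reference, a minimal forward-algorithm sketch (the standard algorithm, not transcribed from the lecture) using the \pi, a_{ij}, b_i(k) parameters defined above:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(o_1..o_T) for an HMM with initial probs pi[i],
    transitions A[i, j] = P(s_{t+1}=j | s_t=i), and emissions
    B[i, k] = P(o_t=k | s_t=i)."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i b_i(o_1)
    for o in obs[1:]:
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1})
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()                 # P(obs) = sum_i alpha_T(i)
```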
[03/20/02 10:30 AM]
> Bayes Nets
Two-component model. Qualitative component is a DAG: it defines
independence properties. Quantitative component is a "conditional
probability table" which lists the probability of a variable for each
combination of values of its *parents* in the DAG.
Bayes net = G + CPT. G is a DAG. CPT is a conditional probability
table.
Generally, the DAG represents causality. More formally, it represents
info about conditional independence..
>> Semantics
Always:
# P(X_1,\ldots,X_n) = \prod_i P(X_i|X_1\ldots X_{i-1})
Special: (*)
# P(X_1,\ldots,X_n) = \prod_i P(X_i|pa(X_i))
Where pa(X_i) are the parents of X_i in G.
Definition:
- \forall G \forall P, G "represents" P if P obeys factorization (*) given by G.
- "General" independence conditions specified by G?
>> Example:
B = burglary
A = alarm
E = earthquake
J = john
M = mary
Graph:
# B\to A; E\to A
# A\to J; A\to M
CPT:
# B E | P(A=1)
# ------------
# 0 0 | \ldots
# 0 1 | \ldots
# 1 0 | \ldots
# 1 1 | \ldots
>> "Explaining away" and nonmonotonicity
There's some marginal probability of burglary:
# P(B=1) small
Now suppose that we observe the event that the alarm went off
# P(B=1|A=1) large
Now suppose that we additionally observe the event that there was an
earthquake:
# P(B=1|A=1,E=1) small
This last step, where P(B=1|\ldots) gets small again, is known as
"explaining away." We have a probabilistic explanation for the alarm,
and as a result, P(B=1|\ldots) gets small again.
This is a useful property -- it gracefully incorporates the fact that
our beliefs can be revised on the basis of increasing evidence.
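A numeric sketch of explaining away on this network (the CPT numbers are invented for illustration; the lecture left them as \ldots). We compute the three probabilities above by brute-force enumeration over B, E, A:

```python
import itertools

# Hypothetical CPT numbers -- chosen only to make the effect visible.
P_B = {1: 0.01, 0: 0.99}
P_E = {1: 0.02, 0: 0.98}
P_A = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}  # P(A=1|B,E)

def joint(b, e, a):
    """P(B=b, E=e, A=a) from the factorization P(B)P(E)P(A|B,E)."""
    pa1 = P_A[(b, e)]
    return P_B[b] * P_E[e] * (pa1 if a == 1 else 1 - pa1)

def prob_B(query, **evidence):
    """P(B=query | evidence) by enumeration (evidence keys: 'E', 'A')."""
    num = den = 0.0
    for b, e, a in itertools.product((0, 1), repeat=3):
        if any({'B': b, 'E': e, 'A': a}[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, e, a)
        den += p
        if b == query:
            num += p
    return num / den

p_prior = prob_B(1)            # P(B=1): small
p_alarm = prob_B(1, A=1)       # P(B=1|A=1): large
p_both  = prob_B(1, A=1, E=1)  # P(B=1|A=1,E=1): small again
```

With these numbers the burglary probability jumps when the alarm is observed, then drops back down once the earthquake explains it away.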
>> Causality
Choose the direction of the DAG to match causality. This tends to give us
more efficient representations. But we can actually choose any
ordering. All that the bayes net represents is info about
correlations, not about causality. When interpreting a bayes net
graph, we have to be careful not to assume that it really asserts
causality..
>> Conditional independences
- a node is conditionally independent, given its parents, of all its
  non-descendants.
Let X and Y be variables. S is a set of observed values.
S = {J=1, M=0}
When can we assert that X and Y are independent, given S? If every
(undirected) path from X to Y is "blocked" by S (defined below).
>>> Blocking
# X-\cdots\to Z \to\cdots-Y
A path p btwn X and Y is "blocked" by S if there is a single node Z\in S
on the path and edges along path touching Z go in the *same*
direction. Here, we've observed an "intermediate cause".
# X-\cdots\gets Z \to\cdots-Y
A path p btwn X and Y is "blocked" by S if there is a single node Z\in S
on the path and edges along path touching Z go away from Z. Here,
we've observed a "common cause".
# X-\cdots\to Z \gets\cdots-Y
#          /\
#         /  \
#        /no S\
#       /______\
A path p btwn X and Y is "blocked" by S if there is a single node Z\notin S
on the path, and edges along path touching Z go towards Z, and no
descendants of Z are in S.
Say X and Y are d-separated by S in G if every path from X to Y is
blocked by S.
If X and Y are d-separated by S in G then X and Y are conditionally
independent. If X and Y are not d-separated by S in G then there
exists some P represented by G s.t. X and Y are conditionally
dependent.
> Inference in Bayes Nets
Input:
# Bayes net
# Set S of observations (e.g., {J=1,M=0})
# Variable X
Desired output:
# P(X|S)
Goal:
- tractable inference: running time polynomial in the size of the net
[03/25/02 10:34 AM]
Hm. Assignment 6 is due today. Huh. Ok. :) I guess I'll do that
tonight..
PROJECTS!! :)
>> Polytrees
Definition: a DAG is a polytree if it contains no *undirected*
cycles.
Today: efficient algorithm for inference in polytree Bayes nets
>>> Definitions
Node Q, our query node, has parents and children.
# \ldots \ldots \ldots
# U_1 U_2 U_3
# \ | /
# Q
# / | \
# V_1 V_2 V_3
# \ldots \ldots \ldots
Where U's parents & children may be observed; and V's parents and
children may be observed; and U's and V's may be observed. Since it's
a polytree, U's are not connected to V's (except via Q). Furthermore,
U's are not connected to each other and V's are not connected to each
other.
Definition:
- S = set of evidence nodes
- \forall X,Y: S(X,Y) \subseteq S = set of evidence nodes reachable from X
avoiding Y.
For example, the evidence in U_1's blob is S(U_1,Q). The evidence in
V_i's blob is S(V_i,Q)..
Definitions:
- S^+ = \bigcup_i S(U_i,Q)
- S^- = \bigcup_i S(V_i,Q)
>>> Algorithm
Use message passing to distribute information.
# For every X\to Y:
# 1. X sends to Y: P(X=x,S(X,Y)) \forall x
# 2. Y sends to X: P(S(Y,X)|X=x) \forall x
X sends Y a message for each value of X, which gives the joint
probability of the setting of x and the probability of all evidence
connected to X (but not connected to Y).
Y sends X a message for each value of X, which gives the conditional
probability of the evidence connected to Y (but not X), given the
setting of X to the given value..
We still need to convince ourselves:
1. base case: leaf nodes can find message values
2. recursive case: non-leaf nodes can find message values
3. we can compute P(Q|S) given messages to Q.
Do #3 first (find P(Q=q|S)). First, reduce the conditional to a joint
probability:
# P(Q=q|S) = P(Q=q,S)/P(S) = \alpha P(Q=q,S)
Then find the joint probability:
# P(Q=q,S) = P(Q=q,S^+)P(S^-|Q=q, S^+)
Use d-separation (on Q, since for the right probability, Q is on the
right hand side of the conditional -- it's evidence (in this
circumstance, at least)) (this is d-sep case 1):
# P(S^-|Q=q, S^+) = P(S^-|Q=q)
So:
# P(Q=q,S) = P(Q=q,S^+)P(S^-|Q=q)
So we can find P(Q=q|S) if we can find:
# P(Q=q,S^+)
# P(S^-|Q=q)
For the first one:
# P(Q=q,S^+) = \sum_{ubar}P(Ubar=ubar,S^+)P(Q=q|Ubar=ubar,S^+)
Because it's a bayes net, we can reduce this to:
# P(Q=q,S^+) = \sum_{ubar}P(Ubar=ubar,S^+)P(Q=q|Ubar=ubar)
We can find P(Q=q|Ubar=ubar) in the CPTs.
By d-separation condition 3:
# P(Ubar=ubar,S^+) = \prod_i P(U_i=u_i,S(U_i,Q))
So:
# P(Q=q,S^+) = \sum_{ubar}(\prod_i P(U_i=u_i,S(U_i,Q)))P(Q=q|Ubar=ubar)
Get the left part from messages; and the right part from CPTs.
[03/27/02 10:36 AM]
! Note: new tutorial materials are now on the web page.
Review:
- Bayes net = DAG+CPTs
- Represent *any* P(X_1,\ldots,X_n)
- Order of decomposition matters!
- d-separation: if X,Y d-sep by S \Rightarrow P(X,Y|S)=P(X|S)P(Y|S)
- Problem of inference: compute P(X|S)
- Polytree algorithm: linear run time
>> Non-Polytree
What happens if the underlying DAG is not a polytree?
>>> Approach 1: make it a tree.
# season -> sprinkler -> wet
# season -> rain -> wet
# wet -> slippery
Node merging:
- Combine sprinkler & rain into a single 4-value variable.
- Everything that happens within the node is done exhaustively
- Size of merged node table is exponential
- Works well if you have a few small loops; not good with big loopy
networks.
- Exponential in the biggest merged node's size
Cut-set conditioning:
- Create 2 new graphs, by fixing the value of a node in the cycle
(e.g., season) to each possible value.
- Essentially "removes" the node from the graph -- eliminates cycle.
- Run polytree on new graphs (which are polytrees).
- Exponential in the cut-set size (number of nodes we "removed")
>>> Approach 2: Get as far from a tree as you can
- Polytrees: each node has a "small" number of "strong" influences
- Alternative: large numbers of weak influences
Consider for simplicity a 2-layer bayes net. Variables:
# u_1 u_2 ... u_n = U
# x_1 x_2 \ldots x_m = X
We have full connectivity from each u_i to each x_j.
We can think of U as "hidden causes" and X as "observable effects".
By d-separation on U's:
# P(X|U) = \prod_j P(x_j|U)
Marginal independence of the u's:
# P(U) = \prod_i P(u_i)
# P(U|X) \neq \prod_iP(U_i|X)
Parametric representations..
- Give a weight vector to each node
- Use weight vectors to give a compact representation of the CPTs
  P(x_i|U). Otherwise, these would be exponential.
- Assume P(x_i=1|U) = \sigma(\theta_i\cdot U)
Think of the weights as being roughly uniform (e.g., within 2 orders
of magnitude of each other)
- limits the influence of individual inputs.
This is a standard definition of a neural network!
Possible definitions for \sigma-esque functions include:
- sigmoid \sigma = (1+exp(-z))^{-1}
- noisy-or \oplus = 1-e^{-z}
(Assume each input bias p_i=1/2, just for convenience -- we can handle
the general case)
# P(x_1=1,\ldots,x_m=1) = (1/2)^n\sum_U\prod_j\oplus(\theta_j\cdot U)
[04/01/02 10:39 AM]
\sigma(z) = 1-e^{-z}
We don't like 1-exponentials, so upper bound it with an exponential,
depending on what part of the 1-exp we're looking at:
# log(1-e^{-z}) \leq \lambda z-g(\lambda)
# 1-e^{-z} \leq e^{\lambda z-g(\lambda)}
# g(\lambda) = (\lambda+1)log(\lambda+1) - \lambda log\lambda
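A quick numeric check of this bound (my own verification, not from the lecture). The bound holds for every \lambda > 0 and is tight at \lambda = e^{-z}/(1-e^{-z}), where the gradient condition of the conjugate is satisfied:

```python
import math

def g(lam):
    # g(lambda) = (lambda+1) log(lambda+1) - lambda log(lambda)
    return (lam + 1) * math.log(lam + 1) - lam * math.log(lam)

def bound(z, lam):
    # Exponential upper bound on 1 - e^{-z}:  e^{lambda z - g(lambda)}
    return math.exp(lam * z - g(lam))
```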
[04/08/02 09:36 AM]
> Game Theory
>> Where does this fit?
Previous topics included:
- reinforcement learning: learning how to act in an uncertain
environment
- unsupervised learning & bayesian networks: probabilistic modeling
and inference
Both were passive environments ("nature").
Game theory is a reasonable starting point to look at problems of
interaction.
>> Single-stage matrix games
- Players i=1, \ldots, n (start with n=2)
- Each player has actions a_1\ldots a_k (start with k=2, actions={0,1})
- Payoff matrices (1 per player): matrix M_i for player i that is
indexed by the actions of the 2 players.
# M_i(x_1,x_2) x_1,x_2\in{0,1}
Specifies the payoff to player i if player 1 plays x_1 and player
2 plays x_2.
- Moves are simultaneous
>>> Example: Prisoner's dilemma (confess/deny)
# M_1 M_2
# x_1=c x_1=d x_1=c x_1=d
# x_2=c -5 -10 x_2=c -5 0
# x_2=d 0 -1 x_2=d -10 -1
Can also write as (entries are (M_1,M_2)):
#        x_1=c     x_1=d
# x_2=c (-5,-5)  (-10,0)
# x_2=d (0,-10)  (-1,-1)
A "stable solution" is one where neither player has any incentive to
change their action.
Stable solution = (confess, confess)
>>> Example: Penny matching (heads/tails)
# x_1=h x_1=t
# x_2=h (1,0) (0,1)
# x_2=t (0,1) (1,0)
Player 1 wins if the moves agree; Player 2 wins if the moves
disagree. This is a "zero-sum" game (well, actually it's a
"constant-sum" game, but for strategic purposes, they're the same).
No stable entry. But there is a stable strategy:
- both players flip a fair coin.
>>> Mixed Strategies
Define p_i \triangleq Pr[x_i=1] be a randomized strategy for player i
Define M_i(p_1,p_2) = E(M_i(x_1,x_2)) where x_1 is drawn from p_1, x_2
from p_2.
(p_1,p_2) is a "mixed strategy" for the game ("mixed"=probabilistic).
# M_1(p_1,p_2) = (1-p_1)(1-p_2)M_1(0,0) +
#                (1-p_1)p_2M_1(0,1) +
#                p_1(1-p_2)M_1(1,0) +
#                p_1p_2M_1(1,1)
If we fix p_2, then M_1(p_1,p_2) is a linear function of p_1. So if
p_2 is fixed, there are 3 possible ways the function can look:
positive slope, negative slope, zero slope:
- if positive slope: play fixed action 1
- if negative slope: play fixed action 0
- if zero slope: play anything
>>> Nash Equilibrium
Definition: a mixed strategy (p_1,p_2) is a "Nash Equilibrium" for the
matrix game (M_1,M_2) if two symmetric conditions hold:
- M_1(p_1,p_2) \geq M_1(p_1',p_2) \forall p_1'\in[0,1]
- M_2(p_1,p_2) \geq M_2(p_1,p_2') \forall p_2'\in[0,1]
So neither player has a unilateral incentive to deviate.
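As a sketch (my own check, not from the lecture), we can verify this condition numerically for penny matching. Actions are coded 0=heads, 1=tails, and matrices are indexed M_i[x_1][x_2]; a grid search over deviations stands in for the "for all p'" quantifier:

```python
import numpy as np

def is_nash(M1, M2, p1, p2, grid=101, tol=1e-9):
    """Check the Nash condition for a 2x2 game at mixed strategy (p1, p2):
    no unilateral deviation p' in [0, 1] improves a player's expected payoff.
    M_i(p1, p2) = E[M_i(x1, x2)] with x_j ~ Bernoulli(p_j)."""
    def expect(M, q1, q2):
        probs = np.outer([1 - q1, q1], [1 - q2, q2])  # P(x1, x2)
        return (probs * M).sum()
    v1 = expect(M1, p1, p2)
    v2 = expect(M2, p1, p2)
    for q in np.linspace(0, 1, grid):
        if expect(M1, q, p2) > v1 + tol or expect(M2, p1, q) > v2 + tol:
            return False
    return True
```

For penny matching, (1/2, 1/2) passes the check, while any pure strategy pair fails, matching the discussion above.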
>>> Example: Correlated equilibria (go/stop)
# x_1=g x_1=s
# x_2=g (-10,-10) (0,-1)
# x_2=s (-1,0) (-1,-1)
>> Nash's Theorem
For any matrix game (M_1,M_2) there exists a (possibly mixed) Nash
equilibrium. Holds for any n.
>>> Sperner Labeling
- Consider an equilateral triangle
- Triangulate into smaller equilateral triangles
- segment with 5 horiz sections, then segment each section with
smaller equilateral triangles
# \triangle
# \triangle\triangle
# \triangle\triangle\triangle
# \triangle\triangle\triangle\triangle
# \triangle\triangle\triangle\triangle\triangle
A sperner labeling:
- Puts a 0, 1, or 2 label at every vertex of the segmented triangle
- exterior corners are labeled 0, 1, and 2.
- Vertices on exterior sides can not have the label from the
opposite corner.
- Internal vertices get any label
# 0
# /\
# / \
# 1--0
# /\ /\
# / \/ \
# 0---1--0
# /\ /\ /\
# / \/ \/ \
# 1--2---1---0
# /\ /\ /\ /\
# / \/ \/ \/ \
# 0---2--0---1---2
# /\ /\ /\ /\ /\
# / \/ \/ \/ \/ \
# 1---1---2---1---1--2
Sperner's Lemma: any sperner labeling results in at least one small
triangle with vertices labeled (0,1,2).
Intuition:
- Think of each triangle as a room
- A "door" is any (0,1) edge.
- Since we always move through (0,1) edges, we must be in either a
(0,1,0), a (0,1,1), or a (0,1,2) triangle. If it's (0,1,2), we're
done. Otherwise, go through the other "door". Keep going through
"doors" until we get to a (0,1,2) triangle.
Proof:
- First, add edges on (0,1) side of big triangle.
- (ensures that there is exactly 1 exterior "door" = (0,1) edge)
- Enter the triangle containing an exterior (0,1) edge
- If not in a (0,1,2) triangle, but entered a triangle via a (0,1)
  edge, then enter the other adjacent triangle across a (0,1)
  edge.
- We will never re-visit any triangle.
- Consider the first triangle that we revisit
  - To revisit it, we must re-enter through a (0,1) door we
    already used. But then the triangle on the other side of that
    door was itself visited twice before now -- so this wasn't
    the first triangle we revisited. Contradiction.
- We can't "leave" the castle, since there is only one exterior
(0,1) edge.
Brouwer's Fixed Point Theorem (informal statement): any continuous
function that maps a convex region to itself has a fixed point.
>>> Brouwer's Fixed Point Theorem
- Let S be any (n-dimensional) simplex (e.g., S=equilateral triangle
  in \setR^2)
- Let \phi: S\to S be any continuous mapping
- Then \exists x s.t. \phi(x)=x
Proof sketch (for special case of the equilateral triangle):
- Let T_1, T_2, \ldots, T_n, \ldots be a nested sequence of successively finer
triangulations.
- Define labelings L_1, L_2, \ldots, L_n, \ldots for each triangulation T_n.
- Triangle labeling:
- Let x be an interior vertex of the triangulation T_n that we're
  trying to label.
- Draw a ray from x through \phi(x).
- Each edge of the outer triangle determines a single label
- Label x with the label for the (outer) edge that the ray passes
through
- Break ties (at vertices) in favor of lowest label
Hm, this might not be quite right. But the idea is that we're
supposed to be able to define a labeling that depends on \phi(x) that's a
sperner labeling.
- So in every labeling L_n under \phi, \exists a small (0,1,2) triangle.
- Let x_i be the center of a (0,1,2) triangle in L_i
- (from real analysis) the infinite sequence x_1, x_2, \ldots, x_n, \ldots has
  a convergent subsequence with limit x.
Claim: \phi(x) = x.
- intuition: x is the center of an arbitrarily small 0-1-2 triangle.
- in a 0-1-2 triangle, \phi maps the corners (which are close to each
other) in very different directions: the angles spanned by the
corner vectors are \geq \pi/3.
- continuity says that if we pick a small enough region R around x,
then the size of the corresponding \phi(R) must be small
- so R must overlap \phi(R).
- Make R arbitrarily small, and x=\phi(x).
[04/17/02 09:34 AM]
> Graphical Games
Players 1, \ldots, n with joint action xbar and mixed strategy pbar.
# digraph ggame { size="4,2";rankdir=LR
# 1 -> 2 -> 4 -> 6 -> 7 -> 8
# 1 -> 3 -> 4 -> 5 }
Represent using local payoffs:
Size is exponential in the max degree, not in n.
>> Tree
Assume that G is a tree. Then we can define an efficient algorithm..
# digraph tree { size="4,2";
# U [color=red]; A [color=red]; B [color=red]; C [color=red]
# X -> V;Y -> V
# U -> V;A->B->U;C->B }
Up_G(U) = set of all nodes "upstream" from U, including U
G^U = graph induced by Up_G(U)
i.e., just the local game matrices in all nodes above U. But we need
to change the game matrix for U: it expects an input from V. Fix a
value for it (i.e., marginalize V's input out of U's matrix).
So now we've created a new smaller graphical game. It has a Nash
equilibrium (by Nash's theorem). Call this an "upstream Nash
equilibrium from U".