Understanding Graphs and Hidden Markov Models

By Shilpa Ananth

Note: A lot of the content here is based on my learnings from Bishop's Pattern Recognition and Machine Learning, lectures by Jeff Miller (highly recommended), and several other resources linked in the references below.

A graph comprises nodes (also called vertices) connected by links (also known as edges or arcs). In a probabilistic graphical model, each node represents a random variable (or a random vector), and the links express probabilistic relationships between these variables. The graph captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors, each depending only on a subset of the variables.

Section 1: "Bayesian" networks / Directed Graphical models/ and Joint PDFs

“Bayesian” networks are useful tools to visualize random variables and their relationships as a graph. They are useful because (a) the factorization of the probability distribution into conditionals can be visualized, and (b) the conditional independence properties become more evident. Visualizing the models also helps us make simplified inferences by leveraging the conditional independence properties of the graph.

Section 1.1: Graph factorization

Note 1: Shaded random variables indicate conditioning (these variables are observed). RVs that do not have observed values are often called latent variables or hidden variables.

  1. $p(a,b,c) = p(c|a,b)p(a,b) = p(c|a,b)p(a|b)p(b)$
  2. $p(a,b,c,d) = p(d|a,b,c)p(c|a,b)p(b|a)p(a)$
💡

For a fully connected DAG with K random variables, we define

$$p(x_1, x_2, \ldots, x_K) = p(x_K|x_1, x_2, \ldots, x_{K-1}) \cdots p(x_2|x_1)p(x_1)$$

In the absence of certain links, the distribution is modified accordingly. For the graph shown in Figure 3 above, the joint distribution is given by

$$p(x_1, \ldots, x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5)$$

In general, given an ordered DAG on K vertices, the JPDF can be written as $p(\textbf{x}) = \prod_{i=1}^{K} p(x_i|\mathrm{pa}_i)$, where $\mathrm{pa}_i$ denotes the parents of $x_i$, subject to the constraint that there are no directed cycles.

Note 2: In the JPDF factorization above, we can also express the whole JPDF in terms of the JPDF of a sub-graph.

$$p(x_1, \ldots, x_7) = p(x_6|x_4) p(x_7|x_4, x_5) p(x_1, x_2, x_3, x_4, x_5)$$

Note 3: For any given graph, the set of distributions that satisfy the graph will include any distributions that have additional independence properties beyond those described by the graph. For instance, a fully factorized distribution will always pass through the filter implied by any graph over the corresponding set of variables.
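To make the factorization concrete, here is a minimal sketch (made-up binary variables and randomly generated conditional probability tables, so all the numbers are hypothetical) that evaluates the joint of the seven-variable example above as a product of its factors and checks that it sums to one:

```python
import numpy as np

# Evaluate p(x1,...,x7) for the example graph as a product of its factors:
# p(x1) p(x2) p(x3) p(x4|x1,x2,x3) p(x5|x1,x3) p(x6|x4) p(x7|x4,x5)
rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Random table of p(x=1 | parents) for a binary node, indexed by parent values."""
    return rng.uniform(size=(2,) * n_parents)

p1, p2, p3 = 0.3, 0.6, 0.5        # root priors p(x_i = 1), made-up numbers
cpt4 = random_cpt(3)              # p(x4=1 | x1, x2, x3)
cpt5 = random_cpt(2)              # p(x5=1 | x1, x3)
cpt6 = random_cpt(1)              # p(x6=1 | x4)
cpt7 = random_cpt(2)              # p(x7=1 | x4, x5)

def bern(p, x):
    """p(x) for a binary variable with p(x=1) = p."""
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4, x5, x6, x7):
    return (bern(p1, x1) * bern(p2, x2) * bern(p3, x3)
            * bern(cpt4[x1, x2, x3], x4)
            * bern(cpt5[x1, x3], x5)
            * bern(cpt6[x4], x6)
            * bern(cpt7[x4, x5], x7))

# Sanity check: the factors define a valid joint distribution (sums to 1).
total = sum(joint(*bits) for bits in np.ndindex(*(2,) * 7))
print(round(total, 6))  # 1.0
```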

Section 1.2: Some Practical Visualization Examples

Polynomial regression:

The random variables in this model are the vector of polynomial coefficients $\textbf{w}$ and the observed data $\textbf{t} = (t_1, \ldots, t_N)$. In addition, this model contains the input data $(x_1, \ldots, x_N)$, the noise variance $\sigma^2$, and the hyperparameter $\alpha$ representing the precision of the Gaussian prior over $\textbf{w}$, all of which are parameters of the model rather than random variables.

The figure below shows the simple graph structure over the random variables, the plate diagram, and the fully descriptive graph structure including the deterministic parameters. The inputs and parameters are drawn without circles around them; note that $x_n$ is not a random variable. The repeated variables are condensed into a ‘plate’.
The joint distribution $p(\textbf{t},\textbf{w})$ given the parameters of polynomial regression is given by

$$p(\textbf{t}, \textbf{w} | \textbf{x}, \alpha, \sigma^2) = p(\textbf{w}|\alpha) \prod_{n=1}^{N} p(t_n|\textbf{w}, x_n, \sigma^2)$$

Note that the value of $\textbf{w}$ is not observed, and so $\textbf{w}$ is an example of a latent variable, also known as a hidden variable. When the $t$ values are observed, we can calculate the posterior $p(\textbf{w}|\textbf{t})$, which is proportional to $p(\textbf{w},\textbf{t})$:

$$p(\textbf{w}|\textbf{t}) = \frac{p(\textbf{t},\textbf{w})}{p(\textbf{t})}$$
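As a concrete illustration of this graphical model, here is a minimal sketch of the generative process it describes (a toy setup of my own, not Bishop's exact example): sample $\textbf{w}$ from its Gaussian prior with precision $\alpha$, then sample each $t_n$ around the polynomial value at the deterministic input $x_n$ with noise variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

N, degree = 10, 3
alpha, sigma2 = 2.0, 0.05                        # prior precision, noise variance (made up)
x = np.linspace(0, 1, N)                         # deterministic inputs (no circles in the graph)
Phi = np.vander(x, degree + 1, increasing=True)  # phi(x_n) = [1, x_n, x_n^2, x_n^3]

w = rng.normal(0.0, np.sqrt(1.0 / alpha), size=degree + 1)  # latent variable w ~ N(0, alpha^{-1} I)
t = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=N)      # observed targets t_1..t_N

print(w.round(3))
print(t.round(3))
```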

Section 1.3: Conditional Independence

Given three nodes a, b, c of a graphical model, a is said to be conditionally independent of b given c if $p(a|b,c) = p(a|c)$, or equivalently $p(a,b|c) = p(a|c)p(b|c)$. To interpret conditional independence from graphical models we use a concept called d-separation.

  1. When the node c between a and b is tail-to-tail connected, it blocks the path when c is observed, making the variables a and b conditionally independent:

    $$p(a,b|c) = \frac{p(a,b,c)}{p(c)} = \frac{p(c)p(a|c)p(b|c)}{p(c)} = p(a|c)p(b|c) \;\to\; a \perp \!\!\! \perp b | c$$
  2. When the node c between a and b is head-to-tail connected, it blocks the path when c is observed, making the variables a and b conditionally independent:

    $$p(a,b|c) = \frac{p(a,b,c)}{p(c)} = \frac{p(a)p(c|a)p(b|c)}{p(c)} = \frac{p(a)\frac{p(a,c)}{p(a)}p(b|c)}{p(c)} = \frac{p(a|c)p(c)p(b|c)}{p(c)} = p(a|c)p(b|c)$$
    Therefore, $a \perp \!\!\! \perp b | c$.
  3. When the node c between a and b is head-to-head connected, it opens the path when c (or any of its descendants) is observed, making the variables a and b conditionally dependent.

In a DAG, if all the paths from A to B are blocked given C, A is said to be d-separated from B given C, and it satisfies the condition $A \perp \!\!\! \perp B | C$. The Markov blanket of a node $x_i$ comprises the set of parents, children and co-parents of the node. We can think of the Markov blanket of a node $x_i$ as being the minimal set of nodes that isolates $x_i$ from the rest of the graph. Note that it is not sufficient to include only the parents and children of node $x_i$, because the phenomenon of explaining away means that observations of the child nodes will not block paths to the co-parents.
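Here is a minimal numeric check of the head-to-head ("explaining away") case, using a made-up example: two independent binary causes $a$ and $b$, and a child $c$ that is a noisy OR of them. Marginally $a$ and $b$ are independent, but conditioning on $c$ makes them dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

a = rng.random(n) < 0.3                  # p(a=1) = 0.3
b = rng.random(n) < 0.4                  # p(b=1) = 0.4, independent of a
c = (a | b) & (rng.random(n) < 0.9)      # noisy-OR child of a and b

# Marginally: p(a=1 | b=1) is roughly p(a=1)  ->  a and b independent
print(a[b].mean(), a.mean())

# Conditioned on c=1: p(a=1 | b=1, c=1) differs from p(a=1 | c=1)  ->  dependent given c
print(a[b & c].mean(), a[c].mean())
```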

Section 1.4: d-separation

Consider 3 disjoint sets of vertices (A, B, C) in a DAG. A ‘path’ is a sequence of vertices such that there is an edge between each pair of consecutive vertices. A path between A and B is said to be “blocked” with respect to C if
(a) there exists a head-to-tail or tail-to-tail vertex in that path that belongs to C; or
(b) there exists a head-to-head vertex in that path where that vertex (and all of its descendants) are not in C.

A and B are d-separated by C if all paths between A and B are ‘blocked’ by C; in this case $A\perp\!\!\!\perp B | C$ (i.e., when C is observed).

Section 2: Hidden Markov Models

Section 2.1: Markov chains

In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of events in which the probability of each event depends only on the state attained in the previous event. These Markov chains can be continuous-time or discrete-time; a continuous-time Markov chain is often referred to as a “Markov process”.

Through the graph and the d-separation properties described above, we see that once a particular node is observed, the next node becomes independent of all earlier states. The joint PDF can be written as $p(\textbf{x}) = p(x_N|x_{N-1})p(x_{N-1}|x_{N-2})\cdots p(x_2|x_1)p(x_1)$. The value of $p(x_N|x_{N-1})$ can be given by a simple state-transition probability matrix (if discrete) or a more complex PDF (e.g., Gaussian) if continuous.

Some examples of Markov chains include (a) weather prediction: the weather tomorrow depends only on today's weather, not on the weather from two days ago; (b) the next position in a dice-based board game, which depends only on the current position; (c) stock price movements (continuous); (d) Brownian motion; etc. A minimal simulation of the weather example is sketched below.
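Here is a minimal sketch of the weather example as a two-state discrete-time Markov chain (the transition probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# States: 0 = "sunny", 1 = "rainy".
# T[i, j] = p(next state = j | current state = i)
T = np.array([[0.8, 0.2],    # sunny -> sunny / rainy
              [0.4, 0.6]])   # rainy -> sunny / rainy
p0 = np.array([0.5, 0.5])    # initial state distribution

states = [rng.choice(2, p=p0)]
for _ in range(9):
    states.append(rng.choice(2, p=T[states[-1]]))  # next state depends only on the current one

print(states)
```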

As an extension, a second order Markov chain is one that depends on the previous two observations instead of just the previous observation.

Section 2.2: Hidden Markov Models

Markov models make the assumption that all variables of the process are evident to us, or “observed”. However, in many scenarios there is a latent (unobserved/hidden) variable that also affects the current observation. In the weather example, consider a latent variable such as the intrinsic “season” that may also influence the next day's weather along with the current day's. The key idea here is that the hidden variables follow a first-order Markov process. Notice that no observation is d-separated from any other observation: there is always an unblocked path between any two observations because all observations are connected via the latent variables. But each observation is conditionally independent of every other observation given the value of its associated latent variable.

Note 4: The observed variables may be discrete or continuous, and the latent variables are usually discrete. In many cases, the number of states of these latent variables may not be evident.

The observed variables are denoted by $x_k \in \mathbb{R}$ (or $\mathbb{R}^d$) and the latent variables are denoted by $z_k \in \{1, 2, \ldots, m\}$.

  1. The joint PDF is given by $p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1)p(x_1|z_1) \prod_{k=2}^{N} p(z_k|z_{k-1})p(x_k|z_k)$.
  2. The values $p(x_k|z_k)$ are called emission probabilities, governed by $\epsilon_i(x_k) = p(x_k|z_k=i)$ where $\epsilon_i$ is a probability distribution on $x$ (a PDF or a PMF).
  3. Notice the JPDF of the latent variables, $p(\textbf{z}) = p(z_1)\prod_{n=2}^{N}p(z_n|z_{n-1})$. The values $p(z_k|z_{k-1})$ are called transition probabilities, governed by a transition matrix $T$ where $T(i, j)=p(z_k=j|z_{k-1}=i)$. Additionally, $p(z_1)$ is called the initial state distribution. A concrete parameterization is sketched below.
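Here is a minimal sketch (with made-up numbers) of how these pieces fit together for a discrete HMM with $m = 2$ hidden states and 3 observation symbols: an initial distribution, a transition matrix, an emission matrix, and a sequence sampled from the joint PDF in (1):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.6, 0.4])            # p(z_1), initial state distribution
T = np.array([[0.7, 0.3],            # T[i, j] = p(z_k = j | z_{k-1} = i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],       # E[i, v] = p(x_k = v | z_k = i), emission probabilities
              [0.1, 0.3, 0.6]])

# Sample (z_1..z_N, x_1..x_N) from the joint PDF.
N = 8
z = [rng.choice(2, p=pi)]
x = [rng.choice(3, p=E[z[-1]])]
for _ in range(N - 1):
    z.append(rng.choice(2, p=T[z[-1]]))   # hidden states follow a first-order Markov chain
    x.append(rng.choice(3, p=E[z[-1]]))   # each observation depends only on its hidden state

print(z)
print(x)
```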

Section 2.3: The Forward Backward algorithm

A key tool for working with HMMs is the forward-backward algorithm, which uses dynamic programming. Assume that $p(x_k|z_k)$, $p(z_k|z_{k-1})$, and $p(z_1)$ are known; we want to compute $p(z_k|\textbf{x})$. We know that the joint PDF is given by (1) above.

$$p(z_k|\textbf{x}) \propto p(\textbf{x}, z_k)$$

$$p(z_k|\textbf{x}) \propto p(x_{1:k}, x_{k+1:N}, z_k)$$

$$p(z_k|\textbf{x}) \propto p(x_{k+1:N}|z_k, x_{1:k})\, p(z_k, x_{1:k})$$
(using $P(A, B) = P(A|B)P(B)$)

Given $z_k$, $x_{k+1:N}$ is independent of $x_{1:k}$, since the two are d-separated by $z_k$:

$$p(z_k|\textbf{x}) \propto \underbrace{p(x_{k+1:N}|z_k)}_{\text{backward}}\; \underbrace{p(z_k, x_{1:k})}_{\text{forward}}$$

Using this estimate for $p(z_k|\textbf{x})$, we can (a) answer inference queries about the hidden states, (b) sample from the posterior, or (c) use it within the Baum-Welch algorithm for estimating the HMM parameters.

Forward algorithm

The forward algorithm is used to compute $\alpha(z_k) = p(z_k, x_{1:k})$. To derive a recursion, we introduce $z_{k-1}$ to bring in the transition probabilities and split off $x_k$ to bring in the emission probabilities.

$$\alpha(z_k) = \sum\limits_{z_{k-1}=1}^{M} p(z_{k-1},z_k, x_{1:k})$$
(introduce and marginalize over $z_{k-1}$, which has M possible states)

$$\alpha(z_k)= \sum\limits_{z_{k-1}=1}^{M} p(z_{k-1},z_k, x_k, x_{1:k-1})$$
(splitting $x_{1:k}$)

$$\alpha(z_k)= \sum\limits_{z_{k-1}=1}^{M} p(x_k|z_k, \cancel{z_{k-1}}, \cancel{x_{1:k-1}})\, p(z_k, z_{k-1}, x_{1:k-1})$$
(using $P(A, B) = P(A|B)P(B)$)

$$\alpha(z_k)= \sum\limits_{z_{k-1}=1}^{M} p(x_k|z_k)\, p(z_k | z_{k-1}, \cancel{x_{1:k-1}})\, p(z_{k-1}, x_{1:k-1})$$
(using $P(A, B) = P(A|B)P(B)$)

💡

$$\alpha(z_k)= \sum\limits_{z_{k-1}=1}^{M} p(x_k|z_k)\, p(z_k | z_{k-1})\, \alpha(z_{k-1}) \quad \text{for } k = 2, 3, \ldots, N$$

$$\alpha(z_1) = p(z_1, x_1) = p(x_1|z_1)p(z_1)$$

With this DP recursion, the cost of computing all $\alpha(z_k)$ is M (for each possible state of $z_k$) × M (the summation over $z_{k-1}$ in the equation) × N (the number of observations), i.e. $\Theta(M^2N)$; naive marginalization over all state sequences would instead be exponential in N.
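Here is a minimal sketch of the forward recursion for a discrete HMM, reusing the made-up parameters from the earlier sketch (in practice each $\alpha$ row is normalized, or the computation is done in log space, to avoid underflow):

```python
import numpy as np

def forward(x, pi, T, E):
    """alpha[k, i] = p(z_k = i, x_{1:k}) for a discrete HMM."""
    N, M = len(x), len(pi)
    alpha = np.zeros((N, M))
    alpha[0] = E[:, x[0]] * pi                    # alpha(z_1) = p(x_1|z_1) p(z_1)
    for k in range(1, N):
        # alpha(z_k) = p(x_k|z_k) * sum_{z_{k-1}} p(z_k|z_{k-1}) alpha(z_{k-1})
        alpha[k] = E[:, x[k]] * (alpha[k - 1] @ T)
    return alpha

pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 1, 2, 2, 1]

alpha = forward(x, pi, T, E)
print(alpha[-1].sum())   # summing the last row gives the marginal likelihood p(x_{1:N})
```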

Backward algorithm

$$\beta(z_k)= p(x_{k+1:N}|z_k)$$

$$\beta(z_k) = \sum\limits_{z_{k+1}=1}^{M} p(x_{k+1:N}, z_{k+1}|z_k)$$

$$\beta(z_k) = \sum\limits_{z_{k+1}=1}^{M} p(x_{k+1}, x_{k+2:N}, z_{k+1}|z_k)$$

$$\beta(z_k) = \sum\limits_{z_{k+1}=1}^{M} p(x_{k+2:N}| z_{k+1}, \cancel{z_k}, \cancel{x_{k+1}})\, p(x_{k+1}| z_{k+1}, \cancel{z_k})\, p(z_{k+1}|z_k)$$

(using the chain rule $P(A, B, C | D) = P(A | B, C, D) P(B | C, D) P(C | D)$, with terms cancelled due to d-separation)

💡

$$\beta(z_k) = \sum\limits_{z_{k+1}=1}^{M} \beta(z_{k+1})\, p(x_{k+1}| z_{k+1})\, p(z_{k+1}|z_k) \quad \text{for } k = 1, 2, \ldots, N-1$$

$$\beta(z_N) = 1$$
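Putting the two passes together, here is a minimal sketch of the full forward-backward computation of $p(z_k|\textbf{x})$ for every $k$, again with the same made-up parameters (and again ignoring numerical underflow for brevity):

```python
import numpy as np

pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 1, 2, 2, 1]
N, M = len(x), len(pi)

# Forward pass (as derived above): alpha[k, i] = p(z_k = i, x_{1:k}).
alpha = np.zeros((N, M))
alpha[0] = E[:, x[0]] * pi
for k in range(1, N):
    alpha[k] = E[:, x[k]] * (alpha[k - 1] @ T)

# Backward pass: beta(z_N) = 1, then
# beta(z_k) = sum_{z_{k+1}} beta(z_{k+1}) p(x_{k+1}|z_{k+1}) p(z_{k+1}|z_k).
beta = np.zeros((N, M))
beta[-1] = 1.0
for k in range(N - 2, -1, -1):
    beta[k] = T @ (E[:, x[k + 1]] * beta[k + 1])

posterior = alpha * beta                              # proportional to p(z_k | x_{1:N})
posterior /= posterior.sum(axis=1, keepdims=True)     # normalize each k
print(posterior.round(3))
```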

Section 2.4: The Viterbi algorithm

The Viterbi algorithm is used to determine the most likely sequence (path) of hidden states given the entire sequence of observations. Let $\textbf{z}^*$ denote the most likely sequence of hidden states $\textbf{z} = (z_1, z_2, \ldots, z_N)$, given all the observations $\textbf{x} = (x_1, x_2, \ldots, x_N)$.

To compute $\textbf{z}^*$, we need to find $\argmax\limits_{\textbf{z}} p(\textbf{z}|\textbf{x})$:

$$\argmax\limits_{\textbf{z}} p(\textbf{z}|\textbf{x}) = \argmax\limits_{\textbf{z}} \frac{p(\textbf{z},\textbf{x})}{p(\textbf{x})} = \argmax\limits_{\textbf{z}} p(\textbf{z},\textbf{x}) = \argmax\limits_{z_{1:N}} p(z_{1:N},x_{1:N})$$

To break this down recursively (so that we can use dynamic programming), we define the quantity

$$V_k(z_k)= \max\limits_{z_{1:k-1}} p(z_{1:k},x_{1:k})$$

(Note that $V_k(z_k)$ is the maximized value, not the maximizing argument; the maximizing states are tracked separately so that the path can be recovered at the end.)

$$V_k(z_k)= \max\limits_{z_{1:k-1}} p(x_k, z_k, x_{1:k-1}, z_{1:k-1})$$

$$V_k(z_k)= \max\limits_{z_{1:k-1}} p(x_k, z_k | x_{1:k-1}, z_{1:k-1})\, p(x_{1:k-1}, z_{1:k-1})$$
(using $P(A, B) = P(A|B)P(B)$)

$$V_k(z_k)= \max\limits_{z_{1:k-1}} p(x_k| z_k, \cancel{x_{1:k-1}}, \cancel{z_{1:k-1}})\, p( z_k | \cancel{x_{1:k-1}}, z_{1:k-1})\, p(x_{1:k-1}, z_{1:k-1})$$

(using the chain rule $P(A, B | C) = P(A | B, C) P(B | C)$, with terms cancelled due to the Markov assumption)

$$V_k(z_k)= \max\limits_{z_{1:k-1}} p(x_k| z_k)\, p( z_k |z_{k-1})\, p(x_{1:k-1}, z_{1:k-1})$$

$$V_k(z_k)= \max\limits_{z_{k-1}} p(x_k| z_k)\, p( z_k |z_{k-1}) \max\limits_{z_{1:k-2}} p(x_{1:k-1}, z_{1:k-1})$$

💡

$$V_k(z_k)= \max\limits_{z_{k-1}} p(x_k| z_k)\, p( z_k |z_{k-1})\, V_{k-1}(z_{k-1})$$
where
$$V_1(z_1) = p(x_1|z_1)p(z_1)$$

Finally, to link back, $\max\limits_{z_{1:N}} p(z_{1:N},x_{1:N}) = \max\limits_{z_N} V_N(z_N)$. The most likely path $\textbf{z}^*$ is recovered by storing, at each step, the value of $z_{k-1}$ that achieves the maximum, and then back-tracking from $z_N^* = \argmax\limits_{z_N} V_N(z_N)$.
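Here is a minimal sketch of the Viterbi recursion in log space, with back-pointers to recover the most likely path, using the same made-up parameters as the earlier sketches:

```python
import numpy as np

pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
x = [0, 1, 2, 2, 1]
N, M = len(x), len(pi)

logV = np.zeros((N, M))
back = np.zeros((N, M), dtype=int)             # back-pointers: best z_{k-1} for each z_k
logV[0] = np.log(E[:, x[0]]) + np.log(pi)      # V_1(z_1) = p(x_1|z_1) p(z_1)
for k in range(1, N):
    # scores[i, j] = log[ p(z_k = j | z_{k-1} = i) * V_{k-1}(z_{k-1} = i) ]
    scores = np.log(T) + logV[k - 1][:, None]
    back[k] = scores.argmax(axis=0)
    logV[k] = np.log(E[:, x[k]]) + scores.max(axis=0)

# Back-track from argmax_{z_N} V_N(z_N) to recover the most likely path z*.
z_star = [int(logV[-1].argmax())]
for k in range(N - 1, 0, -1):
    z_star.append(int(back[k][z_star[-1]]))
z_star.reverse()
print(z_star)
```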

More on applications and practical guides in another post!


[1] Bishop - Pattern Recognition And Machine Learning

[2] 🔗 YouTube - Pattern Recognition & Machine Learning

[3] 🔗 Hidden Markov Models - Gregory Gundersen

[4] 🔗 Viterbi Algorithm for HMM - Medium

Tags: ML, ML Theory