Markov switching dynamic autoregression

Hi all,

I have been trying to build a Markov switching dynamic autoregression model in pymc3, but I am stuck as I am unfamiliar with theano. Here is a model similar in theme to what I want, in statsmodels:
http://www.statsmodels.org/stable/examples/notebooks/generated/markov_regression.html?highlight=markov%20switching

I have created and run hidden Markov models in pymc3 based upon similar models:

but the definition of a time-dependent transition matrix with tensors stumps me. Can someone please give me a hint as to how to proceed with this?

Thanks,
Anurag

I have a regime-switching model implementation you can have a look at: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/Ports/Regime-switching%20models%20in%20PyMC3.ipynb


Thanks junpenglao.

This was really helpful. I could have built the non-theano version by myself, but it was harder for me to see how to do it with theano, so thank you.

My question is about how to modify the code when the matrix P is not constant but also changes with time. As a theano newbie, I find it hard to see how that can be done with the scan function. Could you give a hint?

Thanks,
Anurag

Depending on how it changes with time - you can either move P from non_sequences to outputs_info (it will get modified inside ft_xit_dt and become one of the outputs, if the change over time is deterministic), or move it into sequences like Eta, i.e. a theano array of the same length that you construct outside of the scan.
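
For concreteness, here is a minimal, self-contained sketch of the second option (passing the time-varying matrix through sequences). The toy 2-state step function and the names eta, P_t, p11, p22 are illustrative stand-ins, not the notebook's actual code:

import numpy as np
import theano
import theano.tensor as tt

T = 50
eta = tt.as_tensor_variable(np.random.rand(T, 2))        # toy per-step emission likelihoods
p11 = tt.as_tensor_variable(np.full(T, 0.9))             # regime-1 persistence (constant here)
p22 = 1.0 - tt.exp(-6.0 + 0.05 * tt.arange(T) / 20.0)    # regime-2 persistence, varying with t
P_t = tt.stack([tt.stack([p11, 1 - p11], axis=-1),
                tt.stack([1 - p22, p22], axis=-1)], axis=1)   # shape (T, 2, 2)

def ft_xit_dt(Eta, P, ft, Xi):
    # one forward-filtering step of a 2-state switching model
    xit0 = tt.stack([Xi, 1 - Xi])
    ft = tt.sum(tt.dot(xit0, P) * Eta)
    Xi1 = (P[0, 0] * Xi + P[1, 0] * (1 - Xi)) * Eta[0] / ft
    return [ft, Xi1]

# P_t has the same leading (time) dimension as eta, so scan walks over both together
([ft, xi], updates) = theano.scan(ft_xit_dt,
                                  sequences=[eta, P_t],
                                  outputs_info=[tt.constant(0.0), tt.constant(0.5)])

Inside a pm.Model, p11 and p22 would come from the random variables rather than constants, and xi_init would replace the 0.5 placeholder in outputs_info.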


Thanks Junpenglao. That worked. Just for posterity, I am pasting the main modifications to the code here:

def ft_xit_dt(Eta, p_curr, ft, Xi):
    P = tt.stack(tt.stack([p_curr[0], 1-p_curr[0]]), tt.stack([1-p_curr[1], p_curr[1]]))

    Xi_ = tt.shape_padleft(Xi)
    xit0 = tt.stack([Xi_, 1 - Xi_], axis=1).T
    ft = tt.sum(tt.dot(xit0 * P, Eta))
    Xi1 = (P[0, 0] * Xi + P[1, 0] * (1 - Xi)) * Eta[0] / ft
    return [ft, Xi1]

and

with pm.Model() as m:
    # Transition matrix
    BoundedNormal = pm.Bound(pm.Normal, lower=-30, upper=0)
    BoundedNormal2 = pm.Bound(pm.Normal, lower=0, upper=0.5)
    logalpha = BoundedNormal('logalpha', mu=0, sd=10)
    beta =  BoundedNormal2('beta', mu=0, sd=0.5)
    p2 = pm.Beta('p2', alpha=10., beta=2.)
    p_time = []
    for i in range(1000):
        p1 = (1-tt.exp(logalpha + beta*i/20))
        p_curr = tt.stack([p2, p1])
        p_time.append(p_curr)
    p = tt.stack(p_time, axis=0)   
    # eta
    alpha1 = pm.Normal('alpha1', mu=0., sd=1)
    alpha2 = pm.HalfNormal('alpha2',sd=1)
    sigma = pm.HalfCauchy('sigma', beta=1., shape=2)
    eta1 = tt.zeros(yshared.shape)
    eta1 = tt.exp(pm.Normal.dist(mu=alpha1, sd=sigma[0]).logp(yshared))

    eta2 = tt.zeros_like(eta1)
    eta2 = tt.set_subtensor(eta2[0], 1)
    eta2 = tt.set_subtensor(eta2[1:], tt.exp(
        pm.Normal.dist(mu=alpha2 + yshared[:-1], sd=sigma[1]).logp(yshared[1:])))

    eta = tt.stack([eta1, eta2], axis=1)
    
    xi_init = pm.Beta('xi_init', alpha=2., beta=2.)
    ft_out = theano.shared(0.) # place holder
    ([ft, xi], updates) = theano.scan(ft_xit_dt,
                                      sequences=[eta, p],
                                      outputs_info=[ft_out, xi_init])

    Xi = pm.Deterministic('Xi', xi)
    # likelihood `target += sum(log(f))`
    pm.Potential('likelihood', tt.sum(tt.log(ft)))

This works pretty well for a single emission sequence.

However, I have the following issue:
I am trying to train on 300-400 sequences with a maximum of around 300 emissions per sequence. I am able to build the model in the same way, but it takes a huge amount of time to start up and the RAM requirements are huge too.

Is this expected? Is there any way around this?

Thanks again,
Anurag


It is somewhat expected because of the use of a Python for-loop to construct p - I think for performance you would need to rewrite it as a theano.scan as well.
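
As a rough illustration (not the exact model code; the constants below are placeholders for the pm random variables p2, logalpha and beta), building p with theano.scan could look something like this:

import theano
import theano.tensor as tt

# stand-ins for the model's random variables
p2 = tt.constant(0.8)
logalpha = tt.constant(-6.0)
beta = tt.constant(0.05)

def p_step(t, p2, logalpha, beta):
    # one row of the time-varying transition probabilities
    p1 = 1 - tt.exp(logalpha + beta * t / 20)
    return tt.stack([p2, p1])

# p has shape (1000, 2) and is built inside the graph instead of with a Python loop
p, _ = theano.scan(p_step,
                   sequences=tt.arange(1000),
                   non_sequences=[p2, logalpha, beta])

A fully vectorized construction (tt.arange plus broadcasting, with no scan at all) would likely be faster still, but either way the graph size stops growing with the number of time steps.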

Thanks Junpeng. Using theano.scan to build the p matrix reduced the memory footprint massively (by roughly 60%) and the computation time as well (by about a third). Now I am wondering whether there is a better way to handle multiple chains (in the most general case, chain lengths can vary too), so I am putting my code here:

def ft_xit_dt(Eta, p_curr, ft, Xi):
    P = tt.stack(tt.stack([p_curr[0], p_curr[1], p_curr[2]]),
                 tt.stack([p_curr[3], p_curr[4], p_curr[5]]),
                 tt.stack([p_curr[6], p_curr[7], p_curr[8]]))
    P = tt.set_subtensor(P[0, 0], 1 - P[0, 1] - P[0, 2])
    Xi1 = (P[0, 0] * Xi[0] + P[1, 0] * Xi[1] + P[2, 0] * Xi[2]) * Eta[0]
    Xi2 = (P[0, 1] * Xi[0] + P[1, 1] * Xi[1] + P[2, 1] * Xi[2]) * Eta[1]
    Xi3 = (P[0, 2] * Xi[0] + P[1, 2] * Xi[1] + P[2, 2] * Xi[2]) * Eta[2]
    ft = Xi1 + Xi2 + Xi3
    Xi_ = tt.stack([Xi1, Xi2, Xi3])/ft
    return [ft, Xi_]


def step(m_row, cumulative_product):
    return m_row * cumulative_product

and

with pm.Model() as m:
    # Transition matrix
    BoundedNormal = pm.Bound(pm.Normal, lower=-30, upper=0)
    BoundedNormal2 = pm.Bound(pm.Normal, lower=0, upper=0.5)
    logalpha = BoundedNormal('logalpha', mu=0, sd=10)
    beta = BoundedNormal2('beta', mu=0, sd=0.5)

    BoundedNormal3 = pm.Bound(pm.Normal, lower=0, upper=0.1)
    p1 = BoundedNormal3('p1', sd=1)
    p2 = pm.Dirichlet('p2', a=np.array([1, 10, 1]))
    p3 = pm.Dirichlet('p3', a=np.array([1, 1, 10]))

    # eta
    alpha = pm.Normal('alpha', mu=0., sd=1, shape=3)
    sigma = pm.HalfCauchy('sigma', beta=1., shape=3)
    xi_init = np.array([0.7, 0.3, 0])
    # pm.Dirichlet('xi_init', a=np.array([3, 1, 0]))
    potential = tt.dscalar('potential')
    potential = 0
    for (i, curr_row) in df.iterrows():
        data = curr_row[curr_row.notna()].values
        yshared = theano.shared(data)
        n = data.shape[0]
        eta1 = tt.zeros_like(yshared)
        eta1 = tt.set_subtensor(eta1[0], tt.exp(pm.Normal.dist(
            mu=alpha[0], sd=sigma[0]).logp(yshared[0])))
        eta1 = tt.set_subtensor(eta1[1:],
                                tt.exp(pm.Normal.dist(
                                    mu=alpha[0] + yshared[:-1], sd=sigma[0])
                                       .logp(yshared[1:])))
        eta2 = tt.zeros_like(eta1)
        eta2 = tt.set_subtensor(eta2[0], tt.exp(pm.Normal.dist(
            mu=alpha[1], sd=sigma[1]).logp(yshared[0])))
        eta2 = tt.set_subtensor(eta2[1:], tt.exp(
            pm.Normal.dist(mu=alpha[1] + yshared[:-1],
                           sd=sigma[1]).logp(yshared[1:])))
        eta3 = tt.zeros_like(eta1)
        eta3 = tt.set_subtensor(eta3[0], tt.exp(pm.Normal.dist(
            mu=alpha[2], sd=sigma[1]).logp(yshared[0])))
        eta3 = tt.set_subtensor(eta3[1:], tt.exp(
            pm.Normal.dist(mu=alpha[2] + yshared[:-1],
                           sd=sigma[2]).logp(yshared[1:])))
        eta = tt.stack([eta1, eta2, eta3], axis=1)
        M1 = tt.stack([tt.ones_like(eta1)*tt.exp(beta/4),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1),
                       tt.ones_like(eta1)
                       ],
                      axis=1)
        M1 = tt.set_subtensor(M1[0, 2], tt.exp(logalpha))
        M1 = tt.set_subtensor(M1[0, 1], p1)
        M1 = tt.set_subtensor(M1[0, 3:6], p2[:])
        M1 = tt.set_subtensor(M1[0, 6:9], p3[:])
        s = np.ones(9)
        p, updates = theano.scan(fn=step,
                                 sequences=[M1],
                                 outputs_info=[s])
        ft_out = theano.shared(0.)
        ([ft, xi], updates) = theano.scan(ft_xit_dt,
                                          sequences=[eta, p],
                                          outputs_info=[ft_out, xi_init])

       
        potential = potential + tt.sum(tt.log(ft))

    pm.Potential('likelihood', potential)

Thanks,
Anurag


in the most general case, chain lengths can vary too

This is a bit difficult, but maybe you can make a batched version of the theano.scan and pad all the sequences to the same length using NaNs? Then you can get rid of the for-loop as well.
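
Not code from this thread, just a small numpy sketch of the padding idea: ragged sequences are stacked into one rectangular array with NaN in the padded tail, plus a boolean mask that a batched likelihood could use to ignore the padded steps.

import numpy as np

# toy sequences of different lengths
seqs = [np.random.randn(n) for n in (300, 120, 250)]
T = max(len(s) for s in seqs)

# one (n_sequences, T) array, NaN-padded on the right
batch = np.full((len(seqs), T), np.nan)
for i, s in enumerate(seqs):
    batch[i, :len(s)] = s

# True where there is real data; padded steps can be zeroed out of the log-likelihood
mask = ~np.isnan(batch)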

How do you make pymc3 ignore the NaNs in this case?

I am having problems with a nested theano scan to do this. I have tried to debug it but have not been too successful, so I was hoping you could help me:

def likelihood_per_individual(ix, eta_curr, p_curr, ft_prev, xi_prev):
    xi_curr = eta_curr_per_ind * (tt.dot(p_curr_per_ind.T, xi_prev[0:2]))
    ft_curr = tt.sum(xi_curr)
    xi_curr = xi_curr/ft_curr
    return [ft_curr, xi_curr]

and:

def likelihood(eta_curr, p_curr, ft, xi):
    ([ft_, Xi_], updates) = theano.scan(fn=likelihood_per_individual,
                                        sequences=tt.arange(eta_curr.shape[0]),
                                        non_sequences=[eta_curr, p_curr],
                                        outputs_info=[ft_k_out, xi_k_out])
    #likeli
    return [ft_, Xi_]

along with:

([Ft, Xi], updates) = theano.scan(likelihood,
                                  sequences=[eta_, p_],
                                  outputs_info=[ft_out, xi_out])
likelihood_ = theano.function(inputs=[eta_, p_, ft_out, xi_out],
                              outputs=[Ft, Xi], updates=updates)

gives the following error message:

MissingInputError: Input 5 of the graph (indices start from 0), used to compute for{cpu,scan_fn}(Elemwise{minimum,no_inplace}.0, Subtensor{:int64:}.0, Subtensor{:int64:}.0, IncSubtensor{Set;:int64:}.0, IncSubtensor{Set;:int64:}.0, ft_k_out, xi_k_out), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.

Backtrace when that variable is created:

File “/home/anurag/pymc3/lib/python3.6/site-packages/IPython/core/interactiveshell.py”, line 2845, in _run_cell
return runner(coro)
File “/home/anurag/pymc3/lib/python3.6/site-packages/IPython/core/async_helpers.py”, line 67, in _pseudo_sync_runner
coro.send(None)
File “/home/anurag/pymc3/lib/python3.6/site-packages/IPython/core/interactiveshell.py”, line 3020, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File “/home/anurag/pymc3/lib/python3.6/site-packages/IPython/core/interactiveshell.py”, line 3185, in run_ast_nodes
if (yield from self.run_code(code, result)):
File “/home/anurag/pymc3/lib/python3.6/site-packages/IPython/core/interactiveshell.py”, line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File “”, line 7, in
outputs_info=[ft_out, xi_out])
File “/home/anurag/pymc3/lib/python3.6/site-packages/theano/scan_module/scan.py”, line 774, in scan
condition, outputs, updates = scan_utils.get_updates_and_outputs(fn(*args))
File “”, line 6, in likelihood
ft_k_out = tt.dscalar(“ft_k_out”)

Could you please help me figure this out?

Thanks,
anurag

Do you have a notebook with data somewhere?

It is committed here:

pinging @junpenglao. Any recommendations?

Sorry, I don't have much time this week, but from a quick glance, you should not pass a theano function to tt.scan:

    ([ft_, Xi_], updates) = theano.scan(fn=likelihood_per_individual,
                                        sequences=tt.arange(eta_curr.shape[0]),
                                        non_sequences=[eta_curr, p_curr],
                                        outputs_info=[ft_k_out, xi_k_out])

Instead, likelihood_per_individual should be a Python function that calls theano.scan internally.
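
A toy, self-contained sketch of that pattern (the shapes and names here are made up, not the model's): the step function handed to the outer scan is an ordinary Python function, and it builds the inner theano.scan when the graph is constructed.

import numpy as np
import theano
import theano.tensor as tt

X = tt.as_tensor_variable(np.random.rand(4, 3))   # outer axis: time, inner axis: individuals

def outer_step(x_t, acc):
    # plain Python function: build an inner scan over the individuals at time t
    def inner_step(x_ti, running_sum):
        return running_sum + x_ti
    inner_sums, _ = theano.scan(inner_step,
                                sequences=x_t,
                                outputs_info=tt.constant(0.0))
    return acc + inner_sums[-1]

totals, _ = theano.scan(outer_step,
                        sequences=X,
                        outputs_info=tt.constant(0.0))
print(theano.function([], totals)())   # cumulative row sums, one value per time step

Note that the inner scan's initial state comes from a variable that exists in scope (here a constant) rather than a fresh tt.dscalar created inside the function, which appears to be what triggered the MissingInputError above.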


Hi @junpenglao

I tried that at:

But it is still running into errors. They are related to passing updates from the inner scan to the outer scan, but I cannot understand theano's error message well enough to debug it. What would you recommend, based on the equivalent numpy version of the function in cells 12 through 17?

Thanks again,
Anurag

Hi Anurag,
In general, the key to turning a for-loop into theano.scan (or tf.while_loop in TensorFlow) is to keep the functional form in mind - you need to be able to express the body of the for-loop as one single function. We start by looking at the numpy version:

import numpy as np
import matplotlib.pyplot as plt

eta0 = np.random.rand(100, 1000, 3)

def ftfunc(P_curr, xi, eta):
    xi2 = eta * (np.dot(P_curr.T, xi[0:2]))
    return xi2

p = np.zeros((100, 1000, 2, 3))
for i in range(100):
    for j in range(1000):
        p[i, j, 0, 0] = 0.95 - np.exp(-6) * np.exp(0.05*j/20)
        p[i, j, 0, 1] = 0.05
        p[i, j, 0, 2] = np.exp(-6) * np.exp(0.05*j/20)
        p[i, j, 1, 0] = 0.01
        p[i, j, 1, 1] = 0.94
        p[i, j, 1, 2] = 0.05

ft2 = np.zeros((100, 1000))
xi2 = np.zeros((100, 1000, 3))


for i in range(100):
    xi2[i, 0] = ftfunc(p[i, 0], np.asarray([1, 0, 0]), eta0[i, 0])

ft2[:, 0] = xi2[:, 0].sum(axis=1)

xi2[:, 0, 0] = xi2[:, 0, 0]/ft2[:, 0]
xi2[:, 0, 1] = xi2[:, 0, 1]/ft2[:, 0]
xi2[:, 0, 2] = xi2[:, 0, 2]/ft2[:, 0]


for i in range(100):
    for j in range(1, 1000):
        P_curr = p[i, j]
        Eta = eta0[i, j]
        Xi_ = xi2[i, j - 1]
        xi2[i, j] = ftfunc(P_curr, Xi_, Eta)
        ft2[:, j] = xi2[:, j].sum(axis=1)
        xi2[:, j, 0] = xi2[:, j, 0]/ft2[:, j]
        xi2[:, j, 1] = xi2[:, j, 1]/ft2[:, j]
        xi2[:, j, 2] = xi2[:, j, 2]/ft2[:, j]

plt.plot(xi2[:, 99, :]);

A couple of things immediately jump out at me:

  1. the generation of the time-varying transition matrix P and the initial latent state xi2 should both be folded into the nested for-loop. Likely we don't need to construct P explicitly;
  2. ordering of the nested for-loop: since we can think of this as 100 copies of the time series, and the time dimension is the one that has to go through theano.scan, it is best to vectorize ftfunc and keep only one loop, which will then be translated into a scan:
ft2_new = np.zeros((1000, 100, 1))
xi2_new = np.zeros((1000, 100, 3))

eta0_ = np.transpose(eta0, [1, 0, -1])

def generate_p(j):
    p = np.zeros((2, 3))
    p[0, 0] = 0.95 - np.exp(-6) * np.exp(0.05*j/20)
    p[0, 1] = 0.05
    p[0, 2] = np.exp(-6) * np.exp(0.05*j/20)
    p[1, 0] = 0.01
    p[1, 1] = 0.94
    p[1, 2] = 0.05
    return p

def ftfunc_batch(P_curr, xi, eta):
    # shape changes here is ((3, 2), (j, 1, 2)) --> (j, 3, 2) --> (j, 3)
    xi2 = eta * np.sum(P_curr.T * xi[:, np.newaxis, :2], axis=-1)
    return xi2

# run once so we have the initial state
p_curr = generate_p(0)
xi2_j = ftfunc_batch(p_curr, np.asarray([[1, 0, 0]]), eta0_[0])
ft2_new[0] = np.sum(xi2_j, axis=-1, keepdims=True)
xi2_new[0] = xi2_j / ft2_new[0]

for j in range(1, 1000):
    last_states = generate_p(j), xi2_new[j-1], eta0_[j]
    xi_new = ftfunc_batch(*last_states)
    ft2_new[j] = np.sum(xi_new, axis=-1, keepdims=True)
    xi2_new[j] = xi_new / ft2_new[j]

np.testing.assert_array_almost_equal(xi2_new[j], xi2[:, j])
plt.plot(xi2_new[99]);

Now, recall that the function signature of theano.scan is:

The general order of function parameters to fn is:

sequences (if any), prior result(s) (if needed), non-sequences (if any)

Thus, we can further restructure the for-loop into:

ft2_new = np.zeros((1000, 100, 1))
xi2_new = np.zeros((1000, 100, 3))

eta0_ = np.transpose(eta0, [1, 0, -1])

def generate_p(j):
    p = np.zeros((2, 3))
    p[0, 0] = 0.95 - np.exp(-6) * np.exp(0.05*j/20)
    p[0, 1] = 0.05
    p[0, 2] = np.exp(-6) * np.exp(0.05*j/20)
    p[1, 0] = 0.01
    p[1, 1] = 0.94
    p[1, 2] = 0.05
    return p

def ftfunc_batch(eta, P_curr, xi, fi):
    # eta, P_curr is sequence
    # xi, fi is previous result
    # shape changes here is ((3, 2), (j, 1, 2)) --> (j, 3, 2) --> (j, 3)
    xi2 = eta * np.sum(P_curr.T * xi[:, np.newaxis, :2], axis=-1)
    fi2 = np.sum(xi2, axis=-1, keepdims=True)
    return xi2 / fi2, fi2

# run once so we have the initial state
p_curr = generate_p(0)
xi2_new[0], ft2_new[0] = ftfunc_batch(
    eta0_[0], p_curr, np.asarray([[1, 0, 0]]), None)


for j in range(1, 1000):
    sequence_input = eta0_[j], generate_p(j)
    prior_results = xi2_new[j-1], ft2_new[j-1]
    
    xi2_new[j], ft2_new[j] = ftfunc_batch(*sequence_input, *prior_results)

np.testing.assert_array_almost_equal(xi2_new[j], xi2[:, j])
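
For what it is worth, here is one way (my own sketch, not part of the original reply) that this restructured loop could then be written as a single theano.scan, reusing eta0 and generate_p from the code above:

import numpy as np
import theano
import theano.tensor as tt

# same toy data and generate_p as in the numpy code above
eta_seq = tt.as_tensor_variable(np.transpose(eta0, [1, 0, 2]))                  # (1000, 100, 3)
p_seq = tt.as_tensor_variable(np.stack([generate_p(j) for j in range(1000)]))   # (1000, 2, 3)

def ftfunc_scan(eta_t, p_t, xi_tm1, ft_tm1):
    # eta_t: (100, 3), p_t: (2, 3), xi_tm1: (100, 3), ft_tm1: (100, 1)
    xi = eta_t * tt.sum(p_t.T * xi_tm1[:, :2].dimshuffle(0, 'x', 1), axis=-1)
    ft = tt.sum(xi, axis=-1, keepdims=True)
    return xi / ft, ft

xi0 = tt.as_tensor_variable(np.tile([1., 0., 0.], (100, 1)))   # (100, 3) initial state
ft0 = tt.as_tensor_variable(np.ones((100, 1)))                 # unused by the step itself
([xi_all, ft_all], updates) = theano.scan(ftfunc_scan,
                                          sequences=[eta_seq, p_seq],
                                          outputs_info=[xi0, ft0])
# tt.sum(tt.log(ft_all)) would then go into pm.Potential, as in the single-sequence model

The run-once initialisation in the numpy version is absorbed into outputs_info: the first scan iteration applies the same update to xi0.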

Hope this gives you some inspiration to solve your problem.
