Learning Poisson state space model parameters for a large number of samples

I am trying to fit the state space model below:

\begin{align} X(t) &= AX(t-1) + BU(t) + \epsilon_s(t) \\ y(t) &\sim \text{Poisson}(C\exp(X(t))) \\ \epsilon_s(t) &\sim N(0,\sigma_s^2) \end{align}
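To make the generative process concrete, here is a minimal NumPy simulation sketch (a single series for simplicity; the dimensions and parameter values are placeholders, not my actual setup):

import numpy as np

# placeholder dimensions and parameters, only to illustrate the generative process
rng = np.random.default_rng(0)
latent_var_size, num_features, N = 2, 3, 100

A = 0.9 * np.eye(latent_var_size)                      # state transition matrix
B = rng.normal(size=(latent_var_size, num_features))   # exogenous loadings
C = np.abs(rng.normal(size=(1, latent_var_size)))      # observation loadings
sigma_s = 0.1

U = rng.normal(size=(N, num_features))                 # known exogenous inputs
X = np.zeros((N, latent_var_size))
y = np.zeros(N, dtype=int)
for t in range(1, N):
    X[t] = A @ X[t - 1] + B @ U[t] + rng.normal(scale=sigma_s, size=latent_var_size)
    y[t] = rng.poisson(lam=float(C @ np.exp(X[t])))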

And below is the PyMC3 model I wrote for it:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:

    # Prior on the state transition matrix A
    mu1 = np.zeros(latent_var_size)
    sd_dist1 = pm.Exponential.dist(1.0, shape=latent_var_size)
    chol1, corr1, stds1 = pm.LKJCholeskyCov('chol_cov1', n=latent_var_size, eta=2,
                                             sd_dist=sd_dist1, compute_corr=True)
    A = pm.MvNormal('A', mu=mu1, chol=chol1, shape=(latent_var_size, latent_var_size))

    # Prior on the exogenous-input matrix B
    mu2 = np.zeros(num_features)
    sd_dist2 = pm.Exponential.dist(1.0, shape=num_features)
    chol2, corr2, stds2 = pm.LKJCholeskyCov('chol_cov2', n=num_features, eta=2,
                                             sd_dist=sd_dist2, compute_corr=True)
    B = pm.MvNormal('B', mu=mu2, chol=chol2, shape=(latent_var_size, num_features))

    # Latent states
    x_init = pm.Normal('x_init', mu=2, sigma=2, shape=(num_samples, 1, latent_var_size))
    Tau = pm.Gamma('tau', alpha=1e-5, beta=1e-5)
    X = StateSpaceModel('x', A=A, B=B, u=u, tau=Tau, x0=x_init, exo=True,
                        shape=(num_samples, N, 2))

    # Observation model
    C = pm.HalfNormal('C', sigma=0.2, shape=(latent_var_size, 1))
    Tau_o = pm.Gamma('tau_o', alpha=1e-5, beta=1e-5)

    Y = pm.Poisson('y', mu=tt.tensordot(pm.math.exp(X), C, axes=1).T,
                   observed=y.T, shape=(num_samples, N))



with model:
    inference = pm.ADVI()
    tracker = pm.callbacks.Tracker(
        mean=inference.approx.mean.eval,  # callable that returns the mean
        std=inference.approx.std.eval,    # callable that returns the std
    )
    approx = pm.fit(n=100000, method=inference, callbacks=[tracker])

traceTi = approx.sample(5000)

Below is the state space model class

from pymc3 import Normal, Flat, Continuous
import theano.tensor as T


class StateSpaceModel(Continuous):
    """Gaussian linear state space innovations distribution, optionally with exogenous inputs."""

    def __init__(self, tau=None, sd=None, A=None, B=None, u=None, x0=None,
                 exo=None, init=Flat.dist(), *args, **kwargs):
        super(StateSpaceModel, self).__init__(*args, **kwargs)
        self.tau = tau
        self.sd = sd
        self.A = A      # state transition matrix
        self.B = B      # exogenous input matrix
        self.u = u      # exogenous inputs
        self.init = init
        self.mean = x0
        self.exo = exo

    def logp(self, x):
        tau = self.tau
        sd = self.sd
        A = self.A
        B = self.B
        u = self.u
        init = self.init
        # x has shape (num_samples, time, latent); shift by one step in time
        x_im1 = x[:, :-1, :]
        x_i = x[:, 1:, :]
        if self.exo:
            u_im1 = u[:, :-1, :]
            innov_like = Normal.dist(mu=T.tensordot(A, x_im1.T, axes=1) + T.tensordot(B, u_im1.T, axes=1),
                                     tau=tau, sd=sd).logp(x_i.T)
        else:
            innov_like = Normal.dist(mu=T.tensordot(A, x_im1.T, axes=1), tau=tau, sd=sd).logp(x_i.T)
        return T.sum(init.logp(x[:, 0, :])) + T.sum(innov_like)

It takes more than one hour to run the above model with 500 samples and 100 time points. I am not sure why it is taking so much time. Even after the model completes the learning process, the parameters have very large standard deviations. Any suggestions on what is wrong with the above model?

Time series models are hard to fit, especially if the state space is large. I have a package for modeling linear state spaces in PyMC here. It might not work in your case, but maybe it will give you some ideas? For my part, I’m confused by your model because I don’t see recursion anywhere. Do you have a citation for the way you’re computing the likelihood of the process? It doesn’t strike me as correct.

I used A state space model distribution for pymc3 · GitHub as the base code to build on. The innov_like is evaluated at the current time point conditional on the previous time point, which is why there are two different x variables (x_im1 and x_i). Does that make sense? I have been following your GitHub, but it was not immediately clear how to modify your code to include a Poisson observation model.

Your original approach might be right; I’ve just never seen it coded that way. It seems to me like it’s not propagating uncertainty about each state into the future correctly. A priori, I expected something like this, using a scan to construct the latent states. That’s what the pymc-statespace package does (though it runs a special algorithm to try to compute the likelihood faster; it’s unclear whether that’s still true following some of the recent upgrades to scan likelihood inference).
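Roughly the shape of what I had in mind, as an untested sketch (the dimensions, transition matrix, priors, and dummy data are all placeholders):

import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from pymc.pytensorf import collect_default_updates

T, k = 100, 2                              # placeholder: time steps, latent dimension
data = np.random.poisson(lam=1.0, size=T)  # stand-in for the observed counts

with pm.Model() as sketch:
    A = pt.as_tensor(np.array([[1.0, 1.0],
                               [0.0, 1.0]]))   # local-level transition block
    sigmas_Q = pm.HalfNormal("sigmas_Q", sigma=1.0, size=k)

    def step(x, A, sigmas_Q):
        # drawing the innovation inside the step propagates state uncertainty forward
        innov = pm.Normal.dist(mu=0.0, sigma=sigmas_Q)
        next_x = A @ x + innov
        return next_x, collect_default_updates(inputs=[x, A, sigmas_Q], outputs=[next_x])

    hidden_states_pt, updates = pytensor.scan(step,
                                              outputs_info=pt.zeros(k),
                                              non_sequences=[A, sigmas_Q],
                                              n_steps=T,
                                              strict=True)

    sketch.register_rv(hidden_states_pt, name="hidden_states", initval=pt.zeros((T, k)))
    obs = pm.Poisson("obs", mu=pt.exp(hidden_states_pt[:, 0]), observed=data)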

Thanks a lot, this is really useful. I spent some time getting it to run; apparently there is some issue with the numpy and PyMC versions. I am also wondering how the model runs: if I look at the lines below, the Poisson uses the ground-truth hidden_states, not hidden_states_pt.

mod.register_rv(hidden_states_pt, name='hidden_states', initval=pt.zeros((T, 2)))
obs = pm.Poisson('obs', pt.exp(hidden_states[:, 0]), observed=data)

Also, I am trying to convert it to the case where we have more than one sample:

import numpy as np
import matplotlib.pyplot as plt

T = 100
num_samples = 200
num_features = 10
latent_variables = 2

# Local level model
A = np.array([[1, 1],
              [0, 1]])

Z = np.array([[1, 0]])
R = np.eye(2)

true_Q = np.diag(np.random.normal(scale=1e-2, size=2) ** 2)
hidden_states = np.zeros((num_samples, T, 2))
# hidden_states[0] = true_x0

data = np.zeros((num_samples, T))

fig, ax = plt.subplots()
for t in range(1, T):
    innov_t = np.random.multivariate_normal(mean=np.zeros(2), cov=true_Q)
    hidden_states[:, t, :] = hidden_states[:, t - 1, :] @ A + R @ innov_t
    data_mu = np.exp(hidden_states[:, t, :] @ Z.T)
    data[:, t] = np.squeeze(np.random.poisson(lam=data_mu))

ax.plot(data[2, :]);

import pymc as pm
import pytensor
import pytensor.tensor as pt
from pymc.pytensorf import collect_default_updates

with pm.Model() as mod:
    A = pt.as_tensor(A)
    R = pt.eye(2)
    Z = pt.as_tensor(Z)

    sigmas_Q = pm.HalfNormal('sigmas_Q', sigma=1000, size=2)
    Q = pt.diag(sigmas_Q)
    x0 = pt.zeros((num_samples, 2))

    def step(x, A, R, Z, Q):
        innov = pm.MvNormal.dist(mu=0, tau=Q)
        next_x = x.dot(A) + innov
        return next_x, collect_default_updates([x, A, R, Z, Q], [next_x])

    hidden_states_pt, updates = pytensor.scan(step,
                                              outputs_info=x0,
                                              non_sequences=[A, R, Z, Q],
                                              n_steps=T,
                                              strict=True)

    mod.register_rv(hidden_states_pt, name='hidden_states', initval=pt.zeros((num_samples, T, 2)))
    obs = pm.Poisson('obs', np.squeeze(pt.exp(hidden_states_pt[:, :, 0])), observed=data.T)

There is an assertion error ("assert x.ndim == 2", raised from "return Apply(self, ..., [x.type()])"). I am not sure what the issue is in the multivariate case. I am new to PyMC and trying to learn it. Any help is greatly appreciated. Thanks a lot again!

I guess at this point it would be good if you could go into more detail about your exact model equations? You started from:

\begin{align} X_{t+1} &\sim N(\mu_t, Q) \\ \mu_t &= A X_t \\ y_t &\sim \text{Poisson}(\lambda_t) \\ \lambda_t &= C \exp(X_t) \end{align}

I made some small changes from your original equation, because I don’t make a distinction between X and U. They would just be concatenated together into what I call X, while the matrices you called A and B would be block-concatenated into what I called A. I made some simple choices for A, C, and Q, but they’re not necessarily appropriate at all. Actually, you can accomplish what I did much more easily just using X_t \sim \text{GaussianRandomWalk}(\cdot).
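For example, something along these lines (untested; T, the priors, and the dummy data are placeholders):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

T = 100
data = np.random.poisson(lam=1.0, size=T)   # stand-in for the observed counts

with pm.Model() as grw_model:
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    # latent log-intensity follows a Gaussian random walk
    log_lam = pm.GaussianRandomWalk("log_lam", sigma=sigma, shape=T)
    obs = pm.Poisson("obs", mu=pt.exp(log_lam), observed=data)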

So what kind of dynamics are you modeling? ARIMA? Seasonal cycles? Random walk with drift? Exponential smoothing?

As for the multivariate case, the original model is already multivariate. You shouldn’t be using a 3d tensor as the observed data; you just want to add more columns to X, y, or both.

Thanks for your reply. Below is the model I am trying to build:

\begin{align} X_{i,t+1} &\sim N(\mu_{i,t}, Q) \\ \mu_{i,t} &= AX_{i,t} + BU_{i} \\ y_{i,t} &\sim \text{Poisson}(\lambda_{i,t}) \\ \lambda_{i,t} &= C \exp(X_{i,t}) \end{align}

So U is an exogenous variable that is known and does not change with time, and X is the latent state, which is unknown. Some entries of A are known and will model, for example, a level or seasonality, but others will be estimated. We assume that we don’t know everything in A and that we don’t know anything about B. So we want to estimate X, A, B, and C.

So we are modeling some seasonal cycles as well as a level trend and the effect of exogenous variables.
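To make that concrete, this is roughly what I have in mind for A and B (just a sketch; the sizes and which entries of A are fixed versus free are placeholder choices):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

latent_variables = 2    # placeholder sizes
num_features = 10

with pm.Model() as structure_sketch:
    # known structural part of A, e.g. a local level + trend block
    A_fixed = pt.as_tensor(np.array([[1.0, 1.0],
                                     [0.0, 1.0]]))
    # one free entry to be estimated, placed at an assumed position (1, 1)
    a_free = pm.Normal("a_free", mu=1.0, sigma=0.1)
    A = pt.set_subtensor(A_fixed[1, 1], a_free)

    # B is completely unknown, so every entry gets a prior
    B = pm.Normal("B", mu=0.0, sigma=1.0, shape=(latent_variables, num_features))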

Also, I am not using a 3d tensor as the observed data; the shape of the observed data is (number of samples, number of time points), which is two-dimensional, but I am still not sure why it does not work.

Also, can you please elaborate on the code below?

mod.register_rv(hidden_states_pt, name='hidden_states', initval=pt.zeros((T, 2)))
obs = pm.Poisson('obs', pt.exp(hidden_states[:, 0]), observed=data)

How does the Poisson know that it should be generated from hidden_states_pt? In the code above the observation mean is built from hidden_states[:, 0], which is the ground truth, and I am not sure it should be part of the model.

Yea it’s a typo, no wonder my results didn’t make sense :>

Edit: I updated the gist to fix that error.

I added an example of multiple observed states to that gist by following your code. Everything has to do with the structure of the A, Q, and Z matrices. This is why I asked what kinds of dynamics you are looking to model. I did the example as a collection of 10 independent local level processes. Generally, though, you could have dependencies between the variables. You almost certainly should not be estimating a dense A matrix; it needs to have some sort of structure.

The C matrix is also not usually estimated; it’s most often just an indicator matrix that selects the observed states. The logic is that anything you want to include in the C matrix could instead be incorporated into the A matrix.
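To give a flavor of what I mean by structure, here is a plain numpy sketch of the matrices for k independent local-level processes (nothing in it is estimated; k is a placeholder):

import numpy as np
from scipy.linalg import block_diag

k = 10  # number of independent local-level processes

# each process carries a (level, trend) pair, so the full transition matrix is
# block-diagonal with k copies of the 2x2 local-level block
A_block = np.array([[1.0, 1.0],
                    [0.0, 1.0]])
A = block_diag(*[A_block] * k)                   # shape (2k, 2k)

# Z is just an indicator matrix that picks out the level of each process
Z = np.kron(np.eye(k), np.array([[1.0, 0.0]]))   # shape (k, 2k)

# Q: independent state innovations, one variance per state
Q = np.diag(np.full(2 * k, 1e-4))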

I’d be happy to help if I can with all of that but I’d need to know more about the specific problem.

Thanks a lot! This is really helpful. I was thinking of adding some sort of sparsity and structure to A, but I am not fully sure. I am trying to implement the problem in https://arxiv.org/pdf/1311.7261.pdf. In this paper, they do estimate C. What you are saying makes sense; there should be a lot of structure in the model.

Thanks for sharing the paper; it would have been nice to start with that. I think it would be much clearer to dispense with the general notation and directly implement that model, starting with their simplest specification. Much of what they present consists of closed-form results that you don’t need. I think you can write out equations (1) and (2) inside the step function and you should be good to go; no matrices or state-space machinery needed.

Thanks for the suggestions. I tried running your model and it took 20 minutes. I am not sure why the sampler is so slow. I tried using ADVI:

with mod:
    inference = pm.ADVI()
    tracker = pm.callbacks.Tracker(
        mean=inference.approx.mean.eval,  # callable that returns the mean
        std=inference.approx.std.eval,    # callable that returns the std
    )
    approx = pm.fit(1000, method=inference, obj_n_mc=5,
                    obj_optimizer=pm.adagrad(learning_rate=1.), callbacks=[tracker])
idata = approx.sample(5000)

The loss was decreasing exceptionally slowly or stayed constant for more than 10 minutes. I would expect ADVI to run fast and at least produce a solution. Do you know why the analysis is slow? I am using a 16-CPU machine with parallelism.

Recursive models are among the slowest in PyMC, especially when linear algebra operations are involved. It’s just the way it is for now. There might be a way to implement it in vector operations only; that’s why I wish we had started from there. I feel guilty for leading you down a needlessly general (and therefore needlessly difficult) path.

For example, looking at equation (2) from the paper:

\theta_t = \frac{\theta_{t-1}}{\gamma}\varepsilon_t

This is a Gaussian Random Walk in logs, i.e.:
\log \theta_t = \psi + \log \theta_{t-1} + \eta_t

where \psi = -\log \gamma and \eta_t = \log \varepsilon_t.

So in PyMC, you could replace all the scanning with (all untested code):

gamma = pm.Beta('gamma', alpha=1, beta=1)   # flat Beta(1, 1) prior as a placeholder
psi = -pt.log(gamma)

# cumulative sum of (drift + innovation) terms makes it a GRW with drift
log_theta = (psi + pm.Normal('log_theta', size=T)).cumsum()

The only hiccup is figuring out what to do with the innovations. The paper links the shape parameter \alpha to the predicted number of defaults in the previous period: \alpha_{t+1} = \gamma \alpha_t + N_{t-1} (hidden under equation 7). That would require a scan, but it might still be faster than what you’re doing now because everything else is vectorized. Another option would be to abandon this dependency and just model \alpha in logs as a Gaussian random walk. A third option would be to abandon the time series component in the shape parameters altogether and assume that everything to do with time enters through \theta. I’d personally start with the third option. In that case, you could just write:

epsilon = pm.Beta('epsilon', alpha=1, beta=1, size=T)   # placeholder Beta(1, 1) prior
theta = pm.Deterministic('theta', pt.exp(log_theta) * epsilon)
N = pm.Poisson('N', mu=theta, observed=data)

I’d probably start from there and see where it got me.

Thanks a lot for your suggestion. I spent some time working on the code and trying to make it work with ADVI. The code below runs fairly fast and with reasonable accuracy.

import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from pymc.pytensorf import collect_default_updates

with pm.Model() as mod:

    # Priors for the observation loadings Z and the exogenous loadings B
    mu1 = np.zeros(latent_variables)
    sd_dist1 = pm.Exponential.dist(1.0, shape=latent_variables)
    chol1, corr1, stds1 = pm.LKJCholeskyCov('chol_cov1', n=latent_variables, eta=2,
                                             sd_dist=sd_dist1, compute_corr=True)
    pred_Z = pm.MvNormal('pred_Z', mu=mu1, chol=chol1, shape=(latent_variables))

    mu2 = np.zeros(num_features)
    sd_dist2 = pm.Exponential.dist(1.0, shape=num_features)
    chol2, corr2, stds2 = pm.LKJCholeskyCov('chol_cov2', n=num_features, eta=2,
                                             sd_dist=sd_dist2, compute_corr=True)
    pred_B = pm.MvNormal('B', mu=mu2, chol=chol2, shape=(num_features))

    # Known (structured) transition matrix
    A_pt = pt.as_tensor_variable(A_block)

    # State innovation covariance, shared across all samples
    sigmas_Q = pm.HalfNormal('sigmas_Q', sigma=1, shape=(latent_variables))
    Q = pt.diag(sigmas_Q)

    u = pm.MutableData("input", u_scale)
    data = pm.MutableData("data", data_train)
    samples = u.shape[0]
    x0 = pt.zeros((num_samples, latent_variables))

    def step(x, A, Q, samples):
        innov = pm.MvNormal.dist(mu=0, tau=Q, size=(samples))
        next_x = pt.nlinalg.matrix_dot(x, A) + innov
        return next_x, collect_default_updates([x, A, Q, samples], [next_x])

    hidden_states_pt, updates = pytensor.scan(step,
                                              outputs_info=[x0],
                                              non_sequences=[A_pt, Q, samples],
                                              n_steps=T,
                                              strict=True)

    mod.register_rv(hidden_states_pt, name='hidden_states',
                    initval=pt.zeros((T, samples, latent_variables)))

    # Observation model: exogenous contribution plus the selected latent states
    temp1 = pt.transpose(pt.nlinalg.matrix_dot(u, pred_B))
    lambdas = pt.exp(pt.squeeze(pt.nlinalg.matrix_dot(hidden_states_pt, pred_Z)) + temp1)

    obs = pm.Poisson('obs', lambdas, observed=data)

I reduced the size of Q, since all the samples share the same Q, and made some changes to make the code more interpretable. Let me know what you think about the code.

I have implemented the model in the paper, but it does not contain a state space model with exogenous variables, which is why I am working on the structure below:

\begin{align} X_{i,t+1} &\sim N(\mu_{i,t}, Q) \\ \mu_{i,t} &= AX_{i,t} + BU_{i} \\ y_{i,t} &\sim \text{Poisson}(\lambda_{i,t}) \\ \lambda_{i,t} &= C \exp(X_{i,t}) \end{align}

where i indexes the sample and U is the exogenous variables. This is the most basic version that I have to get working; then I have to keep improving the model, because some exogenous variables will be time-dependent and some time-independent, and so on. The idea of sharing the paper was to show where the above state space model came from.

I have a question regarding out-of-sample inference. Let’s say I have obtained a trace using training data, and I have some test data with known U variables. Using these U variables, I want predictions of \lambda. I was thinking of using posterior predictive sampling, but that won’t work, since the hidden states X need to be estimated again for each new sample. What would be the best way to get an estimate of \lambda for a new sample U? How would sampling work in this case? I can probably post this as a separate question.