# Time-series data generation in PyMC3

Hi. I am trying to model the evolution of an object through several discrete states.

The amount of time that the object spends in each state, "tau", is a stochastic variable parameterized by the "state" variable. A state transition cannot happen while "tau" is greater than the time "t"; in other words, while t < tau, the transition probability is 0. Once t > tau (i.e. the allocated time for the current state is up), a Bernoulli process determined by the transition probability "p" moves the object into one of two new states. In the new state, a new time allocation "tau" for that state is simulated, and the whole process continues until t_max is reached. The ultimate goal is to visualize the dependency of the state on time (plot the object's state as a time series).

I would like to start by defining a deterministic time variable "t", initialized to 0 and incremented by 1 until it reaches t_max. However, my intuition tells me that this is precisely the wrong approach to take in PyMC. I was also unable to figure out how to create an upper bound, t_max, for such a variable. Anyway, here is my feeble attempt at "t":

```python
init = 1  # flag that signals t = 0
t = pm.Deterministic("t", pm.math.switch(pm.math.neq(init, 1), t + 1, 0))
init = 0
```

Assuming I can get past the definition of "t", I am going to have a circular dependency problem next. The stochastic state-duration variable "tau" is parameterized by the "state" variable. But the state variable, in turn, depends on "tau" via the transition probability "p", which changes as "t", "tau", and "state" change.

To recap:

- "t" is just a time-step number
- "tau" is a stochastic variable dependent on "state"
- "p" is determined by "t", "tau", and "state"
- "state" is determined by "p"

I understand that the process does not quite make sense the way I am trying to define it above. Perhaps the "state" variable should be lagged, i.e. a distinction should be made between the "last" (or "current") state and a "new" state? How would I program something like that in PyMC3?

Any advice is greatly appreciated.

I saw references to “jump processes” and “Markov jump processes” in finance. I think that may be what I have.

I think such a process could be described by the regime-switching model mentioned here: Stochastic Volatility with Jump Diffusion?. The original model description in Stan: http://modernstatisticalworkflow.blogspot.ch/2018/02/regime-switching-models-in-stan.html. It is defined in PyMC3 here: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/Ports/Regime-switching%20models%20in%20PyMC3.ipynb.

Do you think I am on the right track?

Sorry, I am not sure because I have just started learning to use PyMC3.

I think I misread your post. I think you mean the following.

We have an observable trajectory of states from time t = 1 to time t = T.
Each state X is associated with a distinct latent variable \tau_X.

Transition of the states follows the following steps.

1. Get current state (let’s say A)
2. Wait for \tau_A time steps. During this wait time, the state cannot transition to another state.
3. Transition to another state, following the transition matrix.
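The three steps above could be sketched in plain NumPy (a hedged illustration; the state names, lognormal waiting-time parameters, and transition probabilities below are made-up placeholders, not values from this thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three states, lognormal waiting times per state,
# and transitions that never return to the current state.
states = ["A", "B", "C"]
tau_params = {"A": (1.0, 0.3), "B": (0.5, 0.2), "C": (1.5, 0.4)}  # (mu, sigma)
trans = {
    "A": (["B", "C"], [0.7, 0.3]),
    "B": (["A", "C"], [0.4, 0.6]),
    "C": (["A", "B"], [0.5, 0.5]),
}

def simulate(t_max, start="A"):
    """Simulate one trajectory of states over t_max discrete time steps."""
    traj = []
    state = start
    while len(traj) < t_max:
        mu, sigma = tau_params[state]
        tau = max(1, int(round(rng.lognormal(mu, sigma))))  # step 2: draw wait time
        traj.extend([state] * tau)                          # wait tau steps
        nxt, probs = trans[state]
        state = rng.choice(nxt, p=probs)                    # step 3: jump
    return traj[:t_max]

trajectory = simulate(t_max=50)
```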

I am not sure about your goal.

Do you mean you want to find the distributions \tau_X for each state X?

In step 3, do you allow self-transition? Can state A transition to state A?

Apologies, I probably made it more complicated than it needs to be in my original post.

My goal is to simulate the object's trajectory through states X from t = 0 to t = T, using the knowledge of X at t = 0, τ(x), and the X transition probabilities. I want to get the trajectory estimate with a confidence interval around it. I try to achieve this by looping over the steps 1-3 you describe until t = T, or until a terminal state is reached (a state with known infinite τ and a transition probability of 0).

Self-transition in step 3 is not allowed. One of two new states (both distinct from A) is permitted.

I have τ(x) distribution parameter estimates obtained from (τ, X) value-pair data. So, to perform step 2, I just look up those parameters for state A and tell PyMC to use them to get τ(A).
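For example, with the fitted parameters stored per state (the numbers below are hypothetical placeholders), step 2 reduces to a lookup plus a lognormal draw:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical (mu, sigma) estimates fitted from the (tau, X) pair data
fitted = {"A": {"mu": 1.2, "sigma": 0.4}, "B": {"mu": 0.8, "sigma": 0.5}}

p = fitted["A"]
tau_A = rng.lognormal(mean=p["mu"], sigma=p["sigma"])  # duration for state A
```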

I may want to extend the model later to determine additional variables of interest, dependent on t, τ, and X.

For each state x, is \tau(x) a number instead of a probability distribution?

If I understand correctly, I think your goal is:

1. simulate one trajectory of states.
2. simulate probability distribution of all possible trajectories of states over time
3. NOT fitting the parameters of the model with existing trajectories of states.

(1) seems easy. You probably don't even need PyMC3 to do that.

You can solve (2) by brute force if your number of states is low: simulate lots of (1) and then get the distribution. You don't have to simulate over infinite time, because the whole model becomes stationary over time; the confidence interval won't change after a certain number of time steps.
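That brute-force idea might look like the following sketch, with states encoded as integers 0-2 so per-time-step statistics are easy to compute (the waiting-time and transition draws are placeholders for the real steps 1-3):

```python
import numpy as np

rng = np.random.default_rng(1)
t_max, n_sims = 50, 2000

def one_trajectory():
    """One simulated run; replace the draws with the real model's."""
    traj, state = [], 0
    while len(traj) < t_max:
        tau = max(1, int(rng.lognormal(1.0, 0.3)))               # state duration
        traj.extend([state] * tau)                               # wait tau steps
        state = rng.choice([s for s in range(3) if s != state])  # jump, no self-transition
    return traj[:t_max]

runs = np.array([one_trajectory() for _ in range(n_sims)])

# Empirical per-time-step distribution over states: freq[s, t] is the
# fraction of simulations that are in state s at time t
freq = np.stack([(runs == s).mean(axis=0) for s in range(3)])
lo, hi = np.percentile(runs, [2.5, 97.5], axis=0)  # crude interval on the state code
```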

You need PyMC3 to do (3).

For non-terminal X, τ(x) is lognormally distributed with X-specific mu and sigma. For terminal states, τ(x) is infinite.

Yes! My goals are (1) and (2). (3) is tricky, because I would want to do that, but I don’t have trajectory data at the moment. I don’t have the raw data that was used to fit τ(x) distributions either, just the fitted distribution parameter estimates (mu and sigma for each non-terminal X). I probably can get trajectory data later on though and check the validity of my τ(x) parameter estimates and transition probability matrix.

I was going to code (1) and (2) in numpy/pandas, but figured might as well start in PyMC3. I want to extend the PyMC model later to determine additional variables that are dependent on X, t, τ and some stochastic variables, which I do have the data to fit. I guess I should do (1) and (2) in numpy / pandas first anyway. Then I will have a benchmark for PyMC, or another such framework.

Thank you! You helped me quite a bit.

I haven’t tried simulation with pymc3.

The random variables in PyMC3 have a random() method: https://docs.pymc.io/developer_guide.html

Thank you. I updated the thread title to what I think more accurately reflects its topic.

I think this pull request was intended to provide the data generation functionality: https://github.com/pymc-devs/pymc3/pull/2983. It appears to be active, but not in a stable release yet. Maybe someone can shed more light on this.

You should start by taking a look at the timeseries distributions and examples. In particular, the stochastic volatility model is a good place to start.

The time-series distributions implemented in PyMC3 only have their logp method and not random. This means that they cannot do conventional forward sampling like the rest. This is because in most time-series scenarios one has observations up until a certain time and wants to predict what will happen later. That is handled with the infrastructure used to deal with missing data. What one has to do first is set up the observed data: make an array, set the values of the first part to the true observed data, and mask the remaining part. By doing this, when one calls sample, you infer the model's parameters and at the same time predict the future out-of-sample observations.

So my suggestion is to first look at what is readily available in PyMC3 and its examples, see if you can phrase your model with that, and if you can't, then we can try to help you express a time-series model for your needs. But the real blocker for time-series data generation is that we have to implement the random methods for all time-series distributions.

Thanks @lucianopaz. I ended up doing my data generation outside of PyMC, as I did not need any parameter fitting in this case. However, I used PyMC to model time-series data in another problem and the missing data prediction feature seems to work wonders. As a philosophical discussion point: is there a use case for explicitly calling .random() method inside a model context?