Sampling semantics of multiple observed variables

[This seems like a super obvious FAQ, but I can’t find an answer to it because I have no idea what set of keywords would reveal it.]

How do I represent a multi-variable observation process? I’m not sure of the right term for this, so let’s take an example:

What if when I observe some process, I get simultaneous values for variables A, B and C? Put differently, every observation is of the form <a, b, c>?

Is this correctly represented as, for example:

Variable("A", observed=as)
Variable("B", observed=bs)
Variable("C", observed=cs)

My guess is that the answer is "no", and that the a_obs, b_obs, and c_obs above would be taken as independent observations of the three variables.

But in my case, the observations come as tuples (a[0], b[0], c[0]), and so forth; (a[x], b[y], c[z]) is not, in general, a valid observation unless x = y = z.

In general, also, A, B, and C are not the same type of variable (e.g., A might be Categorical, B TruncatedNormal, and C Gamma), so I can't just stack them into a single vector-valued variable with a shape argument.

As I said, this seems like an incredibly basic question, but I don’t know how to find an explanation of what it means when multiple variables in a model are observed, or how to make a tuple of observations. I can’t build an observation tuple with a deterministic RV, because I can’t observe a deterministic RV.

Thanks, and apologies for such a basic question.

It seems like a partial answer is implicit in this example here: https://docs.pymc.io/notebooks/survival_analysis.html

See especially cell [13]. It seems that the author creates a lambda_ array of deterministic random variables and then a corresponding mu deterministic RV, and this somehow interacts with the ordering of the observed death times. But alas, I still don't understand how the independent random variables are synchronized with the corresponding output variables (the death values); this appears too obvious to the author to need explanation.

Maybe a concrete example will help. I am building a model of logic circuits implemented in living cells. For a fragment of this model, suppose we have two observed inputs: the growth medium (there are three growth media) and the logical input, both categorical and both observed. There is also an output value, which should cluster around a high level or a low level, depending on the two input bits, according to the XOR function.

I was starting with a linear model in which the output is distributed as Normal(mu = io + v * mf), where io ~ Normal(mu = high or low) is the "ideal output." So, for example, for input 01, io ~ Normal(mu=high, sd=x).

Now I have some number n of observations, split between the three media. So, based on the survival analysis example cited above, I was going to make a mask array of 1's and 0's (v above): one row for each of the 01-input observations, for example, and three columns wide for the three growth media (a rough sketch of what I mean is below).
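To make that mask concrete, here is a rough numpy sketch of what I have in mind (the medium indices here are made up):

import numpy as np

# Hypothetical: the growth-medium index (0, 1, or 2) of each observation whose input was 01.
media_for_01 = np.array([0, 2, 1, 0])

# One row per observation, three columns (one per growth medium),
# with a single 1 marking the medium that observation actually used.
mask_01 = np.zeros((len(media_for_01), 3))
mask_01[np.arange(len(media_for_01)), media_for_01] = 1.0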

That was the only way I could think of to "bind together" the observations, each of which is an <input, medium, output> tuple. But doing this seems to rely on an intimate knowledge of how PyMC3 translates to Theano and decides whether or not to broadcast, etc.

Part of what I came up with was the following snippet:

influences = {x: pm.Deterministic('medium influences %s' % x,
                                  T.exp(media_vars_sorted * shared(masks[x])))
              for x in inputs}
for x in inputs:
    pm.TruncatedNormal("XOR%s out" % x,
                       mu=influences[x].sum(axis=1) + ideals[x],
                       sd=1.0, lower=0.0, upper=8.0,
                       observed=outputs(x))

So influences[x] will (I hope!) be an n-by-3 array, with n the number of samples (the number of entries in the one-dimensional array outputs(x)) and 3 the number of growth media. Each row will have only one non-zero entry. ideals[x] will be a scalar.

My fond hope is that influences[x].sum(axis=1) will sum the three entries in each row (two zeroes and one non-zero), and that adding the scalar ideals[x] will then give me n random variables, each of which has one observation. I don't know how PyMC3 would determine, from the fact that there are n mus and n observations, that it should map them one to one.
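Just to spell out the shapes I am expecting, here is the same computation in plain numpy (ignoring the exp, with made-up numbers):

import numpy as np

media_vals = np.array([0.2, -0.5, 1.3])      # one value per growth medium
mask = np.array([[1, 0, 0],
                 [0, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0]], dtype=float)     # n = 4 rows, a single 1 per row
ideal = 4.0                                   # scalar "ideal output" for this input

mu = (media_vals * mask).sum(axis=1) + ideal  # broadcast, sum each row, add the scalar
print(mu.shape)                               # (4,): one mu per observation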

But I don’t know if this will do what I want and, even worse, I don’t even know how to tell whether it’s doing what I want or not. Or if this is just a bad way to achieve something that can be better done some other way. Any advice very much welcomed!

Your model looks like a hierarchical linear model. The general way of doing this relies on a covariate data matrix that might look like

intercept  media_A  media_B  input_XOR
        1        1        0       0    
        1        1        0       1   
        1        0        1       0    
...

Here the XOR of the input bits is pre-computed. Then:

import theano
import theano.tensor as tt
import pymc3 as pm

X = theano.shared(data_matrix)     # data_matrix: numpy covariate array, one row per observation
y_obs = theano.shared(data_obs)    # observed outputs, aligned row-for-row with X
with pm.Model() as mod:
    beta_vals = pm.Normal('beta', 0., 2., shape=4)   # one coefficient per column of X
    err_sd = pm.HalfNormal('err_sd', 1.)
    y_pred = pm.Deterministic('y_pred', tt.dot(X, beta_vals))
    lik = pm.Normal('y_obs', mu=y_pred, sd=err_sd, observed=y_obs)

In this way the (independent) covariates map to the (dependent) observations through the unknown parameters, and X[i,:] corresponds to y_obs[i].
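To make that row alignment concrete, here is one way data_matrix and data_obs might be assembled from raw <input, medium, output> tuples; the raw arrays and the choice of media_C as the baseline column are just illustrative:

import numpy as np

# Hypothetical raw observations: each index i is one <input, medium, output> tuple.
input_bits = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the two input bits
medium     = np.array([0, 0, 1, 2])                        # growth-medium index in {0, 1, 2}
outputs    = np.array([1.2, 5.0, 4.8, 0.9])                # measured output level

n = len(outputs)
xor = np.logical_xor(input_bits[:, 0], input_bits[:, 1]).astype(float)

data_matrix = np.column_stack([
    np.ones(n),                    # intercept
    (medium == 0).astype(float),   # media_A indicator
    (medium == 1).astype(float),   # media_B indicator (media_C is the baseline)
    xor,                           # pre-computed XOR of the input bits
])
data_obs = outputs                 # row i of data_matrix describes observation i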


Thank you very much for the example.

So am I correct in thinking that what is happening here is that PyMC3 (or Theano) notices that y_pred has n entries and lik has n entries, and assumes from this that we have n random variables with n observations? Whereas if y_pred were a scalar, we would get 1 random variable with n observations?

It makes me a little nervous that the vector/scalar distinction is not very apparent when one reads the code, but that is probably just me.

It’s just a trick with how broadcasting works (http://deeplearning.net/software/theano/tutorial/broadcasting.html). Ultimately when you get to the terminus of the computation graph, you get something that looks like

(y_obs - y_pred)/err_sd

If y_obs is of shape k, this expression is only defined for y_pred (and err_sd) of shape 1 or shape k, corresponding to the two situations you described.
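A tiny numpy illustration of that rule (Theano follows the same broadcasting semantics; the numbers are made up):

import numpy as np

y_obs = np.array([1.0, 2.0, 3.0])            # k = 3 observations
err_sd = 0.5

y_pred_vector = np.array([0.9, 2.1, 2.8])    # shape (3,): one mu per observation
y_pred_scalar = 2.0                          # a single mu shared by every observation

print((y_obs - y_pred_vector) / err_sd)      # elementwise: observation i is matched to mu i
print((y_obs - y_pred_scalar) / err_sd)      # broadcast: the one mu is reused for each entry
# A y_pred of any other length, e.g. np.ones(2), raises a broadcasting error instead.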

It's not just you: the penalty for convenience here is a lack of clarity. Below I have pasted a model (from a private message) where the difference between imputing missing data and marginalizing over missing data is an obscure shape= term. Both versions run perfectly well; but if this were a particularly expensive computation, I would hate to have wanted to impute the data, only to get a single marginal distribution out the other side.

So…be fastidious!

import numpy as np
import pymc3 as pm


# generate data
n_sample=200
X_1 = np.random.normal(4,2,n_sample)
X_2 = np.random.normal(12,5,n_sample)
x_train = np.vstack((X_1,X_2)).T
beta_ = np.array([.3,.9])
alpha_ = 5 #intercept
sd_ = .05
y_train = alpha_ + sd_ * np.random.randn(n_sample) + np.dot(x_train, beta_.T)
c = 6 #censored value
X_1_obs = X_1[X_1 < c]
X_1_cen = X_1[X_1 >= c]
y_obs = y_train[X_1 < c]
y_cen = y_train[X_1 >= c]
X_2_obs = X_2[X_1 < c]
X_2_cen = X_2[X_1 >= c]

method = 'APPROX'
with pm.Model() as mod:
    # model for X1 ("U1")
    u = pm.Normal('u_mean', mu=0., sd=4.)
    h = pm.HalfNormal('u_sd', 4.)
    U1_obs = pm.Normal('U1o', mu=u, sd=h, observed=X_1_obs)
    if method == 'APPROX':
        U1_cen = pm.TruncatedNormal('U1c', mu=u, sd=h, lower=c)  # MARGINALIZE over censored data (!)
    else:
        U1_cen = pm.TruncatedNormal('U1c', mu=u, sd=h, lower=c, shape=(X_1_cen.shape[0],))  # IMPUTE the censored data: one latent value per censored point
    # model for coefs and intercept
    err = pm.HalfNormal('err', 2.)
    intercept = pm.Normal('int', 0., 1.)
    b1, b2 = pm.Normal('b1', 0., 1.), pm.Normal('b2', 0., 1.)
    # model for y 
    yobs = pm.Normal('yobs', intercept + b1 * U1_obs + b2 * X_2_obs, err, observed=y_obs)
    ycen = pm.Normal('ycen', intercept + b1 * U1_cen + b2 * X_2_cen, err, observed=y_cen)
    trace = pm.sample(500, tune=1000, chains=6, cores=2)
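One quick way to tell which of the two models you actually built (a sketch, assuming the sampling call above has finished) is to look at the shape of the posterior samples for 'U1c':

# Marginal (APPROX) branch: 'U1c' is a single scalar RV, so its samples are 1-D.
# Imputation branch (with shape=): one imputed value per censored point, so the
# samples have a trailing dimension of X_1_cen.shape[0].
print(trace['U1c'].shape)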