Advice on dimensionality reduction?

I have recently been working with highly correlated variables (which are giving NUTS a hard time), and would like advice on a variable selection/dimensionality reduction technique that I just came up with, completely ad hoc.

Since extremely long running times have often been described here as a sign that the model is wrong, would it make sense to write a script that loops through the different models you have set up (each with different priors, or whatever other differences you choose), and records the average iterations/second over a 10-minute trial run for each?

Everything else being equal, would it be incredibly naive to choose models based on how long they are expected to run, or does my idea have some merit?

It seems to me that this approach is calculating the relative data likelihoods in a roundabout way. That is, you specify some model P(data | params)P(params), and are relying on the speed of sampling to tell you about the relative values of P(data).

Alternatively, you could sample from the priors of the models and calculate the average data likelihood. I suspect that would tell you the same thing.


How could my models’ priors be quickly sampled, in their model and data context?

Would this just be pm.sample_prior_predictive()?

My intuition is that time isn’t a great indicator of model correctness. A single-level GLM will run much faster than a hierarchical model even if it’s wrong; it’s just stealing degrees of freedom. I have run models on simulated data for which I know the exact data-generating process that still take 10+ s/iteration. The idea may make more sense when comparing similar models with different covariate sets.

But I do like the idea of just sampling from the model and comparing the data it produces to your data. Faster way to rule things out if you have a slow model before doing a proper estimation.


How could my models’ priors be quickly sampled, in their model and data context?

Use a data-less model to generate parameters, and a data-ful one to estimate the likelihood:

with pm.Model() as priors_only:
    μ = pm.Normal('mu', 0, 1, shape=k)
    L_Σ_vec = pm.LKJCholeskyCov('L_vec', n=k, eta=0.5,
                                sd_dist=pm.HalfNormal.dist(1))
    sim_params = pm.sample(500)

with pm.Model() as data_model:
    μ = pm.Normal('mu', 0, 1, shape=k)
    L_Σ_vec = pm.LKJCholeskyCov('L_vec', n=k, eta=0.5,
                                sd_dist=pm.HalfNormal.dist(1))
    L_Σ = pm.Deterministic('L', pm.expand_packed_triangular(k, L_Σ_vec))
    X = pm.MvNormal('obs', μ, chol=L_Σ, observed=data)
    llik = data_model.logp  # this step is super important
    dliks = [llik({'mu': sim_params['mu'][i, :], 'L_vec': sim_params['L_vec'][i, :]})
             for i in range(500)]
    # these are log P(X|theta) + log P(theta)
    print('mean log-likelihood: %f' % np.mean(dliks))

Would this just be pm.sample_prior_predictive()?

I think this would generate a matrix of stuff that “should look like” the data.
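To make that concrete, here is a plain-NumPy sketch of what a prior predictive draw conceptually does (this is an illustration of the idea, not the actual `pm.sample_prior_predictive` implementation; the prior choices and sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_draws, n_obs = 3, 500, 100

# Draw parameters from the prior...
mu_draws = rng.normal(0.0, 1.0, size=(n_draws, k))        # Normal(0, 1) prior on mu
sigma_draws = np.abs(rng.normal(0.0, 1.0, size=n_draws))  # HalfNormal(1) scale

# ...then simulate a dataset from the likelihood at each draw.
# Each fake_data[i] is one "dataset that should look like" the real data;
# comparing these against the observations is a prior predictive check.
fake_data = np.array([
    rng.normal(mu_draws[i], sigma_draws[i], size=(n_obs, k))
    for i in range(n_draws)
])
print(fake_data.shape)  # (500, 100, 3)
```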

As an aside, the mean likelihood calculated above is a first-order dumb monte-carlo integration. The whole point of NUTS is to make things smarter. Really all this is doing is giving you a sense of (hand-waving) the KL divergence between the un-fit prior and the posterior of the model. From this perspective it’s difficult to distinguish between “bad model” (NO settings of parameters are particularly good) and “bad parameters” (the model is right, but your initial values stink).
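One practical wrinkle with the first-order Monte Carlo estimate: the `dliks` values are *log*-probabilities, so if you want something like log P(data) you should average on the likelihood scale, using the log-sum-exp trick to avoid underflow. A small sketch (the `log_liks` values here are simulated stand-ins for `dliks`, not real model output):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for dliks: log P(data, theta_i) evaluated at 500 prior draws.
log_liks = rng.normal(-1000.0, 5.0, size=500)

# The naive mean of the log values (what the snippet above prints):
naive = log_liks.mean()

# A first-order Monte Carlo estimate of log P(data) = log E_prior[P(data | theta)]
# averages on the *likelihood* scale; the max-shift (log-sum-exp) trick keeps it
# numerically stable when exp(log_liks) would underflow to zero.
m = log_liks.max()
log_marginal = m + np.log(np.mean(np.exp(log_liks - m)))

# By Jensen's inequality, the mean of logs is always a lower bound:
assert naive <= log_marginal
```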

A more robust alternative might be to use ADVI (which should be far faster than NUTS) to better approximate the likelihood:

with data_model:
    sim_params = pm.ADVI().fit().sample(500)
    dliks = [llik({'mu': sim_params['mu'][i, :], 'L_vec': sim_params['L_vec'][i, :]})
             for i in range(500)]

That should at least help distinguish between poor models and poor initial prior parameters.

Bayesian Model Selection (BMS) is a technique similar to what you’ve described. In BMS, the model itself is treated as a parameter to be sampled.

i.e., at each iteration, instead of just sampling parameter values, the sampler can change which parameters are actually included in the model.

Techniques for accomplishing this include Reversible Jump MCMC.

Here’s an article about it in R:

It’s possible to set this up in some form in PyMC3 (in fact, I think I’ve seen some examples), but there are various problems/difficulties to work through to use it.


Thanks a lot for your advice. The first script actually works pretty quickly, taking ~1min. to run, even given my large input data (~5 million rows, 16 columns). I can easily see myself incorporating this into my PyMC3 workflow.

As for the more robust alternative you mentioned utilizing ADVI, it is dramatically slower than the first script, taking ~3 hrs to run. Is it supposed to be like that?

dycontri, I knew there must have been techniques like this published somewhere, so my google skills must have failed. I will look into it, thanks.


Is it supposed to be like that?

No. If your rows are independent observations (so you don’t have some pesky MatrixNormal lying about), you can use minibatch ADVI to speed it up significantly. Just change

pm.SomeDistribution(*args, observed=data)

to

batch = pm.Minibatch(data, 50)  # or whatever batch size
pm.SomeDistribution(*args, observed=batch)

Or (perhaps even better) just run the initial step on a small random subset of the data.
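Subsetting the rows for an initial pass is a one-liner with NumPy (the array sizes here are made-up placeholders for your actual data; sampling without replacement is valid when rows are exchangeable/independent):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(100_000, 16))  # stand-in for the real dataset

# Draw a small random subset of rows, without replacement, for a cheap
# first-pass fit before running the full model.
idx = rng.choice(data.shape[0], size=10_000, replace=False)
subset = data[idx]
print(subset.shape)  # (10000, 16)
```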

Edit: Also, the default number of steps for ADVI is something ridiculous like 200,000. It tends to converge long before that, so I tend to tune it down significantly. There are callbacks that are supposed to stop it early, but for whatever reason I’ve never gotten them to work.


I would appreciate it if you could clarify a small bit: what does “if your rows are independent observations” mean, precisely? I understand the theoretical importance of independence, but what does it mean practically for PyMC3?

For my time series problem, all of my data are highly auto-correlated, and non-trivially so, because many of the variables are simply rolling statistics. Wouldn’t that make the rows dependent?

In practice, if you can subset the rows of your data and still pass it into your model (without, of course, changing the model), then you can use minibatch ADVI.

The cases where you can’t do this are typically because you’re modeling covariances between rows as well as between columns.

For time series, the typical approach is to introduce lag variables (thereby increasing the number of columns); and then the model can utilize the lags within the same row rather than pulling values from previous rows. This then makes each row conditionally independent (given the lag variables) and thus amenable to sub-sampling, bootstrapping, and minibatch fitting.
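A minimal pandas sketch of the lag-column approach described above (the column names and series length are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'y': rng.normal(size=200)})

# Add lag columns so that each row carries its own history; the model can
# then use lags within a row instead of reaching into previous rows.
for lag in (1, 2, 3):
    df[f'y_lag{lag}'] = df['y'].shift(lag)

# The first few rows have no history, so drop them.
df = df.dropna().reset_index(drop=True)

# Rows are now conditionally independent given the lag columns, and can be
# subsampled, bootstrapped, or minibatched.
print(df.shape)  # (197, 4)
```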

Thank you for the assurance that using rolling or lagged variables is still possible, and that high auto-correlation does not by itself rule out (conditional) independence between rows, as I feared it might.

I will have to read up on sub-sampling, bootstrapping, and minibatch fitting, because I am still confused. Will reply if I have further questions. Thanks a lot, chartl.