Proper way to model several variables

shakasaki · August 5, 2022, 12:05am

Hello pyMC’ers. I’m relatively new here and still trying to find the optimal way to do things. I’m currently running inference on a model that involves many variables. I have about 100 unobserved random variables, which are actually pairs; so 50 pairs of unobserved variables. By pair I mean that changing one of the two (these are actually 2 angles) the result will affect the other as well, so they are correlated. I assume they are normally distributed for now

To create these I am running the following:

with pm.Model() as model:
        dips = pm.Normal('dips', mu=0, sigma=3.0, shape=50)
        azimuths = pm.Normal('azimuths', mu=0, sigma=3.0, shape=50)

However, I saw in the tutorial that the way to do this is with specifying coordinates in the model and using dims instead of shape. For example:

with pm.Model(coords={"dip": np.arange(50),
                      "azimuth": np.arange(50)}) as model:
    d = pm.Normal("d", mu=0, sigma=1, dims="dip")
    a = pm.Normal("a", mu=0, sigma=1, dims="azimuth")

The first way seems to work, but I have not finished the inference yet (takes ~2 weeks). My question is, is it a legitimate way to assign multiple variables the way I do it (with shape) or should it be done with dims and coords? What are the differences between these two approaches?

Thanks for the feedback!

cluhmann · August 5, 2022, 2:35am

Coords and dims are available to make your life easier. You are certainly not required to use them. I would suggest taking a look at @OriolAbril 's posts here and here and my own, shorter treatment here. Hopefully those help you see why and when you might want to use coords instead of shapes. If not, feel free to ask here!

shakasaki · August 5, 2022, 11:23am

Thank you Christian! I read through the posts you suggested and through the documentation for xarray. It was particularly clear from your post where you compare the “old” and “new” method, latter being with xarray.

If I can summarize what I understood, hoping this is correct, it is that shape and dims/coords are identical to the MCMC (or any other) sampler. That is, the parameter space will be treated as a collection of variable that are independent. It is just that dims/coords and the use of xarray makes it much more intuitive, because names / attributes can be assigned to the data.

So, these two parameter designations should give identical results when running inference, with the exception that the latter is more intuitive. Below I tried running a test but I was not able to compare the MCMC runs. I thought of doing it through the acceptance rate (two identical runs with the same seed should have exactly the same acceptance rates, right?) but I was not able to get this confirmed. I think I am not properly initiating the inference so that the results are exactly reproducible. But the approaches should be identical, right?

Thanks again for the help!

Edit: I include the random_seed argument when running the samplers but results are still not exactly reproducible, based on subtracting acceptance rates

# import needed packages
import pymc as pm
import numpy as np
import arviz as az

# set dimensions and create random state for reproducibility
dims = 50
rng = np.random.default_rng(5082022)

# create data
dip_data = rng.standard_normal(dims)
azimuth_data = rng.standard_normal(dims)

# run the inferene using dimensions and coordinates
dimension_names = ['Dim ' + str(d) for d in range(dims)]
with pm.Model(coords={"dip": dimension_names,
                      "azimuth": dimension_names}) as model:
    # initialize parameter vectors
    dips = pm.Normal("dip vector", mu=0, sigma=1, dims='dip')
    azimuths = pm.Normal("azimuth vector", mu=0, sigma=1, dims='azimuth')
    # observed data
    observed_dip = pm.Normal(
        'dip angle',
        mu=0,
        observed=dip_data,
        dims='dip')
    observed_azimuth = pm.Normal(
        'azimuth angle',
        mu=np.zeros(dims),
        observed=azimuth_data,
        dims='azimuth')
    compute_trace_dims = pm.sample(1000, tune=1000, cores=1, random_seed=5082022)

#run the inference using shape to define dimensions
with pm.Model() as model2:
    # parameters
    dips2 = pm.Normal('dips', mu=0, sigma=3.0, shape=dims)
    azimuths2 = pm.Normal('azimuths', mu=0, sigma=3.0, shape=dims)
    # observed data
    observed_dip2 = pm.Normal(
        'dip angle',
        mu=0,
        observed=dip_data
    )
    observed_azimuth2 = pm.Normal(
        'azimuth angle',
        mu=0,
        observed=azimuth_data
    )
    compute_trace_shape = pm.sample(1000,
                                    tune=1000,
                                    cores=1,
                                    random_seed=5082022)



# az.plot_trace(compute_trace_dims)
# az.plot_trace(compute_trace_shape)

compute_trace_dims.sample_stats.acceptance_rate - compute_trace_shape.sample_stats.acceptance_rate

cluhmann · August 8, 2022, 1:56am

Your models are not the equivalent. In model, you have sigma=1. In model2, you have sigma=3. When you set to them both to 1, the following gives me zero:

( compute_trace_dims.posterior["dip vector"] -
  compute_trace_shape.posterior["dips"] ).sum()

shakasaki · August 8, 2022, 8:21am

Yes, you’re right, that was a typo. So then the two approaches are equivalent!

Topic		Replies	Views
Multivariate model not behaving as expected bug	3	223	March 19, 2024
Modelling multiple correlated variables Questions	10	1352	November 6, 2023
Understanding dimensions/shapes of variables v5	5	1326	August 29, 2023
How to build a multi-covariance model in pymc3? Questions	2	1216	October 7, 2020
Using shape parameter vs using specifying it using a different model altogether Questions	5	1990	August 29, 2018

Proper way to model several variables

Related topics