Is it possible to retrieve pm.Data by column names?

Derrick_Lewis · May 24, 2022, 6:48pm

I have read through a few posts, the docs, and the example notebook on how to use pm.Data() to store variables that can later be reset to new data to make subsequent predictions.

Is it possible to set an entire pd.DataFrame and later call individual variables by their column name?

I’ve been testing setting coords and dims, which do show up in the ArviZ InferenceData, but I can’t figure out how to use them inside the model to index the Aesara TensorSharedVariable object.


import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

"""
Build fake data
"""
size = 400 # Number of data points
true_intercept = 1 # This would be unknown in the real world
slope = 2 # This would be unknown in the real world
x  = np.linspace(0, 1, size)

# y = m * x + b
true_regression_line = true_intercept + slope * x

# Add noise with standard deviation 0.5 to test the model and show how Bayesian regression will also recover the noise level.
y = true_regression_line + rng.normal(scale=0.5, size=size)
data = pd.DataFrame(dict(x=x, y=y))


coords = {'index': data.index, 'columns': data.columns}
with pm.Model(coords=coords) as model:  # model specifications in PyMC are wrapped in a with-statement
    
    #option one - tried this
    #shared = pm.Data("shared", data, export_index_as_coords=True)
    
    #option two
    shared = pm.Data("shared", data, dims=("x", "y"))

    # Define priors
    sigma = pm.HalfCauchy("sigma", beta=1, testval=1.0)
    intercept = pm.Normal("Intercept", 0, sigma=10)
    x_coeff = pm.Normal("x", 0, sigma=3)


    # Define likelihood
    likelihood = pm.Normal("y", 
                           mu=intercept + x_coeff * shared["x"], # <--Something like this?
                           sigma=sigma, 
                           observed= shared["y"] # <-- and Something like this?
                           )

    idata = pm.sample(1000, return_inferencedata=True)

I am able to get several columns like this, but it’s not ideal for future adaptability:

shared = pm.Data("shared", data, dims=("x", "y"))
x = shared.get_value()[:,0]
y = shared.get_value()[:,1]

The docstring on pm.Data leads me to believe I should be able to do this, but its pointer to the ArviZ Quickstart guide didn’t shed any light on how:

value : array_like or pandas.Series, pandas.Dataframe
        A value to associate with this variable.
dims : str or tuple of str, optional
        Dimension names of the random variables (as opposed to the shapes of these
        random variables). Use this when ``value`` is a pandas Series or DataFrame. The
        ``dims`` will then be the name of the Series / DataFrame's columns. See ArviZ
        documentation for more information about dimensions and coordinates:
        :ref:`arviz:quickstart`.
        If this parameter is not specified, the random variables will not have dimension
        names.
export_index_as_coords : bool, default=False
        If True, the ``Data`` container will try to infer what the coordinates should be
        if there is an index in ``value``.

Perhaps someone can point me to the right docs.

PyMC = 4.0.0b5 // Aesara = 2.5.1 // Arviz = 0.12.0

Thanks for any insight you might be able to share.

ricardoV94 · May 24, 2022, 7:28pm

There’s no current functionality to do that. Aesara shared variables must get numpy arrays as inputs.
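
One possible workaround, sketched here under assumptions rather than as a built-in feature: keep a mapping from column name to integer position outside the model and use it to index the array. The helper `column_positions` is a hypothetical name, and a plain NumPy array stands in for the contents of the shared variable so the snippet runs without PyMC.

```python
import numpy as np
import pandas as pd


def column_positions(frame):
    """Map each column name to its integer position (hypothetical helper)."""
    return {name: pos for pos, name in enumerate(frame.columns)}


data = pd.DataFrame({"x": np.linspace(0, 1, 5), "y": np.arange(5.0)})

col = column_positions(data)   # name -> position lookup
values = data.to_numpy()       # the plain array a shared variable would hold

# Select columns by name through the lookup instead of hard-coded 0 and 1.
x_values = values[:, col["x"]]
y_values = values[:, col["y"]]

# Inside a model, the same integer could index the shared variable itself
# (assumption: shared = pm.Data("shared", data.to_numpy())):
#     mu = intercept + x_coeff * shared[:, col["x"]]
```

Indexing the shared variable symbolically, as in the final comment, should keep the link to the container, whereas `get_value()` returns a NumPy array whose values would be fixed into the model. The lookup only stays valid while new data keeps the same column order.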

Derrick_Lewis · May 25, 2022, 7:22pm

Thanks for the quick follow up!
