Is it possible to retrieve pm.Data by column names?

I have read through a few posts, the docs, and the example notebook on how to use pm.Data() to store variables that can later be reset to new data to make subsequent predictions.

Is it possible to set an entire pd.DataFrame and later call individual variables by their column name?

I’ve been experimenting with coords and dims, which do show up in the ArviZ InferenceData, but I can’t figure out how to use them inside the model to index the Aesara TensorSharedVariable object.


"""
Build fake data
"""
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(42)  # seed is arbitrary; added so the example is reproducible

size = 400  # number of data points
true_intercept = 1  # This would be unknown in the real world
slope = 2  # This would be unknown in the real world
x = np.linspace(0, 1, size)

# y = m * x + b
true_regression_line = true_intercept + slope * x

# Add noise (sd = 0.5) to test the model and show that Bayesian regression also recovers the noise level.
y = true_regression_line + rng.normal(scale=0.5, size=size)
data = pd.DataFrame(dict(x=x, y=y))


coords = {'index': data.index, 'columns': data.columns}
with pm.Model(coords=coords) as model:  # model specifications in PyMC are wrapped in a with-statement
    
    #option one - tried this
    #shared = pm.Data("shared", data, export_index_as_coords=True)
    
    #option two
    shared = pm.Data("shared", data, dims=("x", "y"))

    # Define priors
    sigma = pm.HalfCauchy("sigma", beta=1, initval=1.0)  # `testval` is deprecated in PyMC 4 in favor of `initval`
    intercept = pm.Normal("Intercept", 0, sigma=10)
    x_coeff = pm.Normal("x", 0, sigma=3)


    # Define likelihood
    likelihood = pm.Normal(
        "y",
        mu=intercept + x_coeff * shared["x"],  # <-- Something like this?
        sigma=sigma,
        observed=shared["y"],  # <-- and something like this?
    )

    idata = pm.sample(1000, return_inferencedata=True)

I am able to pull out individual columns like this, but hard-coding the column positions is not ideal for future adaptability:

shared = pm.Data("shared", data, dims=("x", "y"))
x = shared.get_value()[:, 0]
y = shared.get_value()[:, 1]

The docstring for pm.Data leads me to believe I should be able to do this, but the ArviZ Quickstart guide it points to didn’t shed any light on how:

value : array_like or pandas.Series, pandas.Dataframe
        A value to associate with this variable.
dims : str or tuple of str, optional
        Dimension names of the random variables (as opposed to the shapes of these
        random variables). Use this when ``value`` is a pandas Series or DataFrame. The
        ``dims`` will then be the name of the Series / DataFrame's columns. See ArviZ
        documentation for more information about dimensions and coordinates:
        :ref:`arviz:quickstart`.
        If this parameter is not specified, the random variables will not have dimension
        names.
export_index_as_coords : bool, default=False
        If True, the ``Data`` container will try to infer what the coordinates should be
        if there is an index in ``value``.

Perhaps someone can point me to the right docs.

PyMC = 4.0.0b5 // Aesara = 2.5.1 // Arviz = 0.12.0

Thanks for any insight you might be able to share.

There’s no current functionality to do that. Aesara shared variables must get numpy arrays as inputs.
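
One possible workaround, as a minimal sketch rather than anything built into PyMC: keep a mapping from column name to integer position alongside the array, and index with that instead of hard-coding 0 and 1. The helper `column_positions` is a hypothetical name, and the small array below only stands in for `data.columns` and `data.to_numpy()` from the question.

```python
import numpy as np


def column_positions(columns):
    """Map each column name to its integer position (hypothetical helper)."""
    return {name: i for i, name in enumerate(columns)}


# Stand-ins for `data.columns` and `data.to_numpy()` in the question.
columns = ["x", "y"]
values = np.column_stack([np.linspace(0, 1, 5), np.arange(5.0)])

pos = column_positions(columns)

# Select columns by name via their integer position.
x = values[:, pos["x"]]
y = values[:, pos["y"]]
```

Inside the model, the same integer index should be usable directly on the shared variable, e.g. `shared[:, pos["x"]]`, since Aesara tensors accept NumPy-style basic indexing. That is probably preferable to `shared.get_value()[:, 0]`, because `get_value()` returns a plain NumPy array, so later calls to `pm.set_data` would not reach that part of the model.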

Thanks for the quick follow-up!