Help with Matrix-Based Inference (A toy problem)

Hello all,

I am trying out a simple toy model to infer the value of X in the following matrix-based equation

Y = AX + noise

The challenge I’ve set myself is to see how far I can push the number of dimensions/slices. In the example considered, I have 100 slices: A and X are stacks of square (full) 4x4 matrices, both with shape (100, 4, 4), so Y also has shape (100, 4, 4).

Here is the full code, currently running on PyMC v5.17.0. I should add that my model includes a few commented-out alternative methods for priors that I have been experimenting with…

#%% imports
import numpy as np
import pymc as pm
import pytensor.tensor as pt
import arviz as az
import matplotlib.pyplot as plt

#%% define the observation

# Set random seed
np.random.seed(43)

# Set the number of slices to consider 
numslices = 100
slices = np.linspace(1,numslices, numslices)

# Generate random arrays (to be inferred) with an arbitrary scaling factor 
A_scale = 100
X_scale = 100

A = A_scale*np.random.random((numslices, 4, 4))

# X is the array to be inferred
X = X_scale*np.random.random((numslices, 4, 4))

# Matrix-multiply to set the (noise-free) observation
Y_true = A @ X

# make the observation noisy 
noise_obs = 0.5 #  THIS WILL BE THE SIGMA VALUE TO BE INFERRED
noise = np.random.normal(0, noise_obs, size=Y_true.shape)

#Combine the noise and the true signal to make the observation
Y_obs = Y_true + noise

#%% pymc variables and convert arrays to pytensor values to be used in the pymc model
A_pt = pt.as_tensor_variable(A)
data_pt = pt.as_tensor_variable(Y_obs)


#%% PYMC Model

with pm.Model() as model:
    #Hierarchical model based on same sigma for all values of X
    # sigma = pm.HalfNormal("sigma", sigma = 1) # Hierarchical on sigma
    # X_pymc = pm.Normal("X_pymc",mu=0, sigma = sigma, shape = (100,4,4))

    # Non-Hierarchical Prior on 𝜎
    # X_pymc = pm.Normal("X_pymc",mu=0, sigma = 1, shape = (100,4,4))
    # sigma = pm.HalfNormal("sigma", sigma = 1) # Hierarchical on sigma
    
    # Adjusted (Non-Centred) Hierarchical setup
    sigma = pm.Normal("sigma", mu = 0.5, sigma=0.1)
    X_raw = pm.Normal("X_raw", mu=0, sigma=1, shape=(100, 4, 4))
    X_pymc = pm.Deterministic("X_pymc", X_raw * sigma)

    # Prediction 
    Y_pymc =  A_pt @ X_pymc 

    # Likelihood
    likelihood = pm.Normal("likelihood", mu= Y_pymc, sigma=sigma, observed=data_pt)

# %% Sampling
with model:
    trace = pm.sample(
        draws = 1000, 
        tune=1000, 
        chains=4,
        cores=4)
    # Also tried: target_accept=0.95, return_inferencedata=True

# SUMMARISE AND PLOT

# %% Summarise Results - get the mean and form a matrix (like the true value of A)
summary = az.summary(trace)
X_means = summary.loc[summary.index.str.startswith("X_pymc"), "mean"].values 
X_shape = (numslices, 4, 4) 
X_means_matrix = X_means.reshape(X.shape) 

#%% 
# Plot Results (slices vs value) 
fig, axes = plt.subplots(4, 4, figsize=(18, 14), sharex=True, sharey=True) 

for row in range(4): 
    for col in range(4):
        ax = axes[row, col]
        ax.plot(slices, X[:, row, col], 'k',label="True", alpha=1)   
        ax.plot(slices, X_means_matrix[:, row, col],'r--', label="NUTS mean after sampling", alpha=1)
        ax.set_title(f'X[{row+1},{col+1}]', fontsize=10)
        # ax.grid(True)  # Add grid for readability
        if row == 3:  # X-axis labels only for the last row
            ax.set_xlabel('Slice', fontsize=9)
        if col == 0:  # Y-axis labels only for the first column
            ax.set_ylabel('Value', fontsize=9)

# Adjust the layout to create space for the legend and title
# plt.subplots_adjust(top=0.75, bottom=0.1, left=0.1, right=0.95, hspace=0.4, wspace=0.4)
plt.subplots_adjust(top=0.9, bottom=0.08, left=0.06, right=0.97, hspace=0.3, wspace=0.1)

# Add a shared legend above the subplots
handles, labels = axes[0, 0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol=2, fontsize=12, frameon=True, bbox_to_anchor=(0.5, 1.0))

# Add a main title even higher above the legend
# fig.suptitle('Results (Slices vs Value)', fontsize=16, y=0.95)

plt.show()


#%% Diagnostics

# Select a specific entry of X_pymc
subset = trace.posterior["X_pymc"].isel(
    X_pymc_dim_0=0 ,    # Slice index 
    X_pymc_dim_1=0,     # Row index 
    X_pymc_dim_2=0      # Column index  
) 

#Plot the chosen traces 
az.plot_trace(subset) 
az.plot_posterior(subset) 

# Trace for sigma (noise)
az.plot_trace(trace, var_names="sigma")


I seemed to have some joy with non-informative priors. I didn’t think it would be the best idea to plot the resultant traces and distributions for such a large set of unknowns, so I have created a plot of the true values (X) vs the means after sampling. You can see below that the inference has worked well for X, but the trace and value for sigma are way off (circa 15 instead of 0.5). The ArviZ summary shows the inference of sigma has poor numbers for Rhat and ESS.

I have tried a number of approaches to change the likelihood and can’t seem to get any closer to the correct value of 0.5 (see noise_obs = 0.5 in the code above).

If I start adjusting the A_scale and X_scale scaling parameters in this section

# Generate random arrays (to be inferred) with an arbitrary scaling factor 
A_scale = 100
X_scale = 100

A = A_scale*np.random.random((numslices, 4, 4))

# Perform operations to set the observation
X =  X_scale*np.random.random((numslices, 4, 4))

Y_true =A@X

and change both A_scale and X_scale to 1, the issue flips: the inference of X gets poorer while sigma improves, though it is still not quite correct. I suspect this is a “signal-to-noise” issue, since the standard deviation of the additive noise (0.5) is now close to the scale of the signal.

In this example, I can see from the ArviZ summary that the Rhat and ESS for sigma are much improved, with a clear normal-looking posterior; however, the distribution is again centred on the wrong value…



In summary, I would be grateful for any pointers on

  1. Any fundamental omissions/limitations in the code I may have simply missed.
  2. The use of hierarchical, non-centred or non-hierarchical approaches. Is there a method considered the best fit for such problems?
  3. Methods to recover the correct value of sigma, particularly given that the scaling could vary.

Here’s hoping someone out there can help. Thanks :slight_smile:

The model code doesn’t match the formula you wrote down, which takes Y = A \cdot X + \epsilon. Assuming \epsilon \sim \text{normal}(0, \sigma), the likelihood should be \text{normal}(y \mid A \cdot X, \sigma). This makes the noise additive, as you wrote it in text, rather than multiplicative, as you wrote it in code.

That is, the noise should be additive, not multiplicative. And sigma is just the scale of the noise. So I think what you want is something like this:

sigma = pm.LogNormal("sigma", mu=np.log(0.5), sigma=0.1)
mu = pm.Deterministic("mu", A @ X)
pm.Normal("likelihood", mu=mu, sigma=sigma, observed=Y)

where X is your unknown matrix and A is known. I gave sigma a lognormal distribution because it’s constrained to be positive; you could also use a half normal, but I don’t think a full normal makes sense.


Hi Bob (and anyone else out there!)

That’s a good shout. In fairness, I probably mistyped as I was trying other alternatives and this one slipped through. I initially tried the way you suggested, as it made the most sense to me formulation-wise, though the problem of inferring \sigma remains. Admittedly this approach is much more robust. Before I mark this as solved, allow me to show a few examples on which I would really appreciate some feedback.

I suspect it is a signal-to-noise issue, but there are some strange goings-on.

When the scale of X is 1 and sigma is 0.1, I get decent inference with the following:

with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", sigma=2)  # Prior on the noise scale
    X_pymc = pm.Normal("X_pymc", mu=0, sigma=2, shape=(numslices, 4, 4))  # Prior on X
    Y_pymc = pm.Deterministic("Y_pymc", A_pt_noscale @ X_pymc)

    # Likelihood
    likelihood = pm.Normal("likelihood", mu=Y_pymc, sigma=sigma, observed=Y_obs_pt_noscale)

    # Draw some prior samples of X for a quick check (not used below)
    samples = pm.draw(X_pymc, draws=500)
    trace = pm.sample(draws=2000,
                      tune=2000)

The resulting inference for X across the slices is shown below (slices 1-100 on the x-axis, values on the y-axis). X is a stack of 100 slices of 4x4 matrices, hence the subplot layout. Looks reasonable to me…

Prior predictive checks are off, but this is resolved after sampling.
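In case anyone wants to reproduce the checks, something along these lines should do it (a minimal sketch using the standard pm.sample_prior_predictive / pm.sample_posterior_predictive and az.plot_ppc calls, with the model and trace defined above):

# Minimal sketch of the predictive checks (uses the model and trace above)
with model:
    prior_pred = pm.sample_prior_predictive(draws=500)
    trace.extend(pm.sample_posterior_predictive(trace))

# Compare the observations against prior and posterior predictive draws
az.plot_ppc(prior_pred, group="prior")
az.plot_ppc(trace, group="posterior")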


The posterior for sigma is OK.

The reconstructed value of Y also checks out, albeit with some areas still showing large errors.

I noticed that the ESS for sigma is very low in comparison to the other parameters.
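(For anyone checking this, the diagnostics can be pulled out with ArviZ along these lines; a quick sketch using the trace above:)

# Quick look at the sampling diagnostics for sigma (ESS, R-hat)
print(az.ess(trace, var_names=["sigma"]))
print(az.summary(trace, var_names=["sigma"], kind="diagnostics"))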

When I increase the scale to, say, X = 10 and sigma = 10 (and also increase the scale of the priors accordingly),

the values for X get a little wilder but are still not bad,

and sigma holds up too, though the ESS is now poor.

If I go on like this, I get progressively worse inference. It seems this should be reasonably easy to infer, so I would appreciate some help/opinions on how to fix it. If I make the signal somewhat more noisy (sigma = 100, say), then the inference on sigma ostensibly looks like it has settled on a confident value. However, the mean of 88 is a little off, and the correct value falls outside the bulk of the posterior.



I would be super grateful for any tips or obvious pointers. I did wonder if scaling the priors might be the answer, but I was unsure about how I would correctly scale the noise aspect. Presumably the scaling would be equivalent to any scaling of Y? A rough sketch of what I mean is below. Would more tuning or more samples help?
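To illustrate the scaling idea, here is a rough, untested sketch: standardise the observation by its empirical spread, infer a dimensionless noise scale, then multiply it back up afterwards. It reuses the arrays and imports from the code above, and the prior widths are guesses on my part.

# Rough, untested sketch of the rescaling idea: fit on Y_obs / Y_scale and
# recover the noise level on the original scale as sigma_std * Y_scale
Y_scale = Y_obs.std()            # empirical scale of the observation
Y_obs_std = Y_obs / Y_scale

with pm.Model() as scaled_model:
    sigma_std = pm.HalfNormal("sigma_std", sigma=1)        # noise on the standardised scale
    X_pymc = pm.Normal("X_pymc", mu=0, sigma=20, shape=(numslices, 4, 4))
    mu = (pt.as_tensor_variable(A) @ X_pymc) / Y_scale      # scale the prediction the same way as Y
    pm.Normal("likelihood", mu=mu, sigma=sigma_std, observed=Y_obs_std)

    scaled_trace = pm.sample()

# Back on the original scale:
sigma_recovered = az.summary(scaled_trace, var_names=["sigma_std"])["mean"].item() * Y_scale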

Here’s the code if you wish to run the extreme case I mentioned last; it doesn’t take long to reproduce the problem. I’ve included the code for the subplots too, in case it helps…

# Imports
import numpy as np
import pymc as pm
import pytensor.tensor as pt
import arviz as az
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(43)

# Constants
numslices = 100
A_scale = 10
X_scale = 10

A = A_scale * np.random.normal(loc=0, scale=1, size=(numslices, 4, 4))
X = X_scale * np.random.normal(loc=0, scale=1, size=(numslices, 4, 4))
Y_true = A @ X

# Add noise
noise_sigma = 100
noise = np.random.normal(0, noise_sigma, size=Y_true.shape)
Y_obs = Y_true + noise

# Convert to PyTensor tensors
A_pt_noscale = pt.as_tensor_variable(A)
Y_obs_pt_noscale = pt.as_tensor_variable(Y_obs)

#%%
# Define PyMC model
with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", sigma=100)  # Reflects true noise level
    X_pymc = pm.Normal("X_pymc", mu=0, sigma=20, shape=(numslices, 4, 4))  # Narrower prior
    Y_pymc = pm.Deterministic("Y_pymc", A_pt_noscale @ X_pymc)

    # Likelihood for normalized Y_obs
    likelihood = pm.Normal("likelihood", mu=Y_pymc, sigma=sigma, observed=Y_obs_pt_noscale)
    
    samples = pm.draw(X_pymc, draws=500)
    trace = pm.sample(draws= 2000, 
                      tune= 2000)

""" the remaining code is for data summary and plotting """
#%% Summarise the result 
summary = az.summary(trace) 

#%% print results for sigma  

sigma_mean = summary.loc["sigma", "mean"]  # Posterior mean of sigma
print(f"Inference sigma mean is {sigma_mean:.4f}. True sigma value is {noise_sigma} ") 


#%%  Extract the relevant samples
X_means = summary.loc[summary.index.str.startswith("X_pymc"), "mean"].values 
X_upper = summary.loc[summary.index.str.startswith("X_pymc"), "hdi_97%"].values 
X_lower = summary.loc[summary.index.str.startswith("X_pymc"), "hdi_3%"].values 

X_shape = (numslices, 4, 4) 
X_means_matrix = X_means.reshape(X.shape) 
X_upper_matrix = X_upper.reshape(X.shape) 
X_lower_matrix = X_lower.reshape(X.shape) 

Y_pymc_means = summary.loc[summary.index.str.startswith("Y_pymc"), "mean"].values 
Y_pymc_upper = summary.loc[summary.index.str.startswith("Y_pymc"), "hdi_97%"].values 
Y_pymc_lower = summary.loc[summary.index.str.startswith("Y_pymc"), "hdi_3%"].values 

Y_pymc_shape = (numslices, 4, 4) 
Y_pymc_means_matrix = Y_pymc_means.reshape(X.shape) 
Y_pymc_upper_matrix = Y_pymc_upper.reshape(X.shape) 
Y_pymc_lower_matrix = Y_pymc_lower.reshape(X.shape) 


Y_pred_mean_recovered = A @ X_means_matrix
Y_pred_lower_recovered = A @ X_lower_matrix
Y_pred_upper_recovered = A @ X_upper_matrix


#%% Compare inference X with True X

fig, axes = plt.subplots(4, 4, figsize=(18, 14), sharex=True, sharey=True) 
for row in range(4):
    for col in range(4):
        ax = axes[row, col]
        ax.plot(range(numslices), X[:, row, col], 'k', label="True X", alpha=1)
        ax.plot(range(numslices), X_means_matrix[:, row, col], 'r--', label="Inferred X_mean", alpha=1)
        ax.fill_between(range(numslices), 
                        X_lower_matrix[:, row, col], 
                        X_upper_matrix[:, row, col], 
                        color='b', alpha=0.3, label="HDI")
        ax.set_title(f"X[{row+1},{col+1}]")
        if row == 3: ax.set_xlabel("Slice")
        if col == 0: ax.set_ylabel("Value")

# Adding a global legend
handles, labels = axes[0, 0].get_legend_handles_labels()
fig.legend(handles, labels, loc="upper center", ncol=3)
plt.show()


#%% Plot true vs inferred values of Y

fig, axes = plt.subplots(4, 4, figsize=(18, 14), sharex=True, sharey=True)
# Loop through subplots
for row in range(4):
    for col in range(4):
        ax = axes[row, col]
        ax.plot(range(numslices), Y_true[:, row, col], 'k', label="True", alpha=1)
        ax.plot(range(numslices), Y_obs[:, row, col], 'b--', label="Observed", alpha=1)
        ax.plot(range(numslices), Y_pred_mean_recovered[:, row, col], 'r--', label="recovered", alpha=1)
        ax.fill_between(range(numslices), 
                        Y_pred_lower_recovered[:, row, col], 
                        Y_pred_upper_recovered[:, row, col], 
                        color='b', alpha=0.3, label="recovered HDI")
        ax.set_title(f'Y[{row+1},{col+1}]', fontsize=10)
        if row == 3: ax.set_xlabel('Slice', fontsize=9)
        if col == 0: ax.set_ylabel('Value', fontsize=9)

# Add legend
handles, labels = axes[0, 0].get_legend_handles_labels()
fig.legend(handles, labels, loc='upper center', ncol=4, fontsize=12, frameon=True, bbox_to_anchor=(0.5, 1.0))

# Add an overall title
fig.suptitle("Comparison of True vs Inferred Values Across Slices for Y", fontsize=16, y=0.95)

# Adjust layout
plt.subplots_adjust(top=0.90, bottom=0.05, left=0.05, right=0.95, hspace=0.4, wspace=0.4)

# Show the figure
plt.show()


Did you adapt the priors so that the model matched the data-generating process? If not, the fixed priors may be the culprit, such as the half normal on sigma with scale 2 (that notation is super confusing due to the three different uses of sigma), because it’s inconsistent with sigma = 100.

My guess is that PyMC is giving you the correct posterior mean given your prior. You should be able to test with more data to see if you get closer to 100, or by increasing the scale of the prior to at least sigma=50 if not sigma=100. Or, even better, give it a half normal with a location of 100 and, say, a scale of 50.
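For example (a sketch of the prior adjustments I have in mind, not tested; PyMC doesn’t have a shifted half normal as such, so I’d express the “half normal with a location of 100” as a normal truncated at zero):

import pymc as pm

with pm.Model():
    # Option 1: widen the half normal so it easily covers sigma = 100
    sigma_wide = pm.HalfNormal("sigma_wide", sigma=100)

    # Option 2: concentrate mass near 100; a normal centred at 100 and
    # truncated at zero plays the role of a "half normal with location 100"
    sigma_shifted = pm.TruncatedNormal("sigma_shifted", mu=100, sigma=50, lower=0)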


Hi - thanks for that. Really helpful :slight_smile:

I had indeed scaled the priors in line with the change in observation amplitude, but shifting the half normal is something I had not considered. I investigated this, and it revealed some interesting results.

I did what you said, namely

Or even better, give it a half normal with a location of 100 and say a scale of 50

and, behold, a sigma_noise (I changed the name!) of around 1, which with a shift of 100 is spot on:

with pm.Model() as model:
    shift = 100
    sigma_noise = pm.HalfNormal("sigma_noise", sigma=50) + shift  # Reflects true noise level

However, this requires me to have some accurate pre-knowledge of sigma. If I move the goalposts slightly and reduce the shift to, say, 50, while ensuring the scale of the HalfNormal still covers the correct value to be inferred (100), I again get that mean value of 88 (a mean of 38 plus a shift of 50).

I also see that the distribution below now looks normal rather than half normal. Is this a clue to some trade-off between priors and likelihoods?

with pm.Model() as model:
    shift = 50
    sigma_noise = pm.HalfNormal("sigma_noise", sigma=100) + shift  # Reflects true noise level

In cases like these, priors can matter a lot. Ideally, you are in a situation where you can use a hierarchical prior, because the whole thing is repeated multiple times. If not, I’d recommend trying priors that are weakly informative as to the scale of the answer. So if you expect the answer to be in the -300 to 300 range, a \text{normal}(0, 100) makes sense. But if you expect the answer to be in the 70 to 130 range, then a \text{normal}(100, 10) prior makes more sense, because it concentrates prior probability mass where you expect the answer to be (here, around 100).
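In PyMC that might look something like this (a sketch only, with made-up names, not tailored to your model):

import pymc as pm

with pm.Model():
    # Weakly informative, if the answer could be anywhere in roughly -300 to 300
    theta_wide = pm.Normal("theta_wide", mu=0, sigma=100)

    # Weakly informative, if the answer is expected in roughly 70 to 130
    theta_tight = pm.Normal("theta_tight", mu=100, sigma=10)

    # Hierarchical alternative when the same structure repeats across slices:
    # let the data inform a shared scale instead of fixing it by hand
    tau = pm.HalfNormal("tau", sigma=100)
    theta = pm.Normal("theta", mu=0, sigma=tau, shape=100)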

It’s still half normal. It’s just that you won’t be able to tell the difference between a normal

p(x) = \text{normal}(x \mid 100, 2)

and a positively-constrained half normal with a big shift,

p(x) = \text{normal}_+(x \mid 100, 2) = \text{normal}(x \mid 100, 2) / \int_0^\infty \text{normal}(x \mid 100, 2) \, \text{d}x,

because the integral in the denominator is indistinguishable from 1. If you compare a \text{normal}_+(x \mid 0, 1) then the integral in the denominator is

\int_0^\infty \text{normal}(x \mid 0, 1) \text{d} x = 0.5.
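You can check those two normalising constants numerically, e.g. with scipy (a quick sketch):

from scipy import stats

# P(X > 0) for X ~ normal(100, 2): indistinguishable from 1
print(stats.norm(loc=100, scale=2).sf(0.0))

# P(X > 0) for X ~ normal(0, 1): exactly 0.5
print(stats.norm(loc=0, scale=1).sf(0.0))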


I see - thanks!

Your comment about using hierarchical priors has tickled my fancy, but I’m not sure how to implement it. Is this for \sigma only? I’d appreciate anything you can offer on that front.

Finally, the inferred values for X in the previous posts look okay to me, but I wondered if you had any advice as to possible improvements? Some regions look a little off, such as…

Do you think more sampling and prior adjustment would make an improvement, or is there perhaps something else I have not considered? It’s really the latter point, as I can always try the former myself!

Many thanks once again. This has been massively helpful.