Generating estimates using a distribution for a predictor

Brief background: I have a model that describes the inverse relationship between the load on a barbell and the maximum intended velocity achieved. In this model there's a concept called "minimum velocity threshold" (MVT); simply put, it's the lowest velocity one can move at and still manage to complete the lift.

Traditionally you can either test the MVT with a true max-attempt test, or you can approximate it by doing a set with as many reps as possible, where the velocity of the last successful rep would be your MVT.

Over time, the MVT will decrease as you get more skilled at moving weights.

Now back to my model. In order to estimate what % of the max load a given velocity corresponds to, I need to compare a load estimate with the max load estimate at the MVT. Doing this with fixed numbers is quite easy, e.g. (alpha + beta*x) / (alpha + beta*MVT), and I currently solve it by sampling all the velocities from my model and comparing the traces with the traces corresponding to my MVT: ppc[i].x / ppc[i].mvt_load
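
For concreteness, the fixed-number version looks roughly like this (a minimal sketch; alpha, beta and the velocities are made-up values):

import numpy as np

# hypothetical fitted load~velocity line: load = alpha + beta * velocity
alpha, beta = 222.0, -215.7
mvt = 0.24  # fixed point estimate of the minimum velocity threshold

velocities = np.array([0.30, 0.40, 0.50])
load_est = alpha + beta * velocities  # estimated load at each velocity
mvt_load = alpha + beta * mvt         # estimated max load, i.e. load at the MVT
intensity = load_est / mvt_load       # fraction of max load; 1.0 exactly at the MVT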

But how would I go about doing this if my MVT is a distribution?

What do you mean by saying MVT is a distribution? Is its value not known? For your notation ppc[i].x / ppc[i].mvt_load, it looks like there is a value per iteration of your PPC.

I mean that after testing I estimate that the MVT for a certain movement might be 0.24 +/- 0.03. That's the distribution I mean :slight_smile: Up until now I've just compared against the point estimate of 0.24, but I want to include the uncertainty of the MVT in my model.

Yes. I might for example run the PPC for 0.35 and 0.24 (the MVT) and compare each estimate in each row of the PPC to the column corresponding to the MVT.

pm.set_data({"x_shared": [0.35, 0.24]})
# the trace from the original fit is needed here; "load" is an assumed name
# for the observed variable in the model
ppc = pm.sample_posterior_predictive(trace)
intensity = ppc["load"][:, 0] / ppc["load"][:, 1]

Hi Mattias,
IIUC, you want to include uncertainty around your MVT predictor in your model, in which case it's quite easy in the Bayesian framework: just put a distribution around your predictor! Something like:

mvt_data = pm.Data("mvt_data", mvt_array)
# the MVT data is uncertain, so we add uncertainty around it;
# values accounting for that uncertainty will be estimated via MCMC
mvt_diff = pm.Normal("mvt_diff", mu=0.0, sigma=0.03, shape=len(mvt_array))
mvt_uncert = pm.Deterministic("mvt_uncert", mvt_data + mvt_diff)

And then you just use this variable as if it were a predictor. So, in your regression for instance:

mu = intercept + mvt_effect * mvt_uncert
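
Put together, a full toy version could look like this (a minimal sketch; the placeholder data, priors and likelihood are just illustrative, not your actual model):

import numpy as np
import pymc3 as pm

# purely illustrative placeholder data
mvt_array = np.array([0.22, 0.24, 0.25, 0.27])
load_data = np.array([150.0, 160.0, 155.0, 170.0])

with pm.Model() as model:
    mvt_data = pm.Data("mvt_data", mvt_array)
    # uncertainty around the measured MVT values
    mvt_diff = pm.Normal("mvt_diff", mu=0.0, sigma=0.03, shape=len(mvt_array))
    mvt_uncert = pm.Deterministic("mvt_uncert", mvt_data + mvt_diff)

    # regression coefficients and noise (priors are placeholders)
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    mvt_effect = pm.Normal("mvt_effect", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # the uncertain MVT enters the regression like any other predictor
    mu = intercept + mvt_effect * mvt_uncert
    pm.Normal("load_obs", mu=mu, sigma=sigma, observed=load_data)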

Is this what you’re looking for?

Not sure if that’s what I’m looking for :see_no_evil:

What I'm trying to do is compare two estimates from my model based on two new predictors; the challenge is that I'm not sure what the second new predictor (the MVT) is.

Ultimately I think that my original model has a challenge. It models load ~ velocity and I want to turn that into load/est_mvt_load ~ velocity. The problem occurs when defining at what velocity est_mvt_load sits, since that velocity is uncertain.

I could draw a bunch of samples from the MVT distribution and then use those as new data. However, that would increase the number of data points quite a lot. Let's say that I want to get estimated intensity for 100 velocities, using 100 samples from the MVT distribution and 2000 draws. That's a whopping 20M data points for a single group, and I have 3-10 groups in my data.

Or should I perhaps create two models: one for load ~ velocity, then use that model to create data for an intensity ~ velocity model using the MVT?

Alex's suggestion wouldn't increase the number of data points. If you sample from the PPC and the MVT is now a distribution, you will get a different posterior predictive value of the MVT per iteration. You might want to increase the number of PPC samples in order to estimate your value properly.

I still don't quite get what you are trying to do in general. If MVT is now a probability distribution, not a fixed value, then the outcome measure you want to calculate (call it MIV) is also a distribution, not a single value. If you select a single value from your MVT distribution (call it x) and then calculate MIV, the value you have is conditioned on MVT = x. It's a conditional value. To get the unconditional distribution of MIV you would have to marginalize out MVT.
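
Written out, the marginalization is just (standard probability, nothing model-specific):

p(MIV) = \int p(MIV \mid MVT = x) \, p(MVT = x) \, dx

Averaging the conditional values over draws of MVT from its distribution approximates exactly this integral.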

Thank you both for having the patience :slight_smile:

This was a lot harder to describe than I initially thought. I tried to go through it with a friend in Swedish and it was just as hard. Lol.

I could summarize it like this, with three models:

  1. Model A describes load ~ velocity
  2. Model B describes the minimum velocity as Normal(0.25, 0.03)
  3. Model C describes intensity ~ velocity using both model A and model B as input.

Intensity = load / max load, where the max load is derived from model A using model B as input.

So, in the end I would expect that sampling from model C using e.g. 0.30 as input would return values from Normal(1, 0.1)

Another way of looking at it: I'm trying to normalize the loads to go from 0 to 1, with 1 placed at the estimated load from using model B as the predictor.

Today I'm just using the mean of model B as the predictor to accomplish the same thing.

I’m completely lost, sorry.

Is your process like
[image: option_1]
or like
[image: option_2]?

EDIT: Can you also give us some examples of expected inputs and outputs? Just to help visualise what’s going on

Perhaps this statsmodels example will make it easier to understand :slight_smile:

import statsmodels.api as sm
import numpy as np

# The data to use for initial modeling
obs_load = [20, 40, 60, 80, 100, 120, 140]
obs_velocity = [0.9, 0.81, 0.8, 0.69, 0.59, 0.45, 0.37]
obs_velocity = sm.add_constant(obs_velocity)

# Frequentist OLS model
model = sm.OLS(obs_load, obs_velocity)
model_fit = model.fit()
model_alpha, model_beta = model_fit.params

# Generate predictions from model
pred_velocities = np.arange(0, 1, 0.01)
pred_velocities = sm.add_constant(pred_velocities)
load_estimates = model_fit.predict(pred_velocities)
load_estimates
array([222.03692383, 219.88018095, 217.72343807, 215.5666952 ,
       213.40995232, 211.25320944, 209.09646656, 206.93972368,
       ...,
        14.98960753,  12.83286465,  10.67612178,   8.5193789 ])
# Generate load prediction for MLE of MVT
# The goal is to set this to be Normal(0.25, 0.03) and estimate using possible values of mvt
mvt = 0.25
mvt_load = model_alpha + model_beta * mvt
mvt_load
168.11835187675757
# Convert model from load~velocity to intensity~velocity
intensities = load_estimates / mvt_load

intensities
array([1.32071794, 1.30788923, 1.29506051, 1.28223179, 1.26940307,
       ...,
       1.        , 0.98717128, 0.97434256, 0.96151385, 0.94868513,
       ...,
       0.10198976, 0.08916104, 0.07633232, 0.06350361, 0.05067489])

That makes more sense, thanks. I think this is what you’re looking for (with code up to # Generate load prediction for MLE of MVT remaining the same):

import pymc3 as pm

no_of_samples = 1000
with pm.Model() as model:
    # the MVT with its measurement uncertainty
    mvt = pm.Normal("mvt", 0.25, 0.03)
    # load at the MVT, using the coefficients estimated above
    mvt_load = pm.Deterministic("mvt_load", model_alpha + model_beta * mvt)
    intensities = pm.Deterministic("intensities", load_estimates / mvt_load)
    trace = pm.sample(no_of_samples)

The shape of trace["intensities"] is automatically (no_of_samples, len(load_estimates)) - so for each value of load_estimates you have samples from the distribution of possible intensity values
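
As a quick usage sketch (assuming numpy is imported as np and you've run the model above), you can then summarize the intensity distribution per velocity:

intensity_samples = trace["intensities"]          # shape (no_of_samples, 100)
intensity_mean = intensity_samples.mean(axis=0)   # posterior mean per velocity
# 94% credible interval per velocity
lower, upper = np.percentile(intensity_samples, [3, 97], axis=0)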

I think that makes sense.

That said, two challenges follow. The first is that the OLS model above is actually a PyMC model, so I have 2000 draws from the posterior predictive for pred_velocities rather than a single estimate per input, i.e. load_estimates, model_alpha & model_beta are all 2000 draws from a PyMC model.

The second is that the MVT model you defined is at the group level, whereas I have a hierarchical model in mind with exercise_id as the group identifier (this is also true for the load model).

The first part is straightforward: you can feed those directly into my code - you'll just end up with an output shape of (no_of_intensity_samples, len(load_estimates), no_of_ols_samples) - no changes required.

I'm not very familiar with hierarchical models, but I think the shape broadcasting will still happen automatically, and you shouldn't have much trouble.
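
To make the shape claim concrete, here's a quick check with dummy numpy arrays (all sizes and values are made up):

import numpy as np

n_ols = 2000   # posterior draws from the load model
n_vel = 100    # prediction velocities

# stand-ins for posterior draws of the load model
model_alpha = np.random.normal(222.0, 5.0, size=n_ols)
model_beta = np.random.normal(-215.0, 5.0, size=n_ols)
load_estimates = np.random.normal(150.0, 20.0, size=(n_vel, n_ols))

# a single draw of the MVT; stacking many such draws adds the leading
# no_of_intensity_samples dimension
mvt = np.random.normal(0.25, 0.03)
mvt_load = model_alpha + model_beta * mvt   # shape (n_ols,)
intensities = load_estimates / mvt_load     # broadcasts to (n_vel, n_ols)
print(intensities.shape)                    # (100, 2000)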

Awesome, thanks for helping!
