Project for learning on pymc site for finding mean temperatures for european cities

Project for learning on pymc site for finding mean temperatures for european cities
with pm.Model(coords=coords) as model: data = pm.ConstantData(“observed_temp”, df_data, dims=(“date”, “city”)) europe_mean = pm.Normal(“europe_mean_temp”, mu=15.0, sigma=3.0) city_offset = pm.Normal(“city_offset”, mu=0.0, sigma=3.0, dims=“city”) city_temperature = pm.Deterministic( “expected_city_temp”, europe_mean + city_offset, dims=“city” ) sigma = pm.Exponential(“sigma”, 1) pm.Normal(“temperature”, mu=city_temperature, sigma=sigma, observed=data, dims=(“date”, “city”)) idata = pm.sample( random_seed=RANDOM_SEED,).
axes = az.plot_trace(idata, var_names=[“europe_mean_temp”, “expected_city_temp”], legend=True);

This is a very elegant method for finding [“europe_mean_temp”, “expected_city_temp”].however I believe that the “europe_mean_temp” which is a very useful and important output of this model but this is not what the posterior samples give or converge to for this variable. Instead I enclose the following explanation:
Calculation of europe_mean_temperature:

(1) city_temperature = (ref_temperature + city_offset),

(2) europe_mean_temperature = {city_temperature(Berlin) + city_temperature(Paris) + city_temperature(San Marino)} / 3,

(3) from (1) and (2), therefore: europe_mean_temperature = ref_temperature + (offset_Berlin + offset_Paris + offset_San_Marino) / 3.

If the sample data is given by:

#Set-up Known Data with wide Pandas format.Equal Samples Sizes

date = pd.date_range(start="2020-05-01", end="2020-07-01")
df = pd.DataFrame({'date':date }).set_index('date')
for city, mu in {"Berlin": 15, "Paris": 16,"San Marino": 20}.items():
    df[city] = pm.draw(pm.Normal.dist(mu=mu, sigma=1),len(date))
    np.random.seed(4)
    df[city]=stats.norm(loc=mu,scale=1).rvs(len(date))
df.index.name = "day"
print('No. of date_measurements =',len(df.index),'\nNumber of_cities =',len(df.columns))
display(df.head())
df.describe().T.round(2)

and the corrected model:

coords = {'mydate':df.index,"mycity": ['Berlin', 'Paris', 'San Marino']} 

with pm.Model(coords = coords) as model_1:
   
    data = pm.ConstantData("observed_temp", df, dims=("mydate", "mycity"))

    
    # group priors
    ref_temperature = pm.Normal("ref_temperature", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="mycity")
   
    city_temperature = pm.Deterministic("city_temperature",ref_temperature + city_offset, dims="mycity")   #leave in dims='mycity' for plot
    sigma = pm.Exponential("sigma", 1)

    
   
    # Calculation of europe_mean_temperature
    offset_Berlin=city_offset[coords['mycity'].index('Berlin')]
    offset_Paris=city_offset[coords['mycity'].index('Paris')]
    offset_San_Marino=city_offset[coords['mycity'].index('San Marino')]
    europe_mean_temperature=pm.Deterministic('europe_mean_temperature',
                                             ref_temperature+(offset_Berlin+offset_Paris+offset_San_Marino)/3)
    

    
    pm.Normal("temperature", mu=city_temperature, sigma=sigma, observed=data, dims=("mydate", "mycity"))                                                                         
    
    idata1 = pm.sample(500)
    idata1.extend(pm.sample_prior_predictive())
    idata1.extend(pm.sample_posterior_predictive(idata1))
    az.summary(idata1,kind='stats').rename_axis('Temperature Analysis').round(2) 
az.plot_trace(idata1,legend=True)

with the original pymc model the european_mean_temp v city_offset correlation is approx -1. which again raises the suspicion that the european_mean_temp is not the average of the component cities in the model outlined on the pymc website…

As I am new to modelling and to pymc , I put forward this tweek to the pymc introduction site model with great caution and would value your opinion if I am miss-interpreting some aspect of the design. Thank you for your time an expertise on this puzzle. declan

2 Likes

Thanks for the post! Two quick questions. First, what is faulty/incorrect/not ideal about the existing model/notebook? Second, what did you tweak to fix/improve it? I flipped back and forth between your model/results an the original, but I am having difficulty quickly answering those 2 questions.

Thanks!

Thank you for looking at post.

The code on the pymc site: defines expected_city_temp" as europe_mean + city_offset, and the results can be read in: az.summary() . Looking down through these results you will see a figure quoted for “europe_mean_temp”, but becaues of the way the model is framed this cant be representative of the european mean of the component cities because of having been defined as follows: expected_city_temp" = europe_mean + city_offset. eg with this definition the mean of the sum of the three city temperatures will not be expected to converge to the true mean of the 3 cities. I suggest that what is labelled in the code as europe_mean should perhaps be labelled as a ref_temp for the city_offsets and labelling it as a ref_temperature would prevent it from being interpreted as a mean of the component individual cities. Then if the true european mean is required this can then be obtained from :

Calculation of ‘europe_mean_temperature’

(1) Since: 'city_temperature'=(ref_temperature + city_offset),
(2) 'europe_mean_temperature' = [city_temperature(Berlin)+city_temperature(Paris)+city_temperature(San Marino)]/3
 (3) from (1) therfore ='europe_mean_temperature' = ref_temperature+(offset_Berlin+offset_Paris+offset_San_Marino)/3

I am a newbi but I have attempted to modify the code by : replacing europe_mean by ref_temperature and since the european mean estimate from the posterior samples is a useful entity I have introduced it in the line: europe_mean_temperature=pm.Deterministic(‘europe_mean_temperature’,
ref_temperature+(offset_Berlin+offset_Paris+offset_San_Marino)/3)
the results in az.summary now make sense and the europe_mean in az.summary() now adds to the listed mean of the temperatures of the 3 citties.
city_temperature[Berlin] 15.06
city_temperature[Paris] 16.05
city_temperature[San Marino] 20.04
europe_mean_temperature 17.05

note that europe_mean_temperature is now equal to (15.06+16.05+20.04)/3=17.05 where as the variable listed as europe_mean_temperature in the original model this is not true. I hope this clarifies the point and I hope my coding attempt gets around the problem. please let me know if I have misinterpreted anything as this is such a neat elegant model. declan

Ah, I see now. I think the difference the two values you calculated (ref_temperature and europe_mean_temperature) represent different things and are each correct assuming you want the quantity it represents.

The different cities’ temperatures are expected to be ref_temperature+city_offset. So Berlin’s temperature is expected, to be ref_temperature+city_offset[0], Paris’s ref_temperature+city_offset[1], and San Marino’s ref_temperature+city_offset[2]. If we take the mean of these three deterministics

europe_mean_temperature=pm.Deterministic('europe_mean_temperature',
                                         ref_temperature+(offset_Berlin+
                                                          offset_Paris+
                                                          offset_San_Marino)/3)

what we have is the mean temperature of these three cities. As you point out, the mean of the posterior associated with this deterministic is very close to the mean of the observed data (whereas the mean of the ref_temperature posterior doesn’t quite match the mean of the observed data).

The key insight is that Europe has cities other than Berlin, Paris, and San Marino. So the mean of these three particular cities’ temperatures may or may not represent the mean of all of Europe. Maybe we picked 3 particularly hot (e.g., southern) cities or perhaps we selected 3 particularly cold (e.g., northern, high elevation) cities. If we’re not sure about the representatives of our 3 cities, one way to account for this is to assume that there is some overall, but unknown, mean temperature that applies to all of Europe. Each city’s temperature would then be expected to deviate from this overall mean by some unknown amount.

The original model creates a parameter that corresponds to this overall, unknown Europe-mean accordingly

europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)

which encodes some prior knowledge that this overall, unknown Europe-mean is somewhere in the vicinity of 15 (with our confidence encoded in the sigma of 3). Because each observation is associated with a given city and each city’s temperature (partially) reflects this parameter (city temperature is europe_mean + city_offset), the observed data from Berlin, Paris, and San Marino will influence our (posterior) beliefs about europe_mean, but only to some extent. Why? Because we have only observed temperatures in 3 cities.

Perhaps the easiest way to see the difference between these 2 quantities is to imagine that you only have temperature from a single European city. It would be a bit odd to say that you believe that the average temperature of Europe is equal to the average temperature observed somewhere in Norway (for example).

Hopefully that helps. Also, please note that a) I didn’t write the notebook and so I am making assumptions about why the model was put together the way it was and b) the purpose of that notebook is to illustrate how to handle data, not how to handle hierarchical models/nested data (see here instead).

2 Likes

Thank you so much for your very prompt, thoughtful and detailed reply which I agree with and has helped me to see the model from different perspective. Thanks again.
I am now satisfied that the pymc model on the site is correct (which shows a difference between the “european mean” (of the 3 component cities) and the mean of the individual cities in the posterior output. Seemingly, this is acceptably explained by ‘shrinkage’ which is outlined for hierarchical models by Osvaldo Martin’s Bayesian book and the effect seen is expected. Other checks shows that the effect is still there with say 10 cities, or with a very large number of temperature measurements which should overwhelm the prior or with a Uniform prior on the european temperature. The cities with closer agreeing values exert a bigger pull than an ‘outlier’ city say with a more extreme temperature , moving the european mean estimate away from the the mean of the individual cities: ie the majority values are given more believability, such is the wisdom of Bayes Rule!
Thanks again for your time and experience in assessing these theoretical concerns.

Prior knowledge about the mean of of the 3 cities’ temperatures is encoded in both ref_temperature and city_offset. So if you have some knowledge about the reference temperature (whatever that reference represents) and some idea about where the 3 cities you have data on fall relative to that reference, then you can adjust each accordingly.

However, this strategy can break down rather quickly. For example, imagine that I give you hyper-precise daily measurements of the temperature of these three cities taken over the last 100 years and further imagine that you have no idea where these cities might fall relative to the overall European mean. I that case, you would (I assume) no longer wish to model each city’s prior temperature as some combination of your knowledge of Europe’s mean (about which you are highly uncertain) and each city’s temperature relative to that continental mean (about which you are highly uncertain). Doing so, would cause your prior for each city’s mean to be highly uncertain, which doesn’t seem to be consistent with that scenario (something you would quickly discover if you performed some prior predictive checking). Instead, you would want to “disconnect” the individual city means (about which you have good prior knowledge) from the European mean (about which you have poor prior information).

However, you might instead imagine a scenario in which you have strong prior knowledge about the entire set of cities (or region), not about the cities per se and not about Europe. In that case, you could instead replace the European reference with a “regional” reference and use the same basic model. If you have some knowledge about a) how each city’s temperature was related to the region’s temperature and b) how the region’s temperature was related to the temperature of European, then you could add another level to the hierarchy.

Hope that helps.

Thank you so much for your very prompt, thoughtful and detailed reply which I agree with and has helped me to see the model from different perspective. Thanks again.
I am now satisfied that the pymc model on the site is correct (which shows a difference between the “european mean” (of the 3 component cities) and the mean of the individual cities in the posterior output. Seemingly, this is acceptably explained by ‘shrinkage’ which is outlined for hierarchical models by Osvaldo Martin’s Bayesian book and the effect seen is expected. Other checks shows that the effect is still there with say 10 cities, or with a very large number of temperature measurements which should overwhelm the prior. The cities with closer agreeing values exert a bigger pull than an ‘outlier’ city say with a more extreme temperature , moving the european mean estimate away from the the mean of the individual cities: ie the majority values are given more believability, such is the wisdom of Bayes Rule!
Thanks again for your time and experience in assessing these theoretical concerns.

1 Like