Rookie Question On Combining PCA and Bayesian Inference

Hello,

I’m not a statistician or a programmer, but I try to learn about both outside of my professional work. I’m interested in building a model with pymc3 that combines PCA with Bayesian inference, but I’m not exactly sure how to do it properly. My understanding is that PCA is used to combine multiple variables into principal components, and that for the Bayesian part I need to assume priors, i.e. the distributions of certain components, and then use them to build a function that produces a distribution which hopefully doesn’t deviate too much from the actual data. That’s how I understand it, but I’m confused when it comes to the specifics.

For example, say I want to build a model to predict GDP, and I have three economic time series that I combine with PCA to produce the principal components (PCs). I assume I now have to break apart the equation that produces the PCs and set priors on it. So I put my three economic indicators into a dataframe called ‘df’, use the PCA function from sklearn to get the ‘principal_components’, and with some reverse engineering I learn that the factors (the weights) are applied to ‘df’ to produce the principal components. So here is what I build:

with pm.Model() as pca_model:
    
    #define priors /parameters
    factors = pm.Normal('factors', mu=0, sigma=10)
    
    #define likelihood
    equation = (factors * df).sum(axis=1)
    likelihood = pm.Normal('y', mu=equation, sd=1, observed=principal_components)
    
    #posterior
    trace = pm.sample(1000)

I’m hoping someone can point to where I’m wrong because I’m sure I am misunderstanding something.

Thanks,
Matt

Hi,

It’s a bit difficult for us to help without seeing the error you’re getting, or knowing which results you got and what you were expecting.

I’m also a newbie here, but giving more detail about your problem will help others assist you.

I’m not getting an error; I’m just looking for clarification on whether I am doing it correctly, both the statistics and the programming.

One thing I can see right away is that you only have a single estimated weight shared by all factors. This line here:

    #define priors /parameters
    factors = pm.Normal('factors', mu=0, sigma=10)

Only defines a single normal distribution. So when you do this:

    #define likelihood
    equation = (factors * df).sum(axis=1)

You are defining the mean of the data generating process as:
\mu_i= \sum_{j=0}^k \beta X_{i, j}

That is, every column of data (indexed by j) gets the same weight \beta. If you want a different weight for each column, you need to make more betas, as follows:

    #define priors /parameters
    factors = pm.Normal('factors', mu=0, sigma=10, size=(k,))
    
    #define likelihood
    equation = df @ factors # (n, k) (k, ) -> (n,)

Are you trying to combine economic indicators into a single index and do PC-OLS? If so, the observed data should be your variable of interest (GDP), not the principal components. You can get the weights used to compute the principal components directly from the sklearn PCA object; see here for details.
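
Something like this, roughly (just a sketch, not a drop-in solution — I’m assuming `principal_components` is the (n, 1) array you got back from sklearn and `gdp` is the length-n target series; swap in whatever names you actually have):

import pymc3 as pm

with pm.Model() as pc_ols_model:
    # priors on intercept, slope, and noise scale
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # the mean of GDP is a linear function of the single principal component
    mu = alpha + beta * principal_components[:, 0]

    # GDP itself is the observed data, not the principal components
    gdp_obs = pm.Normal('gdp_obs', mu=mu, sigma=sigma, observed=gdp)

    trace = pm.sample(1000)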

This might be helpful: probabilistic+pca.ipynb · GitHub

Would be great to turn this into a pymc-example.

1 Like

Thanks for your message, it was helpful.

So if I have a df with three columns (income, employment, consumption), k should have a value of 3 in the following equation?

    #define priors /parameters
    factors = pm.Normal('factors', mu=0, sigma=10, size=(k,))
    
    #define likelihood
    equation = df @ factors # (n, k) (k, ) -> (n,)

And yes, I am trying to combine economic indicators into a single index in order to predict GDP. I assumed that, since these three indicators are put together to create the principal components (which form the index), the principal components should be the observed data.

Ideally, the final goal would be to develop a model that recalibrates every month: it would select around 20 economic time series out of a total of 100 depending on which series have the most explanatory power for GDP, apply weights to those 20 series again according to explanatory power, and then use Bayesian inference to come up with probabilities for what GDP should be based on what the 20 most relevant time series are showing. This led me to try to combine PCA with Bayesian inference.

I have been working with sklearn’s PCA object by creating a df with my economic indicators (excluding GDP) to produce a single principal component that has an adjusted R squared with GDP in the 60s. But I think a Bayesian approach could improve the model.

k is the number of factors you have in the model. Since you’re doing PCA first, it will depend on how many components you show the model. You take your 3 time series and form three principal components, then decide how many to include in the regression.

If you want to select among a large number of explanatory variables, you might consider just using highly skeptical priors (like Laplace) and including all your data. Penalized regression can be seen as a special case of Bayesian regression with shrinkage priors (a Gaussian prior corresponds to ridge, a Laplace prior to the lasso), so variables with low explanatory power will see their coefficients “shrunk” towards zero. I admit I don’t totally understand the weighting scheme you describe, though, so this might not be what you want to achieve.
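
To give a rough idea of what I mean (a sketch under assumptions: `X` is an (n, 100) array of standardized indicators and `gdp` is the length-n target — both placeholders):

import pymc3 as pm

with pm.Model() as shrinkage_model:
    # skeptical (Laplace) priors on all coefficients shrink weak
    # predictors towards zero, so no pre-selection step is needed
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    betas = pm.Laplace('betas', mu=0, b=1, shape=X.shape[1])
    sigma = pm.HalfNormal('sigma', sigma=10)

    mu = alpha + pm.math.dot(X, betas)
    gdp_obs = pm.Normal('gdp_obs', mu=mu, sigma=sigma, observed=gdp)

    trace = pm.sample(1000)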

I’d also encourage you to consider a more robust evaluation metric than R squared, like some kind of out-of-sample error. R squared can be juiced up as high as you’d like, and won’t give you any indication about your model’s ability to generalize (forecast).

I believe the PCA model I’ve been running is with only 1 component.

scale_function = lambda x: (x - x.mean()) / x.std()
pca = PCA(n_components=1).fit(data.apply(scale_function))

Does that mean I have 1 factor, or because I have three variables it would be three factors?

Laplace sounds interesting. I don’t know much about it but when I looked at the distributions of my economic series they have quite high peaks and slim bodies so it might be a good match.

Let me clarify my weighting scheme. The idea was to start with a total of 100 economic series. I would then narrow that down to 20, and to each of those 20 I would apply a weight depending on its explanatory power, but I wouldn’t keep the weighting fixed over time. It would adjust month after month according to its explanatory power. If, for example, consumption’s explanatory power on GDP was growing stronger compared to the prior month, the weight would be adjusted to reflect that. I hope that was clear.

My evaluation metric for selecting the economic indicators was an OLS regression of each one individually against GDP. I looked at R squared as well as mean squared error, heteroscedasticity, and the p-value.

Here is the regression function I built to run regressions more quickly. Perhaps I’ve made some errors without realizing it.

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def regression(x_var, y_var):
    
    # full-sample OLS of GDP (column 1) on the indicator (column 0);
    # combine_data_dropna is my own helper that aligns the two series and drops NaNs
    df_actual = combine_data_dropna([x_var, y_var])
    X = sm.add_constant(df_actual.iloc[:, 0])
    Y = df_actual.iloc[:, 1]
    res = sm.OLS(Y, X).fit()
    
    # 80/20 chronological train/test split for out-of-sample checking
    df = combine_data_dropna([x_var, y_var])
    split_index = round(len(df) * 0.80)
    split_date = df.index[split_index]
    df_train = df[df.index <= split_date].copy()
    df_test = df.loc[df.index > split_date].copy()
    X_train = sm.add_constant(df_train.iloc[:, 0].values)
    y_train = df_train.iloc[:, 1].values
    X_test = sm.add_constant(df_test.iloc[:, 0].values)
    y_test = df_test.iloc[:, 1].values
    ols_results = sm.OLS(y_train, X_train).fit()
    y_pred_train = ols_results.predict(X_train)
    y_pred_test = ols_results.predict(X_test)  # out-of-sample predictions are computed but never scored
    
    # Breusch-Pagan heteroscedasticity test on the full-sample residuals
    h_test = het_breuschpagan(res.resid, res.model.exog)
    print(res.summary())
    print('Mean Squared Error: ' + str(res.mse_model.round(3)))  # note: mse_model is the explained (model) mean square; mse_resid is the residual MSE
    print('Heteroscedasticity Test: ' + str(h_test[3].round(3)))  # F-test p-value
    print('P-Value: ' + str(h_test[1].round(3)))  # LM-test p-value
I appreciate the help as you can tell I am out of my depth.

You have as many factors as columns of your data. You’re only taking 1 component from the PCA, so that’s 1, but if you add a constant that’s 2.

It’s important to realize that these priors are on the beta coefficients, not the data. You are assigning your prior belief about the effect size of a given time series on GDP. Assigning a Laplace prior to the effect sizes biases your model away from large effects, shrinking everything towards zero. The idea is that if you have many time series to include, you don’t need to do PCA at all. Instead, you can bias all your estimates towards zero, so only the truly important ones will make it through the filter, so to speak.

You can learn more about regularizing priors in this lecture of Statistical Rethinking. If you’re new to Bayesian stuff, I recommend this entire series to help you get familiar with how it all works. These lectures use R and Stan, but it’s all been ported to Python and PyMC here [EDIT: This link is outdated, see below]. You could also check out this StatQuest video about Ridge and Lasso regression if you’re not familiar with the idea of shrinkage for model selection (although the framework is not Bayesian, the ideas are the same).

I’m bringing all this up because it’s important to realize that PCA is just a linear transformation of the design matrix. In principle it doesn’t do anything more than any other linear model like OLS. People use it in frequentist frameworks to get around perfect collinearity, but Bayes offers you alternatives that I would argue are more principled (also, there’s no need to invert matrices, so perfect collinearity isn’t a problem). For example, why did you choose 1 component? Why not 2? For 100 series, why 20 and not 21? How are you doing these weighting adjustments? Why monthly adjustments? Etc., etc. Designing a model to do all this is much more defensible.

2 Likes

Thanks for all this information. I will definitely find some quality time this weekend with a big pot of coffee and dive into it.

I agree that choosing 20 series is arbitrary, but I haven’t reached that point yet. I am still working out in my head what type of model I want and trying to find the best tools before I finalize a building plan. I came across PCA and dynamic factor models, and the idea of using weights appealed to me, so I started getting more familiar with it in Python by building a simple model. Then I came across the Bayesian approach and how it is said to be superior to frequentist ideas, and it sounded promising, so I started working with pymc to get a feel for it. That led me to the idea of combining PCA and Bayesian inference, but because of my knowledge gap in statistics I couldn’t be sure my thinking was correct, which led me here. But if you’re saying that I don’t need PCA, and that Bayesian methods alone can provide a model that matches the plans I have in my head, then I will definitely be sympathetic to that idea.

You said that I can apply Laplace priors to the effect sizes/beta coefficients. I’m still trying to understand this in the overall context of building a Bayes model. Going back to my simplified model with three economic indicators, and forgetting PCA for a minute, would that mean creating a Laplace prior for the beta coefficients with the size variable (k) set to 3? And then, if the GDP values are the observed data, would I need to construct an appropriate model for the mean that uses this prior variable? Is this the general idea?

As a complement, you might want to check pymc-resources/Rethinking_2 at main · pymc-devs/pymc-resources (github.com) instead of the other link that @jessegrabowski provided.

That one is older, from pymc3. This one has been ported to pymc v4, though not yet to pymc v5; someone will probably do that soon. (I’ll try if I find some time, since I am also a rookie.)

2 Likes

Let’s just forget about the whole Laplace/PCA/shrinkage stuff altogether and make sure we’re on the same page with vanilla linear models. All Gaussian linear models follow the same form:

y \sim N(\mu, \sigma)
\mu = \alpha + \sum_{i=0}^k \beta_i \cdot X_i

For you, y is GDP, and you have either a bunch of economic time series or some principal components as X.

Either way, you need to put priors on the parameters in your model. If you have 1 principal component (let’s call it X), then your model is:

\text{GDP}_t \sim N(\mu_t, \sigma)
\mu_t = \alpha + \beta \cdot X_t

So there are 3 parameters: \alpha, \beta, and \sigma. Using my language from before, k=1 because there is one covariate, X. You need to tell the model what you think reasonable values for these are by setting priors. This will, in turn, determine your beliefs about the possible values of GDP.

If you use more things to model \mu_t, you will have more parameters. For example, with 3 factors, call them C_t, I_t, L_t for consumption, investment, and aggregate hours worked, you will have:
\text{GDP}_t \sim N(\mu_t, \sigma)
\mu_t = \alpha + \beta_C \cdot C_t + \beta_I \cdot I_t + \beta_L \cdot L_t

Now you need priors on 5 parameters: \alpha, \sigma, and 3 betas. Using my language from before, k=3 now.

Whether you use PCA or not, none of this part changes.
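
In code, the 3-factor version might look something like the sketch below (assumptions: the three series live in columns ‘C’, ‘I’, ‘L’ of a dataframe `df`, and `gdp` is a length-n array; the names and prior widths are just placeholders to illustrate the structure):

import pymc3 as pm

with pm.Model() as gdp_model:
    # priors on the 5 parameters: alpha, sigma, and the 3 betas
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    betas = pm.Normal('betas', mu=0, sigma=10, shape=3)  # k = 3
    sigma = pm.HalfNormal('sigma', sigma=10)

    # mu_t = alpha + beta_C * C_t + beta_I * I_t + beta_L * L_t
    mu = alpha + pm.math.dot(df[['C', 'I', 'L']].values, betas)

    # GDP is the observed data
    gdp_obs = pm.Normal('gdp_obs', mu=mu, sigma=sigma, observed=gdp)

    trace = pm.sample(1000)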

The next question is what you’re actually trying to achieve, which seems to be variable selection. You have k=100, and you want to pick out a subset of only the most important variables. You could use PCA to get the first few components, which would be a big mix of everything. Or, you could use skeptical priors, such as Laplace or “spike and slab”, to bias the values of \beta_i towards zero, removing them from \mu_t.

I made a small gist here with a couple examples that might be useful to you, but I strongly recommend you go over the Statistical Rethinking course to build an understanding of what goes on in these models.

1 Like

Thanks very much. I’ll do my best to absorb as much as I can this weekend and let you know where I land.