Is there documentation or examples of GLM arguments?


#1

I would like to use the GLM, but I’m finding it hard because I don’t see any documentation about what the arguments mean. IIUC we can use a formula with from_formula to get a GLM where the system allocates normals for the components, and … something (?) for the weights.
The function also offers priors, vars, family, and name arguments, but I don’t see any documentation that explains what these are.
family seems to be just the family of random variables used for the terms on the RHS of the distribution.
But what are the priors, and what kind of arguments go there? Do we pass in Variable objects (like a Normal)? Or are those priors for the weights? Similarly, what are the vars? Random variables corresponding to the variables appearing in the formula? If so, are they tied in by name?
There’s a “data” argument, as well. Is this the same as the observed argument in other parts of pymc3?
Sorry if I’m being dense, or have simply overlooked something.
Also, I will see if I can translate any responses into patches for the documentation.
Thanks!!


#2

I also sometimes feel that PyMC3 does not have sufficient documentation for certain features, yet there are many great examples and articles by the developers that explain most of the features.

If you want to understand how GLM works, please go through these articles:

  1. https://docs.pymc.io/notebooks/GLM-robust.html
  2. http://docs.pymc.io/notebooks/GLM-logistic.html

Article 1 mentions that:

PyMC3’s glm() function allows you to pass in a family object that contains information about the likelihood.

Therefore, by choosing the likelihood you decide the type of regression that you want to perform (e.g. linear regression: normal, logistic regression: binomial, etc.).
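To make the mapping from family to regression type concrete, here is a minimal sketch in plain Python (it does not call PyMC3). It assumes the usual pairing of a Normal likelihood with the identity link and a Binomial likelihood with the logit link. The function names and the example weights are made up for illustration.

```python
import math

def identity(eta):
    """Inverse link for a Normal likelihood: the mean is the linear predictor."""
    return eta

def inverse_logit(eta):
    """Inverse link for a Binomial likelihood: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(weights, intercept, x):
    """Weighted sum of the predictors plus an intercept."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))

# Hypothetical weights and one row of predictors, for illustration only.
weights, intercept, x = [0.5, -1.0], 0.25, [2.0, 1.0]
eta = linear_predictor(weights, intercept, x)

mean_linear = identity(eta)          # linear regression: expected value of y
prob_logistic = inverse_logit(eta)   # logistic regression: P(y = 1)
```

The linear predictor is the same in both cases; only the way it is turned into a parameter of the likelihood differs.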

The data argument is not simply the observed values in this case. The articles explain that data is a data structure similar to a pandas DataFrame containing all the data required to train the model. The column headers are important here because the variable names in the formula refer to them, and the formula defines the relationship between the observed variable and the predictor variables (in linear regression, the x values), as shown below (from article 2):

pm.glm.GLM.from_formula('income ~ age + age2 + educ + hours', data, family=pm.glm.families.Binomial())

Here income is the observed variable and the others are the predictor variables. The variables in this relationship are identified by the column headers. Since the family is Binomial(), this corresponds to logistic regression.
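For readers unfamiliar with the formula string, the following sketch shows roughly what it encodes, in plain Python without PyMC3 or a formula library. It only handles the simple `y ~ a + b` form. `split_formula` and `design_rows` are hypothetical helpers, and the data values are invented.

```python
def split_formula(formula):
    """Split 'y ~ a + b' into the response name and a list of predictor names."""
    response, rhs = formula.split("~")
    predictors = [term.strip() for term in rhs.split("+")]
    return response.strip(), predictors

def design_rows(data, predictors):
    """Build one row per observation: a leading 1 for the intercept, then the predictors."""
    n = len(next(iter(data.values())))
    return [[1.0] + [float(data[p][i]) for p in predictors] for i in range(n)]

# Invented data: a dict of columns keyed by header, standing in for a DataFrame.
data = {
    "income": [0, 1, 1],
    "age": [25, 40, 52],
    "educ": [10, 13, 16],
}

response, predictors = split_formula("income ~ age + educ")
y = data[response]                  # observed outcome column
X = design_rows(data, predictors)   # observed predictor columns plus intercept
```

Each column of X then gets one weight, and those weights are what the model infers.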

The priors argument defines the prior distributions. You can find most of this information in the docstrings of the glm module's source files in the git repo.


#3

Thank you very much, Nadheesh.

I did actually read those two web pages, but didn’t understand them fully. For example, they seem to be written for people who understand a notation used in … R? I think? That input language is never explained, and I went down a rabbit hole trying to figure out what it was.
Meanwhile, the documentation for PyMC3 describes a GLM that does not appear to use a formula and that has a Component item that is never explained.
One big question I have is whether the GLM is what one uses when there are hidden components (other than the weights), or if it’s only a tool for fitting weights. The discussion on those web pages, I believe, refers only to models where you have weighted sums where all the terms are observed, but the weights are not – or maybe there’s also an unobserved noise?
I think this is probably not for me, since I have a model where I am dealing with the sum of differently-distributed normals, where the components are not observed (but can be inferred because they appear in different sums).


#4

You are right… we sometimes take the knowledge of formula notation from R (or Matlab) for granted. I think officially it is referred to as Wilkinson notation.

In the GLM module, the input X is fully observed and noiseless. So you are correct that only the weights are latent variables. If you have other latent variables, using the native PyMC3 model language is probably easier.
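In symbols, one way to write this kind of model is sketched below. The notation is standard GLM notation rather than anything taken from the PyMC3 source: $g$ is the link function implied by the family, and a family may also carry its own parameters (for example, a noise scale for a Normal likelihood).

```latex
\begin{aligned}
\beta_j &\sim p(\beta_j) && \text{(priors on the weights, which are latent)} \\
\eta_i &= \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} && \text{(linear predictor; } x_{ij} \text{ observed without noise)} \\
y_i &\sim \mathrm{Family}\!\left(g^{-1}(\eta_i)\right) && \text{(likelihood chosen by the family argument)}
\end{aligned}
```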


#5

@rpgoldman I hope @junpenglao answered your questions.

I’m not that familiar with the underlying implementation, but from what I know GLM is for simple regression models (as @junpenglao explained, those where only the weights are latent variables). It seems that even hierarchical regression is not possible with GLM (since the hyper-priors are not the weights of some observed input).

Even though this article has the title “GLM: Hierarchical Linear Regression”, at the end it says:

Finally, readers of my blog will notice that we didn’t use glm() here as it does not play nice with hierarchical models yet.

This is just a suggestion: I think this article should have a small section or a few sentences explaining why it does not use GLM with hierarchical models even though the title says so. Otherwise, the title is somewhat misleading.

It is always good to discuss such limitations, because that can save users and developers the time they would otherwise spend trying to do something that is not supported at the moment.


#6

You are right. The GLM section should be renamed to “linear model” or something similar, as not all of the examples use the glm module.
In general, the glm module is not maintained really well - contributions are always welcome to make it better :wink:


#7

I would love to.

But I’m still trying to understand things and connect the dots. Once I’m confident enough, I can contribute. :slight_smile:


#8

Thank you very much for the clarification. I roughly inferred this from the examples (unfortunately, only after trying to use GLM inappropriately!), but it was never clearly stated. I think this page: http://docs.pymc.io/api/glm.html needs some attention. It could use a topic paragraph stating what this class is for, and defining some of the terms. The examples help a little, but they don’t define the terms, either.
A particular challenge is that the examples all use the R-style notation, but this documentation page never refers to it (there is no mention of the from_formula method, and I believe that trying to build a model from the two classes described here is now deprecated).
I hope that helps.