How to run logistic regression with weighted samples

Hi there,
I am fairly new to pymc3 and was using it for some basic regressions to get started. So far I have liked it a lot. My problem now is getting a logistic regression with weighted samples to run: in my case one of the two classes is heavily undersampled, and moreover some data points are more important to get right.

Without using weights my model looks like this:

    model = pm.Model()
    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = np.ones(len(x))
        labels = df["label"]

        b = pm.Normal("b", 0, sd=10)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)

        likelihood = pm.Bernoulli("y", mu, observed=df["label"])
        nsample = 500
        ntune = int(0.5 * nsample)
        trace = pm.sample(nsample, tune=ntune, cores=4)

This model runs fine and I get values in the expected range. The problem is now moving to the weighted regression. I read various posts about it (like this one, for example) and looked at the documentation for pm.Potential. My last try at the problem was:

    model = pm.Model()
    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = np.ones(len(x))
        labels = df["label"]
        b = pm.Normal("b", 0, sd=10)

        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)
        logp = pm.Bernoulli.dist(p=mu).logp(labels)
        error = pm.Potential("error", logp)
        nsample = 500
        ntune = int(0.5 * nsample)
        trace = pm.sample(nsample, tune=ntune, cores=4)

But this does not work either. I would be grateful for any help and recommendations.
Best regards,
Tim

Can you give us more details about what isn’t working, and/or some mock data which will reproduce the error?

I got it to run with

    m = pm.Normal("m", 0, sd=10)
    x = np.random.rand(200, 3)
    labels = np.random.randint(2, size=200)

but I had dimension mismatches if I tried to include a shape in m.
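
For what it's worth, with a shape on m the dimensions do line up on paper: x of shape (200, 8) times m of shape (8,) gives one score per row. A minimal numpy sketch (standing in for the Theano ops, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((200, 8))   # 200 observations, 8 inputs each
m = rng.normal(size=8)     # one coefficient per input, i.e. shape=8 on m
b = 0.5

score = x @ m + b                    # (200, 8) @ (8,) -> (200,): one score per label
mu = 1.0 / (1.0 + np.exp(-score))    # sigmoid keeps every value in (0, 1)
```

So a mismatch usually means one of the two arrays is not the shape you think it is, which is worth printing before building the model.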

Hi,
thanks for the quick reply. In my example there are 8 inputs for each label, so the shape parameter is in fact 8. So I guess simple mock data for x could be:

    x = np.random.rand(200, 8)

But how did you get the shape parameter to work with more than just one x for each label? Omitting it does not solve the problem in my case.
Here is also the error and traceback I get (with the original example):

Traceback (most recent call last):
  File "baysian_logistic_model.py", line 40, in <module>
    logp = pm.Bernoulli.dist(p=mu).logp(labels)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/discrete.py", line 369, in logp
    return bound(
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/dist_math.py", line 76, in bound
    return tt.switch(alltrue(conditions), logp, -np.inf)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/dist_math.py", line 82, in alltrue_elemwise
    ret = ret * (1 * c)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 505, in wrapper
    return _construct_result(left, result, index=left.index, name=res_name)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 478, in _construct_result
    out = left._constructor(result, index=index)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/series.py", line 279, in __init__
    data = com.maybe_iterable_to_list(data)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/common.py", line 280, in maybe_iterable_to_list
    return list(obj)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/theano/tensor/var.py", line 640, in __iter__
    for i in xrange(theano.tensor.basic.get_vector_length(self)):
  File "/anaconda3/envs/test/lib/python3.8/site-packages/theano/tensor/basic.py", line 4828, in get_vector_length
    raise ValueError("length not known: %s" % msg)
ValueError: length not known: Elemwise{mul,no_inplace} [id A] ''   
 |TensorConstant{(2000,) of 1} [id B]
 |Elemwise{mul,no_inplace} [id C] ''   
   |InplaceDimShuffle{x} [id D] ''   
   | |TensorConstant{1} [id E]
   |Elemwise{ge,no_inplace} [id F] ''   
     |sigmoid [id G] ''   
     | |Elemwise{add,no_inplace} [id H] ''   
     |   |dot [id I] ''   
     |   | |TensorConstant{[[2.197224..        ]]} [id J]
     |   | |m [id K]
     |   |InplaceDimShuffle{x} [id L] ''   
     |     |b [id M]
     |InplaceDimShuffle{x} [id N] ''   
       |TensorConstant{0} [id O]

Apologies, I must have got mixed up between a couple of attempts before, and copied parts of two (don’t you hate when you’d swear you’ve done the same thing twice, and suddenly it’s not working anymore?). This worked for me:

    with pm.Model() as model:
        m = pm.Normal("m", 0, sd=10, shape=8)
        x = np.random.rand(200, 8)
        w = np.ones(len(x))
        labels = np.random.randint(2, size=200)
        b = pm.Normal("b", 0, sd=10)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)
        logp = pm.Bernoulli.dist(p=mu).logp(labels)
        error = pm.Potential("error", logp)
        trace = pm.sample(500)

logp = pm.Bernoulli.dist(p=mu).logp(labels) seems to be the problem line in your error. Can you check the labels column for invalid values/NaNs/empty cells? Failing that, maybe look at the values in mu: you can add a deterministic node and view the output in the trace, like det = pm.Deterministic('det', mu) – remove the last few nodes so it will run, and make sure it's working as expected.
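
The label check can also be done in plain pandas before the model is built. A small sketch with a hypothetical stand-in for the df in the thread (the real column may differ):

```python
import pandas as pd

# Hypothetical stand-in for the thread's df
df = pd.DataFrame({"label": [0, 1, 1, 0, 1]})

has_nans = df["label"].isna().any()              # NaNs would poison the logp
bad_values = set(df["label"].unique()) - {0, 1}  # pm.Bernoulli expects 0/1 only

print(has_nans, bad_values)
```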


I’m not sure about the weighting aspect, but for the labels try

    labels = df["label"].values
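
This matters because a pandas Series overloads arithmetic and tries to rebuild a Series (with an index) from the Theano tensor, which is what the traceback above shows; .values hands over a plain numpy array instead. A tiny illustration with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": [0, 1, 1, 0]})

series_labels = df["label"]        # pandas Series: carries an index, overloads ops
array_labels = df["label"].values  # plain numpy ndarray: what Theano can digest

print(type(series_labels).__name__, type(array_labels).__name__)
```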


Thanks, this was it. I had it in previous (non-running) versions and lost it somewhere along the way.
Now it runs as it should.

Thanks a lot to both of you for your help.


Thanks for the solution, guys! Just want to point out that I’m not sure there is any weighting here, as w = np.ones(len(x)) is defined but never used in the model – am I missing something?


You are right, but from here on including the weights seemed straightforward to me. For completeness, and in case later users have the same question, here is the final model:

    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = df["weight"].to_numpy()
        labels = df["label"].to_numpy()

        b = pm.Normal("b", 0, sd=10)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)

        logp = w * pm.Bernoulli.dist(p=mu).logp(labels)
        error = pm.Potential("error", logp)

        nsample = 200
        ntune = int(1.0 * nsample)
        trace = pm.sample(nsample, tune=ntune, cores=4)

For dummy data just use

    x = np.random.rand(200, 8)
    w = np.random.rand(len(x))
    labels = np.random.randint(2, size=200)

as suggested above. I hope everything is a bit clearer now.
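
To see what the Potential is doing here: w * logp just scales each observation's log-likelihood, which (for integer w) is the same as replicating that observation w times. A numpy-only check of the identity, using the Bernoulli log-likelihood formula (no pymc3 needed):

```python
import numpy as np

def bernoulli_logp(mu, y):
    # Elementwise Bernoulli log-likelihood, the same math that
    # pm.Bernoulli.dist(p=mu).logp(y) computes symbolically
    return y * np.log(mu) + (1 - y) * np.log(1 - mu)

mu = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
w = np.array([2, 1, 3])  # integer weights for the comparison

# Weighted sum, as in the Potential above
weighted = np.sum(w * bernoulli_logp(mu, y))

# Replicate each observation w[i] times and sum the unweighted logp
mu_rep = np.repeat(mu, w)
y_rep = np.repeat(y, w)
replicated = np.sum(bernoulli_logp(mu_rep, y_rep))
```

The two totals agree, so the model behaves as if the important points appeared multiple times in the data.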


Perfect, thanks for including the full solution – you read my mind :wink:

How do you use the posterior predictive distribution for new examples with this approach?


Hey everyone, is there any follow-up to this?

Would you use pm.sample_posterior_predictive for this? I wonder how to eliminate the weighting for new examples.
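
One reading, for what it's worth: the weights enter only through the Potential, so they shape the posterior over m and b but play no role at prediction time. That suggests computing the predictive distribution for new x by hand from the posterior draws. A numpy sketch, with made-up arrays standing in for trace["m"] and trace["b"]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the posterior samples:
# trace["m"] would be (n_draws, 8), trace["b"] would be (n_draws,)
m_draws = rng.normal(size=(500, 8))
b_draws = rng.normal(size=500)

x_new = rng.random((10, 8))  # 10 new examples, 8 inputs each

# Probability of label 1 for each example under each posterior draw
score = x_new @ m_draws.T + b_draws        # shape (10, 500)
p = 1.0 / (1.0 + np.exp(-score))

# Sample Bernoulli labels per draw, and summarize per example
y_new = rng.random(p.shape) < p
p_mean = p.mean(axis=1)                    # posterior mean probability per example
```

This is only a sketch of the idea, not confirmed against pm.sample_posterior_predictive; whether the built-in helper silently carries the Potential along would need checking.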