How to run logistic regression with weighted samples

Hi there,
I am fairly new to pymc3 and was using it for some basic regressions to get started. So far I like it a lot. My problem now is getting a logistic regression with weighted samples to run: one of my two classes is heavily undersampled, and moreover some data points are more important to get right than others.

Without using weights my model looks like this:

    model = pm.Model()
    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = np.ones(len(x))
        labels = df["label"]

        b = pm.Normal("b", 0, sd=10)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)

        likelihood = pm.Bernoulli("y", mu, observed=labels)
        nsample = 500
        ntune = int(0.5 * nsample)
        trace = pm.sample(nsample, tune=ntune, cores=4)

This model runs fine and I get values in the expected range. The problem is moving to the weighted regression. I read various posts about it and looked at the documentation for pm.Potential. My last try at the problem was:

    model = pm.Model()
    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = np.ones(len(x))
        labels = df["label"]
        b = pm.Normal("b", 0, sd=10)

        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)
        # log-likelihood added via a Potential instead of an observed RV
        logp = pm.Bernoulli.dist(p=mu).logp(labels)
        error = pm.Potential("error", logp)
        nsample = 500
        ntune = int(0.5 * nsample)
        trace = pm.sample(nsample, tune=ntune, cores=4)

But this does not work either. Any help or recommendations would be appreciated.
Best regards,
Tim

Can you give us more details about what isn’t working, and/or some mock data which will reproduce the error?

I got it to run with

m = pm.Normal("m", 0, sd=10)
x = np.random.rand(200, 3)
labels = np.random.randint(2, size=200)

but I had dimension mismatches if I tried to include a shape in m.

Hi,
thanks for the quick reply. In my example there are 8 inputs for each label, so the shape parameter is in fact 8. So I guess simple mock data for x could be:

x = np.random.rand(200, 8)

But how did you manage to use the shape parameter with more than just one x for each label? Omitting it does not solve the problem in my case.
Here is also the error and traceback I get (with the original example):

Traceback (most recent call last):
  File "baysian_logistic_model.py", line 40, in <module>
    logp = pm.Bernoulli.dist(p=mu).logp(labels)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/discrete.py", line 369, in logp
    return bound(
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/dist_math.py", line 76, in bound
    return tt.switch(alltrue(conditions), logp, -np.inf)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pymc3/distributions/dist_math.py", line 82, in alltrue_elemwise
    ret = ret * (1 * c)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 505, in wrapper
    return _construct_result(left, result, index=left.index, name=res_name)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 478, in _construct_result
    out = left._constructor(result, index=index)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/series.py", line 279, in __init__
    data = com.maybe_iterable_to_list(data)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/pandas/core/common.py", line 280, in maybe_iterable_to_list
    return list(obj)
  File "/anaconda3/envs/test/lib/python3.8/site-packages/theano/tensor/var.py", line 640, in __iter__
    for i in xrange(theano.tensor.basic.get_vector_length(self)):
  File "/anaconda3/envs/test/lib/python3.8/site-packages/theano/tensor/basic.py", line 4828, in get_vector_length
    raise ValueError("length not known: %s" % msg)
ValueError: length not known: Elemwise{mul,no_inplace} [id A] ''   
 |TensorConstant{(2000,) of 1} [id B]
 |Elemwise{mul,no_inplace} [id C] ''   
   |InplaceDimShuffle{x} [id D] ''   
   | |TensorConstant{1} [id E]
   |Elemwise{ge,no_inplace} [id F] ''   
     |sigmoid [id G] ''   
     | |Elemwise{add,no_inplace} [id H] ''   
     |   |dot [id I] ''   
     |   | |TensorConstant{[[2.197224..        ]]} [id J]
     |   | |m [id K]
     |   |InplaceDimShuffle{x} [id L] ''   
     |     |b [id M]
     |InplaceDimShuffle{x} [id N] ''   
       |TensorConstant{0} [id O]

Apologies, I must have got mixed up between a couple of attempts before, and copied parts of two (don’t you hate when you’d swear you’ve done the same thing twice, and suddenly it’s not working anymore?). This worked for me:

with pm.Model() as model:
    m = pm.Normal("m", 0, sd=10, shape=8)
    x = np.random.rand(200, 8)
    w = np.ones(len(x))
    labels = np.random.randint(2, size=200)
    b = pm.Normal("b", 0, sd=10)
    score = pm.math.dot(x, m) + b
    mu = pm.math.sigmoid(score)
    logp = pm.Bernoulli.dist(p=mu).logp(labels)
    error = pm.Potential("error", logp)
    trace = pm.sample(500)

logp = pm.Bernoulli.dist(p=mu).logp(labels) seems to be the problem line in your error. Can you check the labels column for invalid values/NaNs/empty cells? Failing that, maybe look at the values in mu: you can add a deterministic node and view the output in the trace, like det = pm.Deterministic('det', mu). Remove the last few nodes so the model will run, and make sure mu behaves as expected.
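For example, a minimal sketch of that check with the mock data from above (hypothetical names, assuming the usual imports):

    import numpy as np
    import pymc3 as pm

    with pm.Model() as debug_model:
        m = pm.Normal("m", 0, sd=10, shape=8)
        b = pm.Normal("b", 0, sd=10)
        x = np.random.rand(200, 8)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)
        det = pm.Deterministic("det", mu)  # exposes mu in the trace
        trace = pm.sample(100)

    # one row per draw; every value should lie strictly between 0 and 1
    print(trace["det"].shape)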

I’m not sure on the weighting aspect, but for labels try

labels = df["label"].values

Passing a pandas Series into the logp is what sends pandas down the Series-construction path in your traceback; a plain NumPy array avoids it.

Thanks, this was it. I had it in previous (non-running) versions and lost it somewhere along the way.
Now it runs as it should.

Thanks a lot to both of you for your help.

Thanks for the solution, guys! Just wanna point out that I’m not sure there is any weighting here, as w = np.ones(len(x)) is defined but never used in the model – am I missing something?

You are right, but from here on, including the weights seemed straightforward to me. For completeness, and in case later users have the same question, here is the final model:

    model = pm.Model()
    with model:
        m = pm.Normal("m", 0, sd=10, shape=len(cols))
        x = df[cols].to_numpy()
        w = df["weight"].to_numpy()
        labels = df["label"].to_numpy()

        b = pm.Normal("b", 0, sd=10)
        score = pm.math.dot(x, m) + b
        mu = pm.math.sigmoid(score)

        # per-observation log-likelihood, scaled by the sample weights
        logp = w * pm.Bernoulli.dist(p=mu).logp(labels)
        error = pm.Potential("error", logp)

        nsample = 200
        ntune = nsample
        trace = pm.sample(nsample, tune=ntune, cores=4)

For dummy data just use

x = np.random.rand(200, 8)
w = np.random.rand(len(x))
labels = np.random.randint(2, size=200)

as suggested above. I hope everything is a bit clearer now.
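
In other words, the Potential adds the weighted Bernoulli log-likelihood to the model's log-probability, so each observation's contribution is simply scaled by its weight:

$$
\log p_w(\mathbf{y} \mid m, b) = \sum_{i=1}^{n} w_i \left[\, y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \,\right],
\qquad \mu_i = \operatorname{sigmoid}(x_i^\top m + b)
$$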

Perfect, thanks for including the full solution – you read my mind :wink:

How do you use the posterior predictive distribution for new examples with this approach?

Hey everyone, is there any follow-up to this?

Would you use pm.sample_posterior_predictive for this? I wonder how to eliminate the weighting for new examples.
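
Since the weights enter the model only through pm.Potential, they shape the posterior but play no role in prediction, so there is nothing to eliminate for new examples. Note also that pm.sample_posterior_predictive has no observed variable to sample here (the likelihood lives in the Potential), so one option is to push the posterior draws of m and b through the sigmoid yourself. A minimal sketch, with a hypothetical X_new and assuming trace comes from the final model above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X_new = np.random.rand(10, 8)  # hypothetical new examples
    m_draws = trace["m"]           # shape (n_draws, 8)
    b_draws = trace["b"]           # shape (n_draws,)

    # probability of label 1 for each posterior draw and new point: (n_draws, 10)
    p = sigmoid(X_new @ m_draws.T + b_draws).T
    p_mean = p.mean(axis=0)           # posterior mean of P(label = 1)
    y_new = np.random.binomial(1, p)  # posterior predictive label draws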