Continuing the discussion from How to run logistic regression with weighted samples:
After the problem in the linked thread was solved, several further questions came up. I think these questions should be addressed on their own.
As I understand it, generating synthetic data is not possible with the proposed logistic regression model, because it only models the label given the features and not the features themselves. If synthetic data is desired, I would rather use a generative model in the spirit of linear discriminant analysis. Here is an example I have come up with on my own so far. I used the well-known iris dataset so that the example is easy to reproduce:
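import arviz as az
import pandas as pd
import pymc as pm  # assumption: PyMC v4+; older installs use `import pymc3 as pm`
import seaborn as sns

# Keep only two species and encode them as integer labels (0 = setosa, 1 = versicolor)
df = (
    sns.load_dataset("iris")
    [lambda df: df.species.isin(("setosa", "versicolor"))]
    .assign(
        label=lambda df: pd.Categorical(df.species).codes
    )
)
input_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
seed = 42

with pm.Model() as model:
    # One scale per feature, shared by both classes
    sigma = pm.HalfNormal("sigma", sigma=10, shape=len(input_cols))
    # One mean vector per class: row 0 for setosa, row 1 for versicolor
    mu = pm.Normal("mu", mu=0, sigma=10, shape=(2, len(input_cols)))
    setosa = pm.Normal(
        "setosa",
        mu=mu[0],
        sigma=sigma,
        observed=df[df.species == "setosa"][input_cols].to_numpy(),
    )
    versicolor = pm.Normal(
        "versicolor",
        mu=mu[1],
        sigma=sigma,
        observed=df[df.species == "versicolor"][input_cols].to_numpy(),
    )
    trace = pm.sample(1000, random_seed=seed)

summary = az.summary(trace)
print(summary.to_markdown())  # to_markdown() needs the optional `tabulate` package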
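To show what I mean by generating synthetic samples from this model, here is a minimal NumPy-only sketch. It assumes the posterior draws of `mu` and `sigma` have already been flattened into arrays of shape `(n_draws, n_features)`. The helper `draw_synthetic` and the placeholder arrays are names and values I made up for illustration; they are not output of the model above.

```python
import numpy as np


def draw_synthetic(mu_draws, sigma_draws, n_samples, rng):
    """Draw synthetic feature rows for one class (hypothetical helper).

    mu_draws:    (n_draws, n_features) posterior draws of the class mean
    sigma_draws: (n_draws, n_features) posterior draws of the shared scale
    """
    # Pick one posterior draw per synthetic row so that parameter
    # uncertainty carries over into the synthetic data.
    idx = rng.integers(0, mu_draws.shape[0], size=n_samples)
    # Independent (decoupled) Gaussians per feature, as in the model above.
    return rng.normal(loc=mu_draws[idx], scale=sigma_draws[idx])


rng = np.random.default_rng(42)

# Placeholder arrays standing in for real posterior draws (made-up values).
# With an ArviZ InferenceData one would presumably build them from
# trace.posterior["mu"] and trace.posterior["sigma"] instead.
n_draws, n_features = 500, 4
mu_setosa_draws = rng.normal(loc=1.0, scale=0.05, size=(n_draws, n_features))
sigma_draws = np.abs(rng.normal(loc=0.3, scale=0.02, size=(n_draws, n_features)))

synthetic_setosa = draw_synthetic(mu_setosa_draws, sigma_draws, n_samples=10, rng=rng)
print(synthetic_setosa.shape)
```

PyMC also provides `pm.sample_posterior_predictive`, which I assume would be the more direct route, but I have not checked how it behaves here with two separate observed variables.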
So here are my remaining questions:
- Is this the right approach for generating synthetic samples?
- In the model here, the Gaussians for the different features are decoupled. What would a model using a multivariate (coupled) Gaussian look like?
- Is it possible to include weights for the samples here as well?
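
To make the second and third questions more concrete, here is how I would write the models down. The notation is my own and introduced only for this post: $\mathbf{x}_n$ is the feature vector of sample $n$, $y_n = k$ its class, $d$ indexes the features, and $w_n$ is a hypothetical per-sample weight. The weighted form is only one possible reading of "weights" and may not be the right one.

```latex
% Decoupled model used in the code above: one scale per feature, shared by both classes
x_{n,d} \mid y_n = k \;\sim\; \mathcal{N}\!\left(\mu_{k,d},\, \sigma_d^2\right)

% Coupled variant asked about in the second question: a full covariance matrix shared by both classes
\mathbf{x}_n \mid y_n = k \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_k,\, \Sigma\right)

% One possible reading of the third question: scale each sample's log-likelihood by its weight
\log L_w = \sum_n w_n \, \log p\!\left(\mathbf{x}_n \mid \boldsymbol{\mu}_{y_n},\, \Sigma\right)
```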