Classification with weighted samples: LDA and synthetic data

Continuing the discussion from How to run logistic regression with weighted samples:

After the problem in the linked thread was solved, several further questions came up. I think these questions should be addressed on their own.

As I understand it, generating synthetic data is not possible with the proposed logistic regression model, because it only models the label given the features and not the features themselves. If synthetic data is desired, I would rather use a generative model based on linear discriminant analysis. Here is an example I have come up with on my own so far. I used the well-known iris dataset so that the example is easy to reproduce:

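import arviz as az
import pandas as pd
import pymc as pm  # assuming PyMC v4+; on older installs this would be `import pymc3 as pm`
import seaborn as sns

# Keep two of the three iris species and encode them as integer labels.
df = (
    sns.load_dataset("iris")
    [lambda df: df.species.isin(("setosa", "versicolor"))]
    .assign(
        label = lambda df: pd.Categorical(df.species).codes
    )
)
input_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
seed = 42

with pm.Model() as model:
    # One standard deviation per feature, shared by both classes.
    sigma = pm.HalfNormal("sigma", sigma=10, shape=len(input_cols))
    # One mean vector per class: row 0 is setosa, row 1 is versicolor.
    mu = pm.Normal("mu", mu=0, sigma=10, shape=(2, len(input_cols)))

    # Independent (diagonal) Gaussian likelihood for each class.
    setosa = pm.Normal(
        "setosa",
        mu=mu[0],
        sigma=sigma,
        observed=df[df.species == "setosa"][input_cols].to_numpy(),
    )

    versicolor = pm.Normal(
        "versicolor",
        mu=mu[1],
        sigma=sigma,
        observed=df[df.species == "versicolor"][input_cols].to_numpy(),
    )
    trace = pm.sample(1000, random_seed=seed)

summary = az.summary(trace)
print(summary.to_markdown())  # to_markdown() needs the optional `tabulate` package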
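
If the model above is on the right track, I imagine synthetic samples would come from the posterior predictive distribution: pick a posterior draw of mu and sigma, then draw the features from the Gaussian of the chosen class. Below is a minimal NumPy sketch of that step. The function name `draw_synthetic` and the array layout (chains and draws flattened into one axis) are my own assumptions, and the arrays at the bottom are placeholders rather than real posterior output.

```python
import numpy as np


def draw_synthetic(mu_draws, sigma_draws, label, n_samples, rng):
    """Draw synthetic feature rows for one class from posterior draws.

    mu_draws:    (n_draws, n_classes, n_features) draws of the class means
    sigma_draws: (n_draws, n_features) draws of the shared standard deviations
    """
    # Pick one posterior draw per synthetic row so parameter uncertainty carries over.
    idx = rng.integers(0, mu_draws.shape[0], size=n_samples)
    return rng.normal(loc=mu_draws[idx, label], scale=sigma_draws[idx])


# Placeholder arrays standing in for flattened posterior draws (not real results).
rng = np.random.default_rng(42)
mu_draws = rng.normal(size=(200, 2, 4))
sigma_draws = np.abs(rng.normal(size=(200, 4))) + 0.1

synthetic = draw_synthetic(mu_draws, sigma_draws, label=1, n_samples=50, rng=rng)
print(synthetic.shape)  # (50, 4)
```

With the trace from the model, I assume the two arrays could be taken from `trace.posterior["mu"]` and `trace.posterior["sigma"]` after flattening the chain and draw dimensions. PyMC's `pm.sample_posterior_predictive` may well do the same thing for the observed nodes directly.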

So here are my remaining questions:

  1. Is this the right approach for generating synthetic samples?
  2. In the model above, the Gaussians for the different features are decoupled. What would a model using a multivariate (coupled) Gaussian look like?
  3. Is it possible to include sample weights here as well?
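
To make questions 2 and 3 more concrete, here is a minimal NumPy sketch of what I mean: a multivariate Gaussian with one full covariance matrix shared by both classes (the LDA assumption), where each sample's log-likelihood term is multiplied by its weight. The helper names `mvn_logpdf` and `weighted_loglik` and the toy numbers are hypothetical. In PyMC I would guess this corresponds to `pm.MvNormal` for the coupled part and a `pm.Potential` term for the weights, but I have not verified that.

```python
import numpy as np


def mvn_logpdf(x, mu, cov):
    """Row-wise log density of N(mu, cov) for x of shape (n, d)."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)            # cov = chol @ chol.T
    z = np.linalg.solve(chol, (x - mu).T)     # whitened residuals, shape (d, n)
    maha = np.sum(z**2, axis=0)               # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)


def weighted_loglik(x, labels, weights, mus, cov):
    """Sum of w_i * log N(x_i | mu_{y_i}, cov), with cov shared by all classes."""
    total = 0.0
    for k, mu in enumerate(mus):
        mask = labels == k
        total += np.sum(weights[mask] * mvn_logpdf(x[mask], mu, cov))
    return total


# Toy data: two features, two classes, made-up values for illustration only.
x = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
labels = np.array([0, 0, 1, 1])
mus = np.array([[0.5, 0.5], [3.5, 3.5]])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])      # off-diagonal term couples the features

unit = weighted_loglik(x, labels, np.ones(4), mus, cov)
doubled = weighted_loglik(x, labels, np.full(4, 2.0), mus, cov)
print(unit, doubled)
```

In this formulation an integer weight of 2 has the same effect on the likelihood as including the sample twice, which is the behaviour I would expect from sample weights.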