How to choose a model when working with small data

Hello everyone.

I’m a junior data analyst using Bayesian modeling.
I have a question about how to choose a model and how to build a more accurate one.

Right now, I’m trying to use a Gaussian mixture model to derive the posterior distribution.
Below is the status of the problem I’m facing.

Status

  • I have 365 data points per explanatory variable.
  • The distribution of the target variable has two peaks, so I set up two prior distributions (one per mixture component) for each variable (a quick check of this is sketched right after this list).
  • The posterior distribution of one ‘beta’ component ranges from about -20 to 20, with a mean around zero.
  • The other component’s posterior is also concentrated around zero.
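
As a quick sanity check on the two-peak observation, here is a minimal sketch of how one might confirm the bimodality of the log-transformed target before committing to a two-component mixture. It reuses the PRTIMES['pb_rounded'] column from the model code below; the use of matplotlib is my own assumption, not part of the original code.

import numpy as np
import matplotlib.pyplot as plt

# Histogram of the log-transformed target; two clear modes would
# support modeling it with a two-component mixture.
log_pb = np.log(PRTIMES['pb_rounded'])
plt.hist(log_pb, bins=50)
plt.xlabel('log(pb_rounded)')
plt.ylabel('count')
plt.show()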

For this problem, I’ve tried a Gaussian mixture regression model with PyMC3.
The code is below.

import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # Two-component mixture: weights, intercepts, slopes, and noise scales
    weight = pm.Dirichlet('weight', a=np.array([1, 1]), shape=(1, 2))
    alpha = pm.Normal('alpha', mu=0, sigma=10, shape=(1, 2))
    beta1 = pm.Normal('beta1', mu=0, sigma=10, shape=(1, 2))
    sigma = pm.HalfNormal('sigma', sigma=10, shape=2)

    # Component means: one regression line per mixture component
    mu = alpha + beta1 * X_1['last_year_pb_mean'].values[:, None]

    pb_obs = pm.NormalMixture('pb_obs', w=weight, mu=mu, sigma=sigma,
                              observed=np.log(PRTIMES['pb_rounded']))
    trace = pm.sample(2000, tune=1000, target_accept=0.99, cores=1)

    # Posterior predictive samples for comparison with the observed data
    y_1 = pm.sample_posterior_predictive(trace, samples=1000, model=model)
    y_pred_1 = y_1['pb_obs']

pm.plot_trace(trace, compact=True)
pm.plot_posterior(trace, var_names=['beta1'], hdi_prob=0.95)

But the result is very poor: the posterior is far from the actual distribution (figure2; the left is the posterior and the right is the actual).
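
For reference, this is one way the posterior predictive could be overlaid on the observed data with ArviZ; it is only a sketch and assumes the trace, y_1, and model objects from the code above.

import arviz as az

# Bundle the trace and posterior predictive draws into an InferenceData object
idata = az.from_pymc3(trace=trace, posterior_predictive=y_1, model=model)

# Overlay posterior predictive draws on the observed log(pb_rounded) values
az.plot_ppc(idata, num_pp_samples=100)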

I keep searching for information, but I haven’t been able to find a solution.

How can this problem be solved?

Here are the results and figures. The WAIC of this model is 2.529136938170937.

              mean     sd   hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
alpha[0, 0]   0.262  9.994  -18.358   18.348      0.174    0.158    3313.0    2881.0    1.0
alpha[0, 1]   3.329  0.009    3.312    3.345      0.000    0.000    2765.0    2383.0    1.0
beta1[0, 0]  -0.139  9.907  -17.974   18.936      0.190    0.161    2739.0    2448.0    1.0
beta1[0, 1]   0.058  0.009    0.040    0.073      0.000    0.000    3536.0    2517.0    1.0
weight[0, 0]  0.003  0.003    0.000    0.008      0.000    0.000    2776.0    1736.0    1.0
weight[0, 1]  0.997  0.003    0.992    1.000      0.000    0.000    2776.0    1736.0    1.0
sigma[0]      7.891  6.236    0.006   19.008      0.112    0.079    1480.0     654.0    1.0
sigma[1]      0.171  0.006    0.159    0.183      0.000    0.000    3503.0    2466.0    1.0

figure0

figure1

figure2

Welcome!

Do you have any simulated data to work with? It is often suggested to develop models (at least initially) using data where you have full control (or knowledge) over the underlying generative process.
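
For example, here is a minimal sketch of what that could look like for this kind of model: simulate data from a two-component mixture regression with known parameters, fit the same PyMC3 model to it, and check whether the known values are recovered. All names and parameter values below (x, y, true_w, true_alpha, true_beta, true_sigma) are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
n = 365

# Known "true" parameters (arbitrary illustrative values)
true_w = np.array([0.3, 0.7])          # mixture weights
true_alpha = np.array([0.0, 3.3])      # intercepts per component
true_beta = np.array([-0.1, 0.05])     # slopes per component
true_sigma = np.array([0.5, 0.2])      # noise scales per component

x = rng.normal(0, 1, size=n)
z = rng.choice(2, size=n, p=true_w)    # latent component assignments
y = true_alpha[z] + true_beta[z] * x + rng.normal(0, true_sigma[z])

# Fit the mixture-regression model to (x, y) and check whether the
# posterior recovers true_alpha, true_beta, true_w, and true_sigma.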

1 Like

Thanks for answering!

No, I don’t have simulated data.

That sounds like a valuable suggestion to me.
Thank you for the advice.

I’ll try to create some simulated data.
Then, if I run into trouble, may I ask you another question?

If you don’t have the true, underlying model, then I’m not sure what to make of the statement that the posterior is far from the actual model.

Of course!

1 Like