Gaussian Mixture + Linear Regression for non-linear data

Let me preface this by saying that I am new to Python, pymc3, and statistics, so I am likely to have made some basic mistakes.

I am currently trying to model some data produced by the following data generating process:

time = 0
base_mu = 100 + 100 * param_2
base_sig = 10 + 10 * param_2
sub_mu = 20 * (1 + param_2) * (1 + param_1)
sub_sig = 2 * (1 + param_2) * (1 + param_1)
lam = 1 / (1 + param_1)

# one "base" event
base_time = pm.Normal.dist(base_mu, base_sig).random(size=1)
time += base_time

# plus a Poisson-distributed number of "sub" events
n = int(pm.Poisson.dist(lam).random())
for i in range(n):
    time += pm.Normal.dist(sub_mu, sub_sig).random(size=1)
return time

The particular data generating process is not especially important, but I need to be able to handle something at least this messy. The process is effectively a Gaussian mixture, except that the mixture weights depend on the parameters, and the component distributions' parameters can be non-linear in the observed parameters.
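
To spell out why I call this "effectively a Gaussian mixture": conditional on the Poisson count being k, time is Normal with mean base_mu + k * sub_mu and variance base_sig**2 + k * sub_sig**2, so the true marginal of time given (param_1, param_2) is a Poisson-weighted mixture of Normals. A small sketch of that density (the helper name and the truncation at k_max are just for illustration):

import numpy as np
from scipy import stats

def true_mixture_density(t, param_1, param_2, k_max=10):
    # Exact marginal density of `time` given (param_1, param_2):
    # a Poisson-weighted mixture of Normals, truncated at k_max terms.
    base_mu, base_sig = 100 * (1 + param_2), 10 * (1 + param_2)
    sub_mu = 20 * (1 + param_2) * (1 + param_1)
    sub_sig = 2 * (1 + param_2) * (1 + param_1)
    lam = 1 / (1 + param_1)

    density = 0.0
    for k in range(k_max + 1):
        w_k = stats.poisson.pmf(k, lam)                # mixture weight
        mu_k = base_mu + k * sub_mu                    # component mean
        sig_k = np.sqrt(base_sig**2 + k * sub_sig**2)  # component std
        density += w_k * stats.norm.pdf(t, mu_k, sig_k)
    return density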

I’d like to estimate this with a linear regression / Gaussian mixture. First question: is this possible at all? I know that the non-linearity of the data generating process will prevent a linear regression from capturing the process exactly, but I am under the impression that Gaussian mixtures are robust enough to handle data they don’t literally describe. Is that true? (I want to avoid simply reproducing the data generating process, because I expect to eventually have real-world data whose relationship to the underlying parameters I don’t know.)

I attempted to estimate this model following the dependent density regression notebook at https://docs.pymc.io/notebooks/dependent_density_regression.html, but was ultimately unsuccessful. Here’s what I tried:

import numpy as np
import pymc3 as pm
from theano import shared, tensor as tt

def generate_params_from_env(param_1, param_2):
    base_mean = 100 * (1 + param_2)
    base_std = 10 * (1 + param_2)
    
    sub_mean = 20 * (1 + param_2) * (1 + param_1)
    sub_std = 2 * (1 + param_2) * (1 + param_1)
    
    lam = 1 / (1 + param_1)

    return [[base_mean, base_std], [sub_mean, sub_std], lam]

def random_draw(params):
    time = 0
    # base event
    base_time = pm.Normal.dist(params[0][0], params[0][1]).random(size=1)
    time += base_time

    # Poisson-distributed number of sub-events
    n = int(pm.Poisson.dist(params[2]).random())

    for i in range(n):
        time += pm.Normal.dist(params[1][0], params[1][1]).random(size=1)
    return time

def generate_env(n):
    ret = []
    for i in range(n):
        ret.append([np.random.random(), np.random.random()])
    return ret

def random_process(env, n):
    data = []
    for i in range(n):
        data.append(random_draw(generate_params_from_env(env[i][0], env[i][1])))
    return np.array(data)

#Helper, copied from https://docs.pymc.io/notebooks/dependent_density_regression.html
def std_norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))

#Helper, copied from https://docs.pymc.io/notebooks/dependent_density_regression.html
def stick_breaking(v):
    return v * tt.concatenate([tt.ones_like(v[:, :1]),
                               tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]],
                               axis=1)

#copied from https://docs.pymc.io/notebooks/dependent_density_regression.html
def get_w(param_1, param_2, n_models):
    d = pm.Normal('d', mu=0, sd=5, shape=n_models)
    e = pm.Normal('e', mu=0, sd=5, shape=n_models)
    f = pm.Normal('f', mu=0, sd=5, shape=n_models)
    thing = d + e * param_1 + f * param_2
    v = std_norm_cdf(thing)
    return stick_breaking(v)

def construct_model(data, env, n_models):
    env_t = np.transpose(np.array(env))

    # regression coefficients for each mixture component's mean
    a = pm.Normal('a', mu=100, sd=100, shape=n_models)
    b = pm.Normal('b', mu=40, sd=100, shape=n_models)
    c = pm.Normal('c', mu=40, sd=100, shape=n_models)

    # per-component standard deviation
    g = pm.Normal('g', mu=40, sd=100, shape=n_models)

    param_1 = shared(env_t[0][:, np.newaxis], broadcastable=(False, True))
    param_2 = shared(env_t[1][:, np.newaxis], broadcastable=(False, True))

    mu = pm.Deterministic('mu', a + b * param_1 + c * param_2)
    w_ = get_w(param_1, param_2, n_models)
    w = pm.Deterministic("w", w_ / w_.sum(axis=1, keepdims=True))
    obs = pm.NormalMixture('obs', w, mu, sigma=g, observed=data)

if __name__ == "__main__":
    np.random.seed(123)

    trials = 1000
    n_models = 30
    env = generate_env(trials)
    data = random_process(env, trials)
    with pm.Model() as model:
        construct_model(data, env, n_models)
        trace = pm.sample(chains=1, cores=1, init='adapt_diag')
        print(pm.summary(trace))

This code compiles and runs. However, I run into two problems. The first is that Theano throws a number of alarming errors while sampling, along the lines of:

ERROR (theano.gof.opt): Optimization failure due to: local_grad_log_erfc_neg
ERROR (theano.gof.opt): node: Elemwise{true_div,no_inplace}(Elemwise{mul,no_inplace}.0, Elemwise{erfc,no_inplace}.0)
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "C:\Users\censored\AppData\Local\Continuum\anaconda3\envs\censored\lib\site-packages\theano\gof\opt.py", line 2034, in process_node
    replacements = lopt.transform(node)
  File "C:\Users\censoredAppData\Local\Continuum\anaconda3\envs\censored\lib\site-packages\theano\tensor\opt.py", line 6789, in local_grad_log_erfc_neg
    if not exp.owner.inputs[0].owner:
AttributeError: 'NoneType' object has no attribute 'owner'

Secondly, the model it produces is very poor: its output does not match the output of the process at all well. In the plots below, the left histogram depicts the true data generating process and the right histogram depicts the output of the model. The first plot shows the case where param_1 and param_2 are drawn anew for each data point; the second holds param_1 and param_2 fixed for a large number of samples.
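
(For reference, the model-side histograms can be produced with a posterior predictive check roughly like the sketch below; the samples=500 value and the plotting details are arbitrary, and this assumes pymc3 >= 3.8, where pm.sample_posterior_predictive replaced pm.sample_ppc.)

import matplotlib.pyplot as plt

# Sketch: draw posterior predictive samples from the fitted mixture and
# compare their histogram against the simulated data.
with model:
    ppc = pm.sample_posterior_predictive(trace, samples=500)

fig, (ax_true, ax_model) = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
ax_true.hist(data.ravel(), bins=50, density=True)
ax_true.set_title('true process')
ax_model.hist(ppc['obs'].ravel(), bins=50, density=True)
ax_model.set_title('model output')
plt.show()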

[Plot "overall-orig": true process (left) vs. model output (right), param_1 and param_2 drawn anew for each data point]

[Plot "conditional-orig": the same comparison with param_1 and param_2 held fixed]

What am I doing wrong? Is this a model error, a code error, or am I just trying to do completely the wrong thing?