Estimating Distribution of Transformed Variables

I have two random variables, X and Y, which are obtained through complex nonlinear transformations of random variables U and V, which follow a bivariate normal distribution. These transformations involve some parameters that need to be estimated. The challenge is that I only have a series of observed values for X and Y, and I’m uncertain about the distribution underlying (X, Y). However, I do have a clear expression for the transformation from (U, V) to (X, Y), which is nonlinear. Is it possible to use the PyMC package to address this scenario?

Any advice would be greatly appreciated.

Hi Karie!

You are definitely allowed to do non-linear, deterministic transformations on random variables, as long as the transformations can be written as pytensor operations. Pytensor is the computational backend for PyMC, and it has pretty much everything numpy has. As an example, here I estimate a parameterized eggholder function, which I think we can agree is non-linear:

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import pytensor.tensor as pt

rng = np.random.default_rng(sum(map(ord, 'eggholder')))

def eggholder(x, alpha, beta):
    x_shifted = x[1] + alpha
    return (-x_shifted * np.sin(np.sqrt(abs(x[0]/beta + x_shifted)))
            -x[0]      * np.sin(np.sqrt(abs(x[0]      - x_shifted))))

true_alpha = 47.0
true_beta = 2.0

xx, yy = np.meshgrid(*[np.linspace(-512, 512, 100)] * 2)
mu = eggholder([xx, yy], true_alpha, true_beta)
noise = rng.normal(scale=10, size=(100, 100))

data = mu + noise  # iid observation noise at every grid point

[image: plot of the simulated noisy eggholder surface]

Here’s the PyMC model. All I did was re-write the eggholder function using pt. instead of np., and put priors on alpha and beta:

def eggholder_pt(x, y, alpha, beta):
    y_shifted = y + alpha
    return (-y_shifted * pt.sin(pt.sqrt(pt.abs(x / beta + y_shifted)))
            -x         * pt.sin(pt.sqrt(pt.abs(x        - y_shifted))))

with pm.Model() as mod:
    x_data = pm.ConstantData('x_data', xx)
    y_data = pm.ConstantData('y_data', yy)
    alpha = pm.Normal('alpha', sigma=10)
    beta = pm.Normal('beta', sigma=10)
    mu = pm.Deterministic('mu', eggholder_pt(x_data, y_data, alpha, beta))
    
    sigma = pm.Exponential('sigma', 0.1)
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    idata = pm.sample_smc(kernel=pm.smc.kernels.MH)

I had to try a couple of different sampling options to find something that worked; in general, just use pm.sample with the defaults. I include what I actually did only for completeness. The point is that, in the end, I recovered the true parameters through this non-linear mess.
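For reference, the default call would just be this (a minimal sketch, reusing the mod context from above):

with mod:
    idata = pm.sample()  # NUTS with default tuning and initialization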


Thanks for your reply.
But when I run this example, I can’t get a result; the sampler never makes progress:

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Initializing SMC sampler...
Sampling 6 chains in 6 jobs
 |--------------------| 0.00% [0/100 00:00<?]

Well, the specific example isn’t that important; I just wanted to show that PyMC can estimate parameters just fine after a highly non-linear transformation. As noted, you don’t typically want to use the SMC sampler.

The pytensor.tensor.blas warning means you’re going to have performance issues. Based on what you posted, the example is running fine, just slowly. How did you install PyMC? Did you follow the official installation guide? If you’re on a Mac with an ARM64 chip, did you install Accelerate BLAS?
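If you want to check what pytensor is linking against, the blas__ldflags config value is a quick diagnostic (an empty string means the slow NumPy C-API fallback that the warning is about):

import pytensor
print(pytensor.config.blas__ldflags)  # empty string -> NumPy C-API fallback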

I understand your point, and I greatly appreciate your response! However, upon further examination of your example, I noticed that in your case the variable “obs” follows a normal distribution. In contrast, in my situation I am not certain which distribution “obs” follows. To borrow from your example, my scenario would be framed as follows: given the function eggholder_pt(x, y, alpha, beta), where x and y follow a bivariate normal distribution, the transformation eggholder_pt yields mu. I have observed values for mu, but I don’t know which distribution mu follows. In such a scenario, how should I proceed?

    mu = pm.Normal('mu', 0, sigma=1)
    kappa = pm.Normal('kappa', 0, sigma=1)
    theta = pm.Normal('theta', 0, sigma=1)
    omega = pm.InverseGamma('omega', alpha=2, beta=0.005)

    std_dev = pm.Deterministic('std_dev', omega / 2)
    phi = pm.Normal('phi', 0, sigma=std_dev)

    V0 = pm.TruncatedNormal('V0', mu=0.0225, sigma=0.005, lower=0)

    cov_matrix = pm.Deterministic('cov_matrix', pm.math.stack([[1., phi], [phi, phi**2 + omega]]))
    epsilons = pm.MvNormal('epsilons', mu=np.zeros(2), cov=cov_matrix, shape=(T, 2))

    y = pm.Deterministic('y', heston_pt(epsilons, V0, kappa, theta, mu))
    # heston_pt involves a nonlinear transformation of epsilons, with some parameters to be estimated

    obs = pm.???('obs', y=y, observed=Y)
    # the distribution of y is unknown, but I can observe its values

Just because the likelihood function is normal doesn’t mean the posterior will be normal. Actually, if that were true, we would never have to do MCMC! A normal likelihood is more a statement about your prior knowledge of how the data are generated – something like, “it should be clustered around the output of heston_pt”.

So a normal likelihood is always a reasonable place to start, unless you know a priori that it is inappropriate (e.g. y is strictly positive).
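Concretely, a minimal sketch of what your pm.??? line could be inside your model block, with a normal likelihood centered on the transformed values (sigma_obs and its prior are placeholders I’m introducing, not part of your model):

    sigma_obs = pm.Exponential('sigma_obs', 1.0)  # prior on the observation noise scale
    obs = pm.Normal('obs', mu=y, sigma=sigma_obs, observed=Y)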


Thank you for your response, I understand now!

However, I encountered another issue when I was defining the ‘heston_pt’ function:

def heston_pt(epsilons, V0, kappa, theta, mu):
    V[0] = V0
    for i in range(T):
        V[i+1] = epsilons[i, 1] * pt.sqrt(V[i] * dt) + (1 - kappa * dt) * V[i] + kappa * theta * dt
    Y = epsilons[:, 0] * pt.sqrt(V[:-1] * dt) - 0.5 * V[:-1] * dt + mu * dt
    return Y

The code error is as follows:

    V[0] = V0

ValueError: setting an array element with a sequence.

It seems that I can’t assign V0 as the first element of V.

Once again, I sincerely appreciate your patient assistance.

You’re getting the error because V is a numpy array, but V0 (and all the results of your other operations) are pytensor symbolic tensors. Numpy doesn’t know what to do with those.
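A tiny reproduction of the mismatch (V0 here is just a symbolic stand-in for your random variable):

import numpy as np
import pytensor.tensor as pt

V = np.zeros(10)
V0 = pt.scalar('V0')
V[0] = V0  # raises "ValueError: setting an array element with a sequence."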

Since you’re doing a recursive computation, you’re going to need a special pytensor operation called scan, instead of a basic Python loop. Docs are here, and examples of building scan-based models are here and here.

Since your function is autoregressive, you also get the answer to your distribution question en passant: use a pm.CustomDist, with the dist argument set to a function that returns the result of your scan, like in the two linked examples. PyMC will automatically infer the logp of your graph and give you the correct distribution.
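Here’s a minimal sketch of that pattern with a toy AR(1)-style recursion (the ar1_dist name, the priors, and the placeholder data are all made up for illustration, loosely following the linked examples):

import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
from pymc.pytensorf import collect_default_updates

T = 100

def ar1_dist(rho, sigma, size):
    def step(prev_y, rho, sigma):
        # draw one innovation per step and advance the state
        new_y = rho * prev_y + pm.Normal.dist(0, sigma)
        return new_y, collect_default_updates(inputs=[prev_y, rho, sigma], outputs=[new_y])

    y, updates = pytensor.scan(fn=step,
                               outputs_info=[pt.zeros(())],  # initial state y_0 = 0
                               non_sequences=[rho, sigma],
                               n_steps=T)
    return y

with pm.Model():
    rho = pm.Normal('rho', 0, 0.5)
    sigma = pm.Exponential('sigma', 1.0)
    # PyMC infers the logp of the scan graph automatically
    y = pm.CustomDist('y', rho, sigma, dist=ar1_dist, observed=np.zeros(T))  # placeholder data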

Thank you very much for your explanation! My code finally runs successfully!

I have another question: the sampling process takes a very long time. I hand-wrote an MCMC sampling procedure in plain Python, and that self-written program only takes a few minutes to run; I expected the PyMC version to be much faster, not slower. Is there any way to significantly increase the sampling speed?

My code is as follows:

import numpy as np
import pandas as pd
import pymc as pm
from pymc.pytensorf import collect_default_updates
import xarray as xr
import arviz as az
import pytensor
import pytensor.tensor as pt
import multiprocessing


BURNIN = 5000
N = 15000
dt = 1/240


stock_price_data = pd.read_csv("convvalue_data//2023-05-12&110043.SH.csv")
Y = np.diff(np.log(stock_price_data["convvalue"]))
T = len(Y)


def heston_dist(epsilons, V0, kappa, theta, mu, size):
    
    def V_step(*args):
        epsilons, prev_V, kappa, theta = args
        new_V = epsilons[1] * pt.sqrt(prev_V * dt) + (1 - kappa * dt) * prev_V + kappa * theta * dt
        return new_V, collect_default_updates(inputs=args, outputs=[new_V])
    
    V, updates = pytensor.scan(fn=V_step, sequences=[epsilons], outputs_info=[V0], non_sequences=[kappa, theta])
    
    V_prev = pt.concatenate([pt.stack([V0]), V[:-1]])  # align each Y_t with V_{t-1}, starting from V0
    Y = epsilons[:, 0] * pt.sqrt(V_prev * dt) - 0.5 * V_prev * dt + mu * dt
    
    return Y


def calc_cov_matrix(phi, omega):
    cov_matrix = pm.math.stack([
        [1., phi],
        [phi, phi**2 + omega]
    ])
    return cov_matrix
    

with pm.Model() as mod:    
    # start
    phi_start = np.random.normal(0, 0.01)
    while True:
        V0_start = np.random.normal(0.0225, 0.005)
        if V0_start > 0:
            break
    start = {'mu': 0.1, 'kappa': 5, 'theta': 0.0225, 'phi': phi_start, 'omega': 0.02, 'V0': V0_start} 
    
    # prior distribution
    mu = pm.Normal('mu', 0, sigma=1)
    kappa = pm.Normal('kappa', 0, sigma=1)
    theta = pm.Normal('theta', 0, sigma=1)
    omega = pm.InverseGamma('omega', alpha=2, beta=0.005)
    
    std_dev = pm.Deterministic('std_dev', omega / 2)
    phi = pm.Normal('phi', 0, sigma=std_dev)
    
    V0 = pm.TruncatedNormal('V0', mu=0.0225, sigma=0.005, lower=0)
    
    cov_matrix = pm.Deterministic('cov_matrix', calc_cov_matrix(phi, omega))
    epsilons = pm.MvNormal('epsilons', mu=np.zeros(2), cov=cov_matrix, shape=(T, 2))
    
    y_obs = pm.MutableData('y_obs', Y)
    y = pm.CustomDist('obs', epsilons, V0, kappa, theta, mu, dist=heston_dist, observed=y_obs)
    
    idata = pm.sample_smc(draws=BURNIN+N, start=start)

Here is some information about my setup:

  • I’m using the Windows operating system.
  • I installed PyMC using the command pip install pymc
  • The version information is as follows:
pymc.__version__ = '5.6.1'
pytensor.__version__ = '2.12.3'

So that you can test it, here is the stock_price_data that I am using:

2023-05-12&110043.SH.csv (36.0 KB)

You have to compare effective sample size per unit of time to have a fair comparison. PyMC uses NUTS, which requires potentially expensive gradient evaluations (I assume your MCMC algorithm does not), but samples much more efficiently.
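For instance, a quick sketch using arviz (already imported in your script as az):

import arviz as az

print(az.summary(idata)[['ess_bulk', 'ess_tail']])  # effective sample sizes
# divide these by sampling wall time to compare samplers fairly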

A successful installation (without the warning you were seeing before) can also have a big impact on speed. In addition, note that with PyMC you can use JAX/Numba-based samplers that can be even faster (e.g., if you have a JAX + GPU combo).

Performance is also very model dependent. Usually the biggest source of slowdown is a badly specified model (redundant parameters, or bad priors). This can affect both sampling speed and effective sample size.


When I was running the code, an error occurred.
I think it has to do with setting the starting values. How do I address this problem?

Any advice would be greatly appreciated.

    point = Point({v.name: init_rnd[v.name][i] for v in self.variables}, model=self.model)

TypeError: 'float' object is not subscriptable

Instead of drawing T epsilons and then scanning over them, pass cov_matrix in as a non_sequence and draw an epsilon at each timestep with epsilon = pm.MvNormal.dist(mu=0, cov=cov_matrix).

Also I regret using sample_smc in my first example, sorry about that. It should not be your go-to. You should use pm.sample, which will use NUTS, which is better in 99.9999% of situations.

Also, I strongly recommend against estimating scan-based models without using the numba (nutpie) or JAX (numpyro, blackjax) samplers. I see order-of-magnitude speedups when I use these over the default.

Can you give me some examples of using the numba (nutpie) or JAX (numpyro, blackjax) samplers? Thanks very much!

The model in the 2nd example I linked (Bayesian Holt-Winters) uses numpyro.
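For reference, these samplers are selected through the nuts_sampler argument of pm.sample; a minimal sketch, assuming the corresponding package (nutpie, numpyro, or blackjax) is installed in your environment:

with mod:
    idata = pm.sample(nuts_sampler='numpyro')    # JAX-based
    # idata = pm.sample(nuts_sampler='blackjax') # JAX-based
    # idata = pm.sample(nuts_sampler='nutpie')   # numba-based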

Thank you for your advice.
But epsilons is still needed in the following line, which uses the value from every loop iteration; if I follow your approach, it seems this can’t be achieved:

Y_pred = epsilons[:, 0] * pt.sqrt(V[:-1] * dt) - 0.5 * V[:-1] * dt + mu * dt

Your scan can return both V and Y at each time-step. It’s strange to me to compute Y all in one go at the end instead of doing it step-by-step.

I apologize for the repeated interruptions, but I’m still a bit confused about the heston_dist function.

Following your advice, I made the following modifications, but there are still some issues.

def heston_dist(V0, cov_matrix, kappa, theta, mu, size):

    def step(*args):
        prev_V, cov_matrix, kappa, theta, mu = args

        epsilons = pm.MvNormal.dist(mu=0, cov=cov_matrix)

        new_V = epsilons[1] * pt.sqrt(prev_V * dt) + (1 - kappa * dt) * prev_V + kappa * theta * dt
        new_Y = epsilons[0] * pt.sqrt(prev_V * dt) - 0.5 * prev_V * dt + mu * dt
        return (new_V, new_Y), collect_default_updates(inputs=args, outputs=[new_V, new_Y])

    (V, Y), updates = pytensor.scan(fn=step, sequences=[None],
                                    outputs_info=[V0, None],
                                    non_sequences=[cov_matrix, kappa, theta, mu],
                                    n_steps=T)

    return Y

Traceback (most recent call last):
  File ".\3. 参数估计.py", line 98, in <module>
    main()
  File ".\3. 参数估计.py", line 87, in main
    y = pm.CustomDist('obs',
  File "D:\ProgramData\Anaconda\envs\pymc_env\lib\site-packages\pymc\distributions\distribution.py", line 958, in __new__
    return _CustomSymbolicDist(
  File "D:\ProgramData\Anaconda\envs\pymc_env\lib\site-packages\pymc\distributions\distribution.py", line 308, in __new__
    rv_out = cls.dist(*args, **kwargs)
  File "D:\ProgramData\Anaconda\envs\pymc_env\lib\site-packages\pymc\distributions\distribution.py", line 622, in dist
    return super().dist(
  File "D:\ProgramData\Anaconda\envs\pymc_env\lib\site-packages\pymc\distributions\distribution.py", line 385, in dist
    ndim_supp = cls.rv_op(*dist_params, **kwargs).owner.op.ndim_supp
  File "D:\ProgramData\Anaconda\envs\pymc_env\lib\site-packages\pymc\distributions\distribution.py", line 651, in rv_op
    dummy_rv = dist(*dummy_dist_params, dummy_size_param)
  File ".\3. 参数估计.py", line 48, in heston_dist
    (V, Y), updates = pytensor.scan(fn=step, sequences=[None],
  File "C:\Users\xiajy\AppData\Roaming\Python\Python38\site-packages\pytensor\scan\basic.py", line 594, in scan
    actual_slice = seq["input"][k - mintap_proxy]
TypeError: 'NoneType' object is not subscriptable

To make the purpose of this function clearer, let me restate it: my observations are of a stochastic process Y, which is generated from the stochastic process V and the epsilons, together with some other parameters to be estimated.

Don’t pass [None] to sequences; just leave out the sequences argument entirely.
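Concretely, a minimal sketch of what that change looks like, under the same assumptions as your code (dt and T defined at module level, collect_default_updates imported as before); I pass mu=pt.zeros(2) to MvNormal.dist just to make the dimension explicit:

def heston_dist(V0, cov_matrix, kappa, theta, mu, size):

    def step(prev_V, cov_matrix, kappa, theta, mu):
        # a fresh epsilon pair is drawn inside every step, so no sequence is needed
        epsilons = pm.MvNormal.dist(mu=pt.zeros(2), cov=cov_matrix)
        new_V = epsilons[1] * pt.sqrt(prev_V * dt) + (1 - kappa * dt) * prev_V + kappa * theta * dt
        new_Y = epsilons[0] * pt.sqrt(prev_V * dt) - 0.5 * prev_V * dt + mu * dt
        return (new_V, new_Y), collect_default_updates(inputs=[prev_V, cov_matrix, kappa, theta, mu],
                                                       outputs=[new_V, new_Y])

    (V, Y), updates = pytensor.scan(fn=step,
                                    outputs_info=[V0, None],
                                    non_sequences=[cov_matrix, kappa, theta, mu],
                                    n_steps=T)  # note: no sequences argument at all
    return Y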

When I run the code, I encounter the following error. But I don’t think I have missed any input.

pytensor.graph.utils.MissingInputError: OpFromGraph is missing an input: *1-<RandomGeneratorType>