How do I create a Non-Central Student's T distribution, and what priors should I use with it?

I have been working with the following link:

Fitting empirical distribution to theoretical ones with Scipy:
(https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python)

I ran my data through it and found that the best-fitting distribution is the Non-Central Student's T distribution. I couldn't find this distribution in PyMC3, so I looked at SciPy to understand how the distribution is formed and created a custom distribution. I have a few questions:

  • Is my approach to creating the distribution correct?

  • How do I use the custom distribution in a model?

  • Regarding the priors, do I follow the same steps as for a Normal likelihood (priors on mu and sigma), combined with HalfNormal priors for the degrees of freedom and the noncentrality parameter?

My custom distribution:

import numpy as np
import theano.tensor as tt
from scipy import stats
from scipy.special import hyp1f1, nctdtr
import warnings
from pymc3.theanof import floatX
from pymc3.distributions.dist_math import bound, gammaln
from pymc3.distributions.continuous import assert_negative_support, get_tau_sigma
from pymc3.distributions.distribution import Continuous, draw_values, generate_samples

class NonCentralStudentT(Continuous):
    """"
    Parameters
    ----------
    nu: float
        Degrees of freedom, also known as normality parameter (nu > 0).
    mu: float
        Location parameter.
    sigma: float
        Scale parameter (sigma > 0). Converges to the standard deviation as nu increases. (only required if lam is not specified)
    lam: float
        Scale parameter (lam > 0). Converges to the precision as nu increases. (only required if sigma is not specified)
    """"'

    def __init__(self, nu, nc, mu=0, lam=None, sigma=None, sd=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if sd is not None:
            sigma = sd
            warnings.warn("sd is deprecated, use sigma instead", DeprecationWarning)
        self.nu = nu = tt.as_tensor_variable(floatX(nu))
        self.nc = nc = tt.as_tensor_variable(floatX(nc))
        lam, sigma = get_tau_sigma(tau=lam, sigma=sigma)
        self.lam = lam = tt.as_tensor_variable(lam)
        self.sigma = self.sd = sigma = tt.as_tensor_variable(sigma)
        # NOTE: these moments are copied from the central StudentT; for the
        # noncentral case the mean and variance also depend on nc.
        self.mean = self.median = self.mode = self.mu = mu = tt.as_tensor_variable(mu)
        self.variance = tt.switch((nu > 2) * 1, (1 / self.lam) * (nu / (nu - 2)), np.inf)

        assert_negative_support(lam, 'lam (sigma)', 'NonCentralStudentT')
        assert_negative_support(nu, 'nu', 'NonCentralStudentT')
        assert_negative_support(nc, 'nc', 'NonCentralStudentT')

    def random(self, point=None, size=None):
        """
        Draw random values from Non-Central Student's T distribution.
        Parameters
        ----------
        point: dict, optional
            Dict of variable values on which random values are to be
            conditioned (uses default point if not specified).
        size: int, optional
            Desired size of random sample (returns one sample if not
            specified).
        Returns
        -------
        array
        """
        nu, nc, mu, lam = draw_values([self.nu, self.nc, self.mu, self.lam], point=point, size=size)
        return generate_samples(stats.nct.rvs, nu, nc, loc=mu, scale=lam ** -0.5, dist_shape=self.shape, size=size)

    def logp(self, value):
        """
        Calculate log-probability of Non-Central Student's T distribution at specified value.
        Parameters
        ----------
        value: numeric
            Value(s) for which log-probability is calculated. If the log probabilities for multiple
            values are desired the values must be provided in a numpy array or theano tensor
        Returns
        -------
        TensorVariable
        """
        nu = self.nu
        nc = self.nc
        mu = self.mu
        lam = self.lam

        # Port of SciPy's nct._pdf. NOTE: hyp1f1 comes from scipy.special and
        # operates on NumPy arrays, so it cannot consume Theano tensors; this
        # is the part that must be rewritten in Theano (see discussion below).
        n = nu * 1.0
        nc = nc * 1.0
        x2 = value * value
        ncx2 = nc * nc * x2
        fac1 = n + x2
        trm1 = n / 2. * tt.log(n) + gammaln(n + 1)
        trm1 -= n * tt.log(2) + nc * nc / 2. + (n / 2.) * tt.log(fac1) + gammaln(n / 2.)
        Px = tt.exp(trm1)
        valF = ncx2 / (2 * fac1)
        trm1 = tt.sqrt(2) * nc * value * hyp1f1(n / 2 + 1, 1.5, valF)
        trm1 /= fac1 * tt.gamma((n + 1) / 2)
        trm2 = hyp1f1((n + 1) / 2, 0.5, valF)
        trm2 /= tt.sqrt(fac1) * tt.gamma(n / 2 + 1)
        Px *= trm1 + trm2
        # logp must return the log of the density, not the density itself.
        return bound(tt.log(Px), lam > 0, nu > 0, nc > 0)

    def logcdf(self, value):
        """
        Compute the log of the cumulative distribution function for Non-Central Student's T distribution
        at the specified value.
        Parameters
        ----------
        value: numeric
            Value(s) for which log CDF is calculated. If the log CDF for multiple
            values are desired the values must be provided in a numpy array or theano tensor.
        Returns
        -------
        TensorVariable
        """
        nu = self.nu
        nc = self.nc

        # NOTE: nctdtr is also a NumPy-based SciPy routine and would need a
        # Theano port; it returns the CDF, so take its log here.
        return tt.log(nctdtr(nu, nc, value))
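
Assuming the class compiles, a quick smoke test of the random() method (with hypothetical parameter values) is to compare its draws against SciPy's nct:

import numpy as np
from scipy import stats

# Hypothetical check: draws from the custom class vs. scipy.stats.nct
nct_custom = NonCentralStudentT.dist(nu=5.0, nc=0.7, mu=0.0, sigma=1.0)
samples = nct_custom.random(size=10000)
print(np.mean(samples), stats.nct.mean(5.0, 0.7))  # should be close
print(np.std(samples), stats.nct.std(5.0, 0.7))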

My Custom model:

with pm.Model() as model:
    # Prior Distributions for unknown model parameters:
    mu = pm.Normal('mu', 0, 10)
    sigma = pm.Normal('sigma', 0, 10)
    nc = pm.HalfNormal('nc', sigma=10)
    nu = pm.HalfNormal('nu', sigma=1)

    # Likelihood (sampling distribution) of observations.
    # => Replace this placeholder Beta likelihood with the custom distribution:
    observed_data = pm.Beta('observed_data', alpha=alpha, beta=beta, observed=data)

    # draw 5000 posterior samples
    trace = pm.sample(draws=5000, tune=2000, chains=3, cores=1)

    # Obtaining Posterior Predictive Sampling:
    post_pred = pm.sample_posterior_predictive(trace, samples=3000)
    print(post_pred['observed_data'].shape)
    print('\nSummary: ')
    print(pm.stats.summary(data=trace))
    print(pm.stats.summary(data=post_pred))

I redesigned the custom model to include the custom distribution; however, I keep getting errors from the equations used for the likelihood, or the Theano compile lock kicks in and the code simply freezes. My code is below:

with pm.Model() as model:
    # Prior distributions for unknown model parameters:
    mu = pm.Normal('mu', mu=0, sigma=1)
    sd = pm.HalfNormal('sd', sigma=1)
    nc = pm.HalfNormal('nc', sigma=10)
    nu = pm.HalfNormal('nu', sigma=1)

    # Alternative via DensityDist:
    # observed_data = pm.DensityDist('observed_data', NonCentralStudentT, observed=data_list)

    # Likelihood (sampling distribution) of observations:
    observed_data = NonCentralStudentT('observed_data', mu=mu, sd=sd, nc=nc, nu=nu, observed=data_list)

    # Draw 5000 posterior samples:
    trace_S = pm.sample(draws=5000, tune=2000, chains=3, cores=1)

    # Posterior predictive sampling:
    post_pred_S = pm.sample_posterior_predictive(trace_S, samples=3000)
    print(post_pred_S['observed_data'].shape)
    print('\nSummary: ')
    print(pm.stats.summary(data=trace_S))
    print(pm.stats.summary(data=post_pred_S))

Hi Mojo_ild

Your general approach is correct, and your most recent model is good. There is no need to use DensityDist; just write the custom distribution in a class and call it with observed, as you have done.

But the logp function needs to be written purely in Theano. I see you are using hyp1f1 from SciPy, which is NumPy-based, so you will need to write it in Theano manually.
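
For illustration, here is a minimal sketch of what a Theano version might look like, using the truncated power series for 1F1(a; b; z) (a hypothetical helper, assuming scalar inputs with z > 0 and moderate magnitudes; not a production implementation):

import theano.tensor as tt

def hyp1f1_series(a, b, z, n_terms=64):
    # Truncated series 1F1(a; b; z) = sum_k (a)_k / (b)_k * z^k / k!,
    # with the rising factorial (a)_k = gamma(a + k) / gamma(a).
    # Assumes a, b, z are scalars with z > 0 (z == 0 needs a special case).
    k = tt.arange(n_terms)
    log_terms = (tt.gammaln(a + k) - tt.gammaln(a)
                 - tt.gammaln(b + k) + tt.gammaln(b)
                 + k * tt.log(z) - tt.gammaln(k + 1))
    return tt.sum(tt.exp(log_terms))

Whether the GitHub implementation you found is usable depends on the same criterion: it has to be built from Theano ops end to end, because anything that drops to NumPy mid-graph hits the problem above.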

Thank you for the comment. I have been looking online for a way to convert the function to Theano; the only definition I could find is in the following GitHub link: hyp1f1 function GitHub

Will this be enough to convert the function into Theano? Also, is it okay to use NumPy arrays with Theano?

I also thought of another way, but I am not sure whether it can be implemented. I looked into the nct function in SciPy, whose documentation says the following:

If Y is a standard normal random variable and V is an independent chi-square random variable ( chi2 ) with k degrees of freedom, then

X = (Y + c) / sqrt(V / k)

has a non-central Student’s t distribution on the real line. The degrees of freedom parameter k (denoted df in the implementation) satisfies k>0 and the noncentrality parameter c (denoted nc in the implementation) is a real number.

The probability density above is defined in the “standardized” form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, nct.pdf(x, df, nc, loc, scale) is identically equivalent to nct.pdf(y, df, nc) / scale with y = (x - loc) / scale .
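
For concreteness, that loc/scale identity can be checked numerically with SciPy (hypothetical values):

from scipy import stats

# nct.pdf(x, df, nc, loc, scale) == nct.pdf((x - loc) / scale, df, nc) / scale
x, df, nc, loc, scale = 1.3, 5.0, 0.7, 0.2, 2.0
y = (x - loc) / scale
print(stats.nct.pdf(x, df, nc, loc=loc, scale=scale))
print(stats.nct.pdf(y, df, nc) / scale)  # same value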

So, I thought of using the Normal and Chi-squared random-variable code from their distributions as priors, plugging the degrees-of-freedom variable mentioned before into the equation from SciPy. Will that be enough to get the distribution?

You could probably use that formula to combine standard normal and chi-squared samples into nct samples. But to run inference you need a formula for the log-pdf of the distribution.
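
As a sketch of that forward construction (hypothetical priors; note this gives you draws of X, not a likelihood you can attach to observed data):

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    nu = pm.HalfNormal('nu', sigma=1)    # degrees of freedom k
    nc = pm.HalfNormal('nc', sigma=10)   # noncentrality c
    Y = pm.Normal('Y', mu=0, sigma=1)    # standard normal variable
    V = pm.ChiSquared('V', nu=nu)        # chi-square with nu degrees of freedom
    # X = (Y + c) / sqrt(V / k) is non-central t distributed
    X = pm.Deterministic('X', (Y + nc) / tt.sqrt(V / nu))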

This link may also be useful to you. The code looks quite challenging to port into Theano to be honest.

You can pass NumPy arrays into PyMC3 models as inputs, and you can use them for static data within Theano functions. But when both your input and your output are Theano variables, you shouldn't use NumPy in between, for example when you want to use a PyMC3 distribution as an input.
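
A small example of the distinction (hypothetical data):

import numpy as np
import theano
import theano.tensor as tt

# Static data: a NumPy array can enter the graph as a constant or shared value.
data = np.array([1.0, 2.0, 3.0])
shared_data = theano.shared(data)

# Symbolic path: once the input is a Theano variable, stay with Theano ops.
v = tt.dvector('v')
f = theano.function([v], tt.log(v).sum())  # tt.log here, not np.log
print(f(data))  # a NumPy array is fine as a runtime input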

Hope this is helpful; sorry I cannot be more useful.

I managed to run the code in the link about fitting empirical distributions and found that the second-best fit was the Student's T distribution, so I will be using that. Thank you for your help. I just have a side question: I ran my model with the Student's T distribution but got these warnings:

There were 52 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.7037574708196309, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 10% for some parameters.

I am confused about these warnings. Do you have any idea what they mean?

Also, another side question: I printed the trace of my model and got this:

<MultiTrace: 3 chains, 5000 iterations, 5 variables>

I know that three of them are nu, mu, and sigma. How can I find out the rest of the variables?

To get the variable names you can simply use trace.varnames.
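
Note that bounded priors such as HalfNormal also appear in their transformed form (names ending in _log__), which is usually why the count is higher than the number of priors you defined. For example:

print(trace.varnames)
# hypothetical output for a StudentT model:
# ['mu', 'sigma_log__', 'nu_log__', 'sigma', 'nu']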

For diagnostics it’s best to use the library ArviZ.

import arviz as az
.....

    trace = pm.sample()
    idata = az.from_pymc3(trace)

Then you can use the many useful functions from ArviZ, such as az.plot_trace(), az.summary(), and az.plot_posterior().
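
Regarding the divergence warnings: a common first step is to raise target_accept and tune for longer, and to reparameterize the model if that is not enough. A sketch with hypothetical values (in recent PyMC3 versions target_accept is a direct keyword of pm.sample):

with model:
    trace = pm.sample(draws=5000, tune=4000, chains=3, target_accept=0.95)

az.summary() also reports r_hat and effective sample sizes, which speaks directly to the "effective samples" warning.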

Thank you, this helped me a lot! I have been using the Student's T and want to try your idea of using a normal distribution with a chi-squared. I am encountering an issue where Theano keeps creating a lock_dir file and stops my code from running; it has been running for a long time and is still not done. Is there a way to prevent Theano from creating the file?

I'm not too sure about the lock. Are you running on Windows? I vaguely remember having some issues on Windows, but I switched to Linux, which has the same sampling speed but faster compilation in my experience.

If you are on Windows you could attempt to use WSL.

Yes, I am using the Spyder IDE on Windows.