Hi, I’m new to PyMC3 and am using it for my master’s dissertation. I have a dataset of roughly 1.3 million data points, each with 20 dimensions, and about 40% of the entries in the data matrix are missing.
My goal is to reduce its dimensionality for further analysis. The way I deal with missing values is simple: assign a 20-dimensional Bernoulli mask that either returns the “true” value (when the entry is observed) or a placeholder (when the entry is missing). To make the model run with ADVI, I integrate out the Bernoulli latent, which gives a per-entry mixture of a Dirac component centered on the missing placeholder and a Gaussian component centered on the “true” value.
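Written out per entry (with beta_d the per-column missingness probability, m the placeholder value, and V U^T the low-rank reconstruction), the marginal I am fitting is roughly

$$
p(x_{nd}) \;=\; \beta_d\,\delta_{m}(x_{nd}) \;+\; (1-\beta_d)\,\mathcal{N}\!\left(x_{nd}\,\middle|\,(VU^{\top})_{nd},\ \tau_{nd}^{-1}\right)
$$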
Ultimately I want to recover the full NxR factor loading matrix for further analysis; however, given the size of my data, I can’t specify the V matrix with a shape of (N, R). Is there any simple fix for this problem? Or should I infer the DxR factor matrix with minibatches and then obtain the NxR loading matrix algorithmically?
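To make the second option concrete, this is roughly what I have in mind by “algorithmically obtain” (a sketch only; U_hat and the helper below are names I’m inventing here): plug in a posterior-mean estimate of the DxR factor matrix and solve a small ridge-regression problem per row, using only that row’s observed entries and, for simplicity, a single noise precision.

    import numpy as np

    def recover_loadings(X, observed_mask, U_hat, sigma2_v=1.0, noise_prec=1.0):
        """Conditional posterior mean of each row's loadings, given fixed factors U_hat (D x R)."""
        N, D = X.shape
        R = U_hat.shape[1]
        V_hat = np.zeros((N, R))
        for n in range(N):
            obs = observed_mask[n]              # boolean vector, True where x_nd is observed
            U_o = U_hat[obs]                    # (D_obs, R) rows of U for the observed dimensions
            A = noise_prec * U_o.T @ U_o + np.eye(R) / sigma2_v
            b = noise_prec * U_o.T @ X[n, obs]
            V_hat[n] = np.linalg.solve(A, b)    # ridge solution = conditional mean of v_n
        return V_hat

Would something like that be reasonable, or is there a more PyMC3-native way?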
On a slightly unrelated note, I am also looking to extend my current implementation with an Indian Buffet Process prior, using the following generative approach:
pi_r ~ Beta(alpha / R, 1)
z_nr ~ Bernoulli(pi_r)
The factor loading matrix then becomes the element-wise product V ∘ Z
I believe this is not too difficult to implement, since I can again integrate out the Bernoulli latents Z to obtain another mixture. Any thoughts?
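For reference, a minimal sketch of how I picture the finite Beta-Bernoulli approximation to the IBP in PyMC3 (Z kept explicit here; for ADVI I would still need to marginalise it into a spike-and-slab-style density on the loadings; R and batch_size as in the model below):

    # Sketch only: finite Beta-Bernoulli approximation to the IBP prior
    with pm.Model() as ibp_sketch:
        alpha = 1.0
        pi = pm.Beta('pi', alpha / R, 1.0, shape=R)              # per-feature activation probability
        Z = pm.Bernoulli('Z', p=pi, shape=(batch_size, R))       # binary feature assignments
        V = pm.Normal('loadings', 0.0, 1.0, shape=(batch_size, R))
        W = pm.Deterministic('masked_loadings', V * Z)           # element-wise masked loadings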
Below is my code:
    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt
    from pymc3.math import logsumexp

    MISSING_PLACEHOLDER = -9999

    # df_data: pandas DataFrame of shape (N, D) with NaNs marking missing entries
    df_data = df_data.fillna(MISSING_PLACEHOLDER)

    D = 20                # observed dimensionality
    N = len(df_data)      # ~1.3 million rows
    R = 3                 # latent dimensionality
    batch_size = 100

    minibatch = pm.Minibatch(df_data.values, batch_size=batch_size)


    class NormalMissingValMixture(pm.Continuous):
        """Entry-wise mixture of a point mass at the placeholder and a Gaussian."""

        def __init__(self, beta, mu, precision, missing_mu, D, *args, **kwargs):
            super(NormalMissingValMixture, self).__init__(*args, **kwargs)
            self.beta = beta              # per-column probability of an entry being missing
            self.mu = mu                  # low-rank reconstruction V @ U.T
            self.precision = precision    # per-entry noise precision
            self.missing_mu = missing_mu  # placeholder value for missing entries
            self.D = D
            self.mean = pm.floatX(np.zeros(D))

        def logp(self, x):
            # Missing component: point mass at the placeholder value
            missing_likelihood = tt.log(self.beta) + pm.Constant.dist(self.missing_mu).logp(x)
            # Observed component: Gaussian around the low-rank reconstruction
            observed_likelihood = tt.log(1 - self.beta) + pm.Normal.dist(self.mu, tau=self.precision).logp(x)
            logps = [missing_likelihood, observed_likelihood]
            # Marginalise the Bernoulli mask entry-wise, then sum over the minibatch
            return tt.sum(logsumexp(tt.stacklists(logps)[:, :x.shape[0]], axis=0))


    with pm.Model() as model:
        # ARD-style shared scale on factors and loadings
        ard_prior = pm.InverseGamma('ard', 1., 1.)
        U = pm.Normal('factors', 0, ard_prior, shape=(D, R))
        V = pm.Normal('loadings', 0, ard_prior, shape=(batch_size, R))

        # Per-entry noise precision (scale mixture)
        tau = pm.Gamma('tau', 1.0, 1.0, shape=batch_size * D)
        lambda_ = pm.Uniform('lambda', 0, 5, shape=batch_size * D)

        # True distribution: low-rank mean and reshaped per-entry precision
        mu = tt.dot(V, U.T)
        precision = tt.reshape(tau * lambda_, (batch_size, D))

        # Per-column missingness probability
        beta = pm.Beta('mask', 1, 1, shape=D)

        Y = NormalMissingValMixture('observations', beta, mu, precision,
                                    MISSING_PLACEHOLDER, D, observed=minibatch)

        advifit = pm.fit(100000, method='advi',
                         callbacks=[pm.callbacks.CheckParametersConvergence()])
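After fitting, I pull approximate posterior samples from the fitted approximation, e.g.

    trace = advifit.sample(1000)               # draws from the mean-field approximation
    U_hat = trace['factors'].mean(axis=0)      # posterior-mean DxR factor matrix

which is what I would plug into the per-row loading recovery sketched above.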
Cheers,
Hugo