# Balancing Likelihood Terms for Extended LDA Model

Hi there,
I’ve been developing a version of LDA that removes a topic consisting of a known contaminating mixture, and I have two questions about the model specification. The model is trained on a sparse matrix `X` of doc x word counts and consists of K-1 standard LDA topics. The Kth topic proportion in `theta`, however, is drawn from a normal distribution with known mean and sd and normalized by the row sums of `X`. In addition, the Kth topic in `phi` is a known/fixed mixture of words. I’m not trying to learn the gaussian’s parameters or phi K’s mixture; I’m trying to constrain the LDA model using them.
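To make the theta construction concrete, here's a minimal NumPy sketch of how I intend the Kth column to work (toy sizes and hypothetical draws, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3                                      # toy sizes: 4 docs, 3 topics
rowsums = np.array([100.0, 200.0, 50.0, 400.0])  # total counts per doc

# Hypothetical draws: log10 ambient counts per doc, and K-1 gamma weights
ambdist = rng.normal(1.0, 0.3, size=D)
gammas = rng.gamma(1.0, 1.0, size=(D, K - 1))

# Kth proportion: expected ambient counts over the row total, capped at 1
amb_prop = np.minimum(10 ** ambdist / rowsums, 1.0)

# Remaining K-1 proportions: normalized gammas scaled by the leftover mass
rest = (1 - amb_prop)[:, None] * gammas / gammas.sum(axis=1, keepdims=True)
theta = np.concatenate([rest, amb_prop[:, None]], axis=1)

print(theta.sum(axis=1))  # each row sums to 1
```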

From my understanding, the likelihood of the whole model should be the LDA log likelihood `ll` plus the gaussian log likelihood `ambientll`. I’ve tried balancing the two terms by scaling each by the variables on which it depends, which works somewhat. Is there a good way to balance the terms in a chimeric model like this?
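One heuristic I've considered (just a sketch of the idea, with stand-in values, not something from the PyMC docs): average each term over the number of observations that contributes to it, so both sit on a per-observation scale, then expose a single tunable weight:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the two terms: one LDA term per nonzero count in X,
# one gaussian term per document (hypothetical values)
ll = rng.normal(-5.0, 1.0, size=10_000)      # ~nnz(X) multinomial terms
ambientll = rng.normal(-2.0, 0.5, size=100)  # D gaussian terms

# Per-observation averaging puts the terms on a comparable scale;
# w then controls how strongly the gaussian constrains theta
w = 1.0
balanced = ll.mean() + w * ambientll.mean()
```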

```python
model1 = pm.Model()

# Number of topics
K = 10
# X is a scipy sparse matrix of doc x word counts
(D, V) = X.shape
# Beta prior for the word-over-topic distributions (phi)
beta = np.ones((K - 1, V)) * 10
# Nonzero entries of X as (doc, word, count) triples
sparse_array = shared(np.array([X.nonzero()[0], X.nonzero()[1], X.data]).T.astype('int32'))
rowsums = shared(np.sum(X, axis=1).T)
sumall = shared(np.sum(X))

# Concatenate the Kth topic, drawn from the gaussian, onto the gammas,
# and make each row sum to one to turn the gammas into theta
def normalizerows(gammas, ambdist, rowsums, alpha=1e-9):
    # Ambient proportion: 10^ambdist counts over the row total, capped at 1
    fixambdist = tt.min(tt.concatenate([10 ** ambdist.T / rowsums, tt.ones(ambdist.T.shape)]), axis=0).reshape([gammas.shape[0], 1])
    # Scale the remaining K-1 topics by the leftover mass (computed in log10 space)
    normalized = 10 ** (tt.log10(1 - fixambdist + alpha) + tt.log10(gammas + alpha)) / (tt.sum(gammas + alpha, axis=1).reshape([gammas.shape[0], 1]))
    normalized = tt.concatenate([normalized, fixambdist], axis=1)
    return normalized

# Set the Kth topic in phi to the known mixture (saves ~10% training time)
def appendphi(phi, phiAmbient):
    return tt.concatenate([phi, phiAmbient.reshape([1, phi.shape[1]])], axis=0)

def log_lda_basic(theta, phi, value):
    # Multinomial term over the nonzero (doc, word, count) triples
    ll = value[:, 2] * pm.math.logsumexp(
        tt.log(theta[value[:, 0].astype('int32')] + 1e-10)
        + tt.log(phi.T[value[:, 1].astype('int32')] + 1e-10), axis=1).ravel()
    # Gaussian term on the ambient counts implied by the last column of theta
    ambientll = ambdist.distribution.logp(tt.log10((rowsums * theta[:, theta.shape[1] - 1]) + 1e-10))
    # If you don't multiply the LDA likelihood by a large number, ADVI won't fit properly
    return 1e9 * tt.sum(ll) + tt.sum(ambientll)

with model1:
    # Empirical parameters for the fixed gaussian
    ambientmu = pm.Deterministic('ambientmu', shared(ambmu))
    ambientsigma = pm.Deterministic('ambientsigma', shared(ambsd))
    ambdist = pm.TruncatedNormal('ambdist', shape=(D, 1), mu=ambientmu, sigma=ambientsigma, lower=.1)
    gammas = pm.Gamma('gammas', alpha=1, beta=1, shape=(D, K - 1))
    # theta is the Doc x Topic mixtures
    theta = pm.Deterministic('theta', normalizerows(gammas, ambdist, rowsums))
    phihat = pm.Dirichlet("phihat", a=beta, shape=(K - 1, V), transform=t_stick_breaking(1e-9))
    # phi is the Topic x Word mixtures
    phi = pm.Deterministic('phi', appendphi(phihat, shared(phiAmbient)))
    doc = pm.DensityDist('loglikelihood', log_lda_basic, observed=dict(theta=theta, phi=phi, value=sparse_array))

with model1: 