Sampler did not converge while running a topic model (Latent Dirichlet Allocation with a single topic per document)

I followed a blog post to build a toy topic model (Latent Dirichlet Allocation) with a single topic per document. However, I keep getting a warning that the sampler did not converge. I tried running for 15,000 iterations with various numbers of tuning iterations, but still did not get convergence. The traceplots look fine to me, but the Gelman-Rubin statistic is large, and I did not notice any significant autocorrelation. Code and results follow.

import numpy as np
import pymc3 as pm
import theano.tensor as T
from pymc3 import Model

# Define toy data
docs = [["sepak","bola","sepak","bola","bola","bola","sepak"],
        ["uang","ekonomi","uang","uang","uang","ekonomi","ekonomi"],
        ["sepak","bola","sepak","bola","sepak","sepak"],
        ["ekonomi","ekonomi","uang","uang"],
        ["sepak","uang","ekonomi"],
        ["komputer","komputer","teknologi","teknologi","komputer","teknologi"],
        ["teknologi","komputer","teknologi"]]

# Setup variables
def create_dict(docs):
    # Map each unique word to an integer id, in order of first appearance
    vocab = {}
    for doc in docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab
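# On the toy data above this yields:
# create_dict(docs) == {'sepak': 0, 'bola': 1, 'uang': 2, 'ekonomi': 3,
#                       'komputer': 4, 'teknologi': 5}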

def to_nparray(docs):
    # Documents have different lengths, so store them as a NumPy object array
    return np.array([list(doc) for doc in docs], dtype=object)

vocab = create_dict(docs)
collection = to_nparray(docs)
V = len(vocab)                   # vocabulary size
D = len(docs)                    # number of documents
Nd = [len(doc) for doc in docs]  # words per document
K = 3                            # number of topics
alpha = np.ones(K)               # symmetric Dirichlet prior over topics
beta = np.ones(V)                # symmetric Dirichlet prior over words
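# Sanity check on the toy data: V == 6, D == 7, Nd == [7, 7, 6, 4, 3, 6, 3]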

# Define Bayesian model
with Model() as model:
    # Global distribution over the K topics
    theta = pm.Dirichlet('theta', a=alpha, shape=K)
    # Per-topic distributions over the V vocabulary words
    phi = T.stack([pm.Dirichlet('phi_%i' % k, a=beta, shape=V) for k in range(K)])
    # One latent topic per document
    topic = T.stack([pm.Categorical('topic_%i' % d, p=theta,
                                    testval=np.random.randint(low=0, high=K))
                     for d in range(D)])
    # Each observed word is drawn from its document's topic
    words = T.stack([pm.Categorical('word_%i_%i' % (d, w), p=phi[topic[d]],
                                    observed=vocab[collection[d][w]])
                     for d in range(D) for w in range(Nd[d])])
    trace = pm.sample(2000, tune=1000)
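
The diagnostics mentioned above came from calls along these lines (a sketch against the PyMC3 3.x API):

# Convergence diagnostics
pm.traceplot(trace)            # visual mixing check per chain
pm.autocorrplot(trace)         # autocorrelation of the draws
print(pm.gelman_rubin(trace))  # Rhat values, also shown in the summary
print(pm.summary(trace))       # the table reproduced below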

# Results of Running Sampler
Multiprocess sampling (4 chains in 4 jobs) 
CompoundStep
NUTS: [phi_2, phi_1, phi_0, theta]
CategoricalGibbsMetropolis: [topic_6, topic_5, topic_4, topic_3, topic_2, topic_1, topic_0] 
Sampling 4 chains: 100%|██████████████████████████████████████████████████████| 12000/12000 [09:39<00:00, 6.03draws/s] 
The acceptance probability does not match the target. It is 0.6950269754271206, but should be close to 0.8. Try to increase the number of tuning steps. 
The acceptance probability does not match the target. It is 0.6449513659411206, but should be close to 0.8. Try to increase the number of tuning steps. 
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge. 
The estimated number of effective samples is smaller than 200 for some parameters

Summary Statistics Results:

|          | mean     | sd       | mc_error | hpd_2.5  | hpd_97.5 | n_eff       | Rhat     |
|----------|----------|----------|----------|----------|----------|-------------|----------|
| theta__0 | 0.326429 | 0.148393 | 0.004660 | 0.066455 | 0.620063 | 68.427656   | 1.038315 |
| theta__1 | 0.348330 | 0.151329 | 0.005352 | 0.076301 | 0.632969 | 27.723458   | 1.063296 |
| theta__2 | 0.325240 | 0.148128 | 0.005111 | 0.072891 | 0.612247 | 37.291276   | 1.051810 |
| phi_0__0 | 0.335522 | 0.171161 | 0.014080 | 0.001503 | 0.583588 | 2.844687    | 1.899209 |
| phi_0__1 | 0.289792 | 0.165516 | 0.013664 | 0.000005 | 0.529054 | 2.805311    | 1.915413 |
| phi_0__2 | 0.142765 | 0.164167 | 0.015015 | 0.000102 | 0.490383 | 2.342530    | 2.715749 |
| phi_0__3 | 0.128347 | 0.140347 | 0.012431 | 0.000047 | 0.428061 | 2.471946    | 2.367308 |
| phi_0__4 | 0.052262 | 0.047657 | 0.000864 | 0.000036 | 0.147874 | 2101.527291 | 0.999709 |
| phi_0__5 | 0.051312 | 0.047706 | 0.000872 | 0.000035 | 0.146897 | 2674.015428 | 0.999037 |
| phi_1__0 | 0.081276 | 0.063063 | 0.001973 | 0.000002 | 0.204864 | 57.973291   | 1.041411 |
| phi_1__1 | 0.057955 | 0.054908 | 0.001323 | 0.000003 | 0.167564 | 1232.105065 | 1.015667 |
| phi_1__2 | 0.235744 | 0.185226 | 0.016569 | 0.000240 | 0.531797 | 2.431417    | 2.448049 |
| phi_1__3 | 0.208706 | 0.163066 | 0.013980 | 0.000009 | 0.475488 | 2.613935    | 2.124938 |
| phi_1__4 | 0.190877 | 0.167152 | 0.014021 | 0.000013 | 0.491401 | 2.716883    | 2.000655 |
| phi_1__5 | 0.225441 | 0.197889 | 0.017274 | 0.000074 | 0.568331 | 2.540449    | 2.235114 |
| phi_2__0 | 0.164252 | 0.168114 | 0.015004 | 0.000011 | 0.512720 | 2.453771    | 2.401930 |
| phi_2__1 | 0.134444 | 0.146320 | 0.012789 | 0.000020 | 0.443560 | 2.544000    | 2.232140 |
| phi_2__2 | 0.150659 | 0.166300 | 0.014851 | 0.000048 | 0.494088 | 2.450101    | 2.403961 |
| phi_2__3 | 0.135566 | 0.140907 | 0.012189 | 0.000020 | 0.426898 | 2.577825    | 2.167695 |
| phi_2__4 | 0.190733 | 0.166964 | 0.014085 | 0.000327 | 0.498736 | 2.708793    | 2.023608 |
| phi_2__5 | 0.224346 | 0.198082 | 0.017416 | 0.000070 | 0.558734 | 2.507467    | 2.293637 |

Am I doing something wrong, or is this just an artifact of the toy data? If I am doing everything right, what can I do to get convergence? Thank you for your guidance and advice.

If the trace plots look fine (good mixing on visual inspection), this is usually an indication of label switching: the likelihood is invariant under permutations of the topic labels, so different chains can converge to different relabelings of the same solution. You can search this Discourse for discussions of and techniques for dealing with mixture models. A quick check is sketched below.
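
As an illustration, you can compare each chain's posterior mean of the topic-word distributions; under label switching each chain recovers the same topics, just under permuted labels (a sketch, reusing the phi_0…phi_2 and trace names from the model above):

# Compare per-chain posterior means of each topic's word distribution
for k in range(K):
    for c in range(4):  # four chains were run
        chain_mean = trace.get_values('phi_%i' % k, chains=[c]).mean(axis=0)
        print('phi_%i, chain %i:' % (k, c), chain_mean.round(2))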

Yes, the Gelman-Rubin statistic won't work for these models: when chains settle into different permutations of the topic labels, the between-chain variance stays large even though each chain has individually converged.

You can check https://docs.pymc.io/notebooks/gaussian_mixture_model.html for ideas on breaking the symmetries somehow, e.g. with an ordering constraint (see the sketch below). But if mixing looks fine, I wouldn't worry about it.
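
That notebook breaks the permutation symmetry of a Gaussian mixture by penalizing unordered component means with a pm.Potential. A rough adaptation to the model above (an untested sketch; the choice to order topics by their probability of word 0 is arbitrary) would be added before calling pm.sample:

with model:
    # Arbitrary tie-break: require phi_0[0] < phi_1[0] < phi_2[0], so the
    # K! relabelings of the topics are no longer all equally likely
    order_topics = pm.Potential(
        'order_topics',
        T.switch(T.all(phi[1:, 0] > phi[:-1, 0]), 0.0, -np.inf))

Note that hard -inf constraints can themselves hinder NUTS; an ordered transform, or relabeling the draws after sampling, are gentler alternatives.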