Hierarchical model for language use in GitHub repos

Hi Experts,

I’m trying to build a hierarchical model to see how the amount of code (in a specific language) varies across repos and projects within an organisation.

The goal of my project is to identify knowledge silos, based on the languages used in repositories, by analyzing the posterior distributions of language use. The posterior can serve as a diagnostic tool, revealing languages that are disproportionately relied upon and flagging anomalies that may indicate knowledge silos.

By proactively identifying these silos, we can mitigate potential bottlenecks in knowledge transfer and code maintenance by diversifying and providing training in areas that could be beneficial.

It’s for fun as I’m learning, but a full repo of the work (with a reproducible poetry env) is here:

The repo lets you recreate dummy data using generate_dummy_data.py, and the analysis is in tribal_knowledge.ipynb.

I’ve been able to create simple models, but as soon as I try to work with more complex examples, I encounter shape mismatches.

import pymc as pm

# Build the model
with pm.Model() as language_usage_model:
    # Define organisational-level hyperpriors that influence project-level parameters
    # These represent our assumptions about the variability across all projects
    a_mu = pm.Gamma('a_mu', alpha=1.0, beta=1.0, initval=2.0)
    b_mu = pm.Gamma('b_mu', alpha=1.0, beta=1.0, initval=2.0)
    a_kappa = pm.Gamma('a_kappa', alpha=1.0, beta=0.1, initval=2.0)
    b_kappa = pm.Gamma('b_kappa', alpha=1.0, beta=0.1, initval=2.0)

    # Language-Level Priors
    # Assuming each language has an associated effect on the byte counts
    language_effect = pm.Normal('language_effect', mu=0, sigma=1, shape=n_languages)
    # Define project-level priors that capture the mean and variability of language usage within each project
    # 'mu_project' represents the expected proportion of language usage within projects
    # 'kappa_project' captures the variability of language usage within projects
    mu_project = pm.Beta('mu_project', alpha=a_mu, beta=b_mu, shape=n_projects)
    kappa_project = pm.Gamma('kappa_project', alpha=a_kappa, beta=b_kappa, shape=n_projects)
    # Define repository-level effects influenced by their respective projects.
    # 'theta_repo' represents the expected proportion of language usage within
    # each repository, influenced by the project to which it belongs.
    theta_repo = pm.Beta('theta_repo', alpha=mu_project[project_idx] * kappa_project[project_idx],
                         beta=(1 - mu_project[project_idx]) * kappa_project[project_idx],
                         shape=n_repos)
    # Repository-Level Likelihood adjusted for language effect
    # Here, we assume that the byte counts are influenced by both the repository effect and the language effect
    mu_repo_lang = theta_repo[repo_idx] + language_effect[languages_idx]
    # Define the dispersion parameter for the Negative Binomial distribution to account for
    # overdispersion in the byte counts. 'dispersion_factor' controls the variance of the
    # Negative Binomial distribution independently from the mean.
    dispersion_factor = pm.Exponential('dispersion_factor', lam=1.0)
    # The Negative Binomial distribution models the observed byte counts, now influenced by both repository and language effects
    language_count = pm.NegativeBinomial('language_count', mu=mu_repo_lang, alpha=dispersion_factor, observed=byte_count)

This results in

AssertionError: SpecifyShape: Got shape (1740,), expected (15,).
Apply node that caused the error: SpecifyShape(AdvancedSubtensor.0, 15)
Toposort index: 19
Inputs types: [TensorType(float64, shape=(None,)), TensorType(int8, shape=())]
Inputs shapes: [(1740,), ()]
Inputs strides: [(8,), ()]
Inputs values: ['not shown', array(15, dtype=int8)]
Inputs type_num: [12, 1]
Outputs clients: [[Mul(SpecifyShape.0, AdvancedSubtensor.0)]]

This feels like I’ve made a typo or not quite understood how to apply the data, rather than having an ill-defined model.

Hi. I think that’s because your theta_repo Beta distribution has an assigned shape = n_repos, but the parameters for the distribution are indexed with project_idx (which I assume is intended for shape = n_projects). Here’s a reproducible example of the same behaviour:

import numpy as np
import pymc as pm

# 1000 fake observations
obs = np.random.normal(0, 1, 1000)
# Both index arrays have one entry per observation (length 1000)
idx_1 = np.array([np.arange(10) for i in range(100)]).flatten()
idx_2 = np.repeat(np.arange(20), 50)

with pm.Model() as mod:
    a = pm.Normal('a', 0, 1, shape=10)
    # a[idx_1] has length 1000, which conflicts with shape=20
    b = pm.Normal('b', mu=a[idx_1], sigma=1, shape=20)
    y = pm.Normal('y', mu=b[idx_2], sigma=1, observed=obs)

If you run mod.debug() on the model I wrote above, you’ll see that you get the “shapes must be equal” assertion error. That’s because b is declared with shape=20 but is parametrised with a[idx_1], and idx_1 is an index intended for a (shape=10) with one entry per observation, so a[idx_1] has 1000 entries rather than 20.
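For the original model, one possible fix is to index the project-level parameters with an array that has exactly one entry per repo rather than one per observation. Below is a minimal NumPy-only sketch of how such an index might be built. The name repo_to_project and all the values are made up for illustration, and it assumes repo_idx and project_idx are observation-level integer arrays where each repo belongs to a single project.

```python
import numpy as np

# Hypothetical observation-level data: one row per (repo, language) byte count.
# The names mirror the question; the values are invented for illustration.
n_projects, n_repos = 3, 6
repo_idx = np.array([0, 0, 1, 2, 2, 3, 4, 4, 5, 5])      # repo of each observation
project_idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])   # project of each observation

# Collapse to one entry per repo: repo_to_project[r] is the project that owns repo r.
repo_to_project = np.full(n_repos, -1)
repo_to_project[repo_idx] = project_idx

# Sanity checks: every repo got a project, and each repo maps to a single project.
assert (repo_to_project >= 0).all()
assert (repo_to_project[repo_idx] == project_idx).all()

# Stand-in for a project-level variable such as mu_project (assumed values).
mu_project = np.linspace(0.2, 0.8, n_projects)

print(mu_project[project_idx].shape)      # (10,) one value per observation
print(mu_project[repo_to_project].shape)  # (6,)  one value per repo
```

With an index like this, theta_repo (shape=n_repos) could be parametrised with mu_project[repo_to_project] and kappa_project[repo_to_project], while theta_repo[repo_idx] in the likelihood stays at the observation level. I haven’t run this against the actual notebook, so treat it as a sketch.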


Hi Simon,

Thanks for the response, this was exactly the issue I was facing. Many thanks for helping!
