Hi Experts,
I'm trying to build a hierarchical model of how the amount of code (in a specific language) varies across repos and projects within an organisation.
The goal of my project was to identify knowledge silos, based on the languages used in repositories, by analyzing the posterior distributions of language use. The posterior can serve as a diagnostic tool, revealing languages that are disproportionately relied upon and flagging anomalies that may indicate knowledge silos.
By proactively identifying these silos, we can mitigate potential bottlenecks in knowledge transfer and code maintenance, diversifying skills and providing training where it would be most beneficial.
It's for fun as I'm learning, but a full (reproducible Poetry env) repo of the work is here:
It lets you recreate dummy data using generate_dummy_data.py, and the analysis is in tribal_knowledge.ipynb.
I've been able to create simple models, but as soon as I try more complex examples, I run into shape mismatches.
# Build the model
with pm.Model() as language_usage_model:
    # Organisational-level hyperpriors that influence project-level parameters.
    # These encode our assumptions about variability across all projects.
    a_mu = pm.Gamma('a_mu', alpha=1.0, beta=1.0, initval=2.0)
    b_mu = pm.Gamma('b_mu', alpha=1.0, beta=1.0, initval=2.0)
    a_kappa = pm.Gamma('a_kappa', alpha=1.0, beta=0.1, initval=2.0)
    b_kappa = pm.Gamma('b_kappa', alpha=1.0, beta=0.1, initval=2.0)

    # Language-level priors: each language has an associated effect on the byte counts
    language_effect = pm.Normal('language_effect', mu=0, sigma=1, shape=n_languages)

    # Project-level priors capturing the mean and variability of language usage
    # within each project: 'mu_project' is the expected proportion of language
    # usage within projects; 'kappa_project' captures its variability.
    mu_project = pm.Beta('mu_project', alpha=a_mu, beta=b_mu, shape=n_projects)
    kappa_project = pm.Gamma('kappa_project', alpha=a_kappa, beta=b_kappa, shape=n_projects)

    # Repository-level effects influenced by their respective projects:
    # 'theta_repo' is the expected proportion of language usage within each
    # repository, influenced by the project to which it belongs.
    theta_repo = pm.Beta('theta_repo',
                         alpha=mu_project[project_idx] * kappa_project[project_idx],
                         beta=(1 - mu_project[project_idx]) * kappa_project[project_idx],
                         shape=n_repos)

    # Repository-level likelihood adjusted for the language effect: here we
    # assume the byte counts are influenced by both the repository effect and
    # the language effect.
    mu_repo_lang = theta_repo[repo_idx] + language_effect[languages_idx]

    # Dispersion parameter for the Negative Binomial distribution, accounting
    # for overdispersion in the byte counts; 'dispersion_factor' controls the
    # variance independently of the mean.
    dispersion_factor = pm.Exponential('dispersion_factor', lam=1.0)

    # The Negative Binomial likelihood models the observed byte counts,
    # influenced by both repository and language effects.
    language_count = pm.NegativeBinomial('language_count', mu=mu_repo_lang,
                                         alpha=dispersion_factor, observed=byte_count)
This results in:
AssertionError: SpecifyShape: Got shape (1740,), expected (15,).
Apply node that caused the error: SpecifyShape(AdvancedSubtensor.0, 15)
Toposort index: 19
Inputs types: [TensorType(float64, shape=(None,)), TensorType(int8, shape=())]
Inputs shapes: [(1740,), ()]
Inputs strides: [(8,), ()]
Inputs values: ['not shown', array(15, dtype=int8)]
Inputs type_num: [12, 1]
Outputs clients: [[Mul(SpecifyShape.0, AdvancedSubtensor.0)]]
This feels like I've made a typo / not quite understood how to wire the data into the model, rather than an ill-defined model.
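For what it's worth, here's a minimal NumPy sketch of what I think is going on: `project_idx` is per-observation (length 1740), so `mu_project[project_idx]` has shape `(1740,)`, while `theta_repo` is declared with `shape=n_repos` (15). The name `project_idx_by_repo` below is hypothetical, just to illustrate an index with one entry per repo:

```python
import numpy as np

n_projects, n_repos, n_obs = 3, 15, 1740
rng = np.random.default_rng(0)

mu_project = rng.random(n_projects)                # shape (3,)
project_idx = rng.integers(0, n_projects, n_obs)   # per-observation index, shape (1740,)

# What my model currently does: fancy indexing broadcasts to the index length,
# so alpha ends up with one entry per observation, not per repo.
alpha = mu_project[project_idx]
print(alpha.shape)  # (1740,) -- but theta_repo expects (15,)

# What I suspect it should be: one project index per repo (hypothetical name),
# so the Beta parameters line up with shape=n_repos.
project_idx_by_repo = rng.integers(0, n_projects, n_repos)
alpha_fixed = mu_project[project_idx_by_repo]
print(alpha_fixed.shape)  # (15,)
```

Does that sound like the right diagnosis, i.e. that I should index the project-level parameters with a repo-to-project mapping rather than the observation-level index?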