Hi! I’m working with a dataset about technical verification of vehicles. The dataset contains the results of these verifications for a large number of vehicles over the years. As a first approximation, I want to model the probability of occurrence of each of the 3 states a certificate may be in. Since I was only given the SQL database, I built a dataset consisting of some tags and a binary encoding of each certificate’s result, that is:
Date | Certificate Number | City | Approve | Conditional | Rejected |
---|---|---|---|---|---|
… | … | … | 1 | 0 | 0 |
… | … | … | 0 | 0 | 1 |
And so on. The model I wrote is a multivariate Bernoulli, with the following code:
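import numpy as np
import pymc as pm  # assuming PyMC v4+, where pm.sample returns InferenceData

with pm.Model() as certificate_model:
    # Flat Dirichlet prior over the three state probabilities
    p = pm.Dirichlet('p', np.ones(3))
    # Bernoulli likelihood on the first 10,000 rows of the encoded data
    y = pm.Bernoulli('y', p=p, observed=data.head(10000))
    trace = pm.sample(1000, tune=1000)
    # ppc = pm.sample_posterior_predictive(trace)
trace.to_netcdf(f"analysis_data/trace_{YEAR}.nc")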
Which gives the following results (sorry for the misplaced label):
This works in the sense that it gives me some simple results, since I’m not yet trying to include contextual information in the model. However, I’m unsure whether this is the correct approach for modelling this data (I’ve never worked with this kind of categorical model). I would be interested in advice on how to model this dataset and on whether my approach needs any tweaks.
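In case it helps frame the question, here is a minimal sketch of a sanity check that the sampled posterior could be compared against. It assumes each row is a single categorical draw over the three states, which is not the same likelihood as the Bernoulli above. Under that assumption the flat Dirichlet prior is conjugate and the posterior is Dirichlet(1 + counts), so no sampler is needed. The `onehot` array is synthetic stand-in data with made-up probabilities (not the real certificates), and `dirichlet_posterior` is a hypothetical helper name.

```python
import numpy as np


def dirichlet_posterior(onehot, prior=None):
    """Posterior Dirichlet concentration for one-hot encoded rows.

    Assumes a categorical likelihood: every row has exactly one state set to 1.
    """
    onehot = np.asarray(onehot)
    if not np.all(onehot.sum(axis=1) == 1):
        raise ValueError("each row must have exactly one state set to 1")
    if prior is None:
        # Flat Dirichlet prior, matching np.ones(3) in the model above
        prior = np.ones(onehot.shape[1])
    counts = onehot.sum(axis=0)  # number of rows observed in each state
    return prior + counts


# Synthetic stand-in data (NOT the real certificates); probabilities are made up
rng = np.random.default_rng(0)
true_p = np.array([0.7, 0.2, 0.1])
states = rng.choice(3, size=10_000, p=true_p)
onehot = np.eye(3, dtype=int)[states]  # columns: Approve, Conditional, Rejected

alpha_post = dirichlet_posterior(onehot)
posterior_mean = alpha_post / alpha_post.sum()

# 95% credible intervals from direct draws of the closed-form posterior
samples = rng.dirichlet(alpha_post, size=4000)
lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)

for name, m, lo, hi in zip(["Approve", "Conditional", "Rejected"],
                           posterior_mean, lower, upper):
    print(f"{name}: mean={m:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

Comparing these numbers with the sampled `p` from the Bernoulli model on the same rows would show whether the two likelihoods agree for this data.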
Thanks!!