Decomposition of rates?

@chartl perhaps my problem can be reformulated into something more appropriate. This is my first use of these techniques, so I’m very green with both pymc3 and the underlying theory.

I’m trying to estimate the production rate of some manufacturing lines, including information on the variability. Unfortunately, I only know how many cartons were made within an 8-hour block, and sometimes multiple products are made in those blocks.

Simply stated, that’s:
hours = 8 = carton_sku1/carton_per_hour_sku1 + carton_sku2/carton_per_hour_sku2 + …

carton_per_hour_sku_n is what I want to estimate, and I assume it is normally distributed.
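To make the balance equation concrete, here is a tiny numeric instance (all numbers hypothetical; in reality the rates are the unknowns being estimated):

```python
# Hypothetical example: two SKUs produced in one 8-hour block
cartons = {"sku1": 1000, "sku2": 488}      # cartons observed per SKU
rate = {"sku1": 250.0, "sku2": 122.0}      # cartons per hour (unknown in practice)

# The time spent on each SKU must add up to the block length
hours = sum(cartons[s] / rate[s] for s in cartons)
print(hours)  # 1000/250 + 488/122 = 4.0 + 4.0 = 8.0
```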

When I solve this, I treat ‘hours’ as normally distributed with observed data of ‘8’ (whereas really it’s deterministic).

The end result is the correct mean for ‘carton_per_hour’, but all the variability ends up in the ‘hours’ variable.

Is there a way I can reframe this to get the desired result?


Hi @sverzijl. I’ve taken the liberty to move this question to a new topic (please feel free to edit the title).

If I’m understanding you correctly, one observation consists of a count (the total number of cartons of any type, call this y) and an indicator of which types of cartons are involved (say x = (0, 0, 1, 0, 1, 0) if the line was producing the 3rd and 5th types during that 8-hour period).

I would treat this as having a random vector \vec r, where r_i \sim \mathrm{Beta}(a,b) is the rate of production of the i-th type, and a collection of random variables \phi \sim \mathrm{Dir}_0(\alpha), where \phi_i gives the proportion of the 8-hour period spent making the i-th type. By \mathrm{Dir}_0(\alpha) I mean that, for a given binary vector x^{(j)} and proportion vector \phi^{(j)}, \phi^{(j)} = 0 wherever x^{(j)} = 0, and the non-zero components of \phi^{(j)} are Dirichlet-distributed with the appropriate dimensionality.
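To make \mathrm{Dir}_0 concrete: sampling it amounts to drawing a Dirichlet over just the active components and scattering the result back into place. A small numpy sketch (the indicator x and concentration α are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([0, 0, 1, 0, 1, 0])   # hypothetical indicator of active types
alpha = 1.0                        # symmetric Dirichlet concentration

# Sample a Dirichlet over just the active components, scatter back into place
phi = np.zeros(len(x))
active = np.flatnonzero(x)
phi[active] = rng.dirichlet(np.full(len(active), alpha))

print(phi)  # zero wherever x is zero; the active entries sum to 1
```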

One way to model this tractably is to pretend that the cartons being produced are interleaved rather than produced sequentially (all type-1, followed by all type-2, followed by …). Under this assumption, the total count looks like

\mathrm{Count} \sim \mathrm{Pois}(r_1\phi_1) + \mathrm{Pois}(r_2\phi_2) + \dots + \mathrm{Pois}(r_k\phi_k)
=\mathrm{Pois}(r_1\phi_1 + r_2\phi_2 + \dots + r_k\phi_k)
=\mathrm{Pois}(\langle r, \phi \rangle)
=\mathrm{Pois}(\Phi r, \mathrm{shape}=n_\mathrm{obs})
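The superposition step above (a sum of independent Poissons is Poisson with the summed rate) can be checked with a quick simulation; the rates and proportions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([5.0, 3.0, 2.0])    # hypothetical rates (cartons per hour)
phi = np.array([0.5, 0.3, 0.2])  # hypothetical time proportions

n = 200_000
# Left-hand side: sum of independent Poissons, one per carton type
summed = rng.poisson(r * phi, size=(n, 3)).sum(axis=1)
# Right-hand side: a single Poisson with the inner-product rate
single = rng.poisson(r @ phi, size=n)

print(summed.mean(), single.mean())  # both close to <r, phi> = 3.8
```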

So I think a Poisson likelihood is a defensible choice.

If the variance of the Poisson (equal to the mean) bothers you, you could use a Generalized Poisson distribution, which allows for over- or under-dispersion:

https://journals.sagepub.com/doi/pdf/10.1177/1536867X1201200412

Or use an adjusted normal approximation to the Poisson, N(\lambda, \delta \lambda).
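With \delta = 1 this approximation matches the Poisson’s mean and variance, and \delta \neq 1 buys over- or under-dispersion. A quick numpy check with a hypothetical \lambda:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, delta = 50.0, 1.0  # delta = 1 reproduces the plain Poisson variance

pois = rng.poisson(lam, size=200_000)
approx = rng.normal(lam, np.sqrt(delta * lam), size=200_000)

print(pois.mean(), pois.var())      # both near 50
print(approx.mean(), approx.var())  # both near 50 as well
```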


@chartl Thank you very much for your detailed response - I am going to test it out and see how I go.

Just a couple things to clarify:
I actually know more than just the total number of cartons made; I know the quantity of each carton type as well. For example: x = (0, 0, 1000, 0, 488, 0), not just 1488.

The other thing that I am struggling to wrap my mind around is:
ϕ is the proportion of the 8-hour period that each product ran, which I am estimating along with r_n, but ϕ is not independent of r_n… Does that matter?

I know the quantity of each carton made as well. For example: x = (0, 0, 1000, 0, 488, 0), not just 1488.

Then \mathrm{Count} \sim \mathrm{Pois}(\Phi\,\mathrm{diag}(r), \mathrm{shape}=(n_\mathrm{obs}, n_\mathrm{types}))
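In code, \Phi\,\mathrm{diag}(r) is just \Phi with each column scaled by the corresponding rate, so a broadcast elementwise product suffices (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
r = np.array([4.0, 6.0, 2.0])      # hypothetical per-type rates
Phi = np.array([[0.5, 0.5, 0.0],   # hypothetical proportions; rows sum to 1
                [0.0, 0.3, 0.7]])

# Phi @ diag(r) scales each column of Phi by the matching rate,
# which is the same as a broadcast elementwise product
mean = Phi * r                     # shape (n_obs, n_types)
counts = rng.poisson(mean)         # per-type counts; zero mean gives zero count

print(mean)
```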

but ϕ is not independent of r_n

For the posterior this will certainly be the case (slower cartons should take more time on average), though I’m not sure this should necessarily be encoded in the prior. I would see how this works before doing something more complicated.

@chartl I had put this on hold and came back to it today.

I implemented the above and it fits. Is there a way to specify the constraint that ϕ = 0 when x = 0? I tried excluding ϕ where x = 0 (by multiplying it by 0), but the model still gives ϕ a non-zero, relatively high value (0.06).
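For what it’s worth, masking alone leaves the surviving proportions summing to less than 1, so one sketch that makes the constraint exact is to mask and then renormalize. Here it is in numpy terms (the pymc3 version would apply the same algebra to tensors; whether the sampler behaves well with it is a separate question):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([0, 0, 1, 0, 1, 0])          # hypothetical indicator
phi_raw = rng.dirichlet(np.ones(len(x)))  # unconstrained proportions

# Multiplying by x zeroes the inactive entries, but the survivors
# no longer sum to 1 -- renormalize so phi is a valid proportion vector
phi = x * phi_raw / (x * phi_raw).sum()

print(phi)  # exactly zero where x == 0; remaining entries sum to 1
```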

Thanks in advance!

I tested the above in Excel using maximum likelihood, and it does indeed come up with the correct answer.

Now to implement it with PyMC (obviously Excel can’t handle the large number of variables).
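As a sanity check on the maximum-likelihood route: if the proportions \Phi are treated as known, the Poisson MLE of the rates has a closed form (total cartons of a type divided by the total time share spent on it), which a short simulation can confirm (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
r_true = np.array([200.0, 120.0, 80.0])    # hypothetical true rates
Phi = rng.dirichlet(np.ones(3), size=500)  # known time proportions per block
counts = rng.poisson(Phi * r_true)         # simulated per-type counts

# Closed-form Poisson MLE with Phi treated as known:
# r_i = (total cartons of type i) / (total time share spent on type i)
r_hat = counts.sum(axis=0) / Phi.sum(axis=0)

print(r_hat)  # close to [200, 120, 80]
```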