Discrete distribution question

I am a bit new to pymc3 and probabilistic programming in general. I have a dataset on widgets. The widgets get inspected at a certain frequency, and while an inspection is happening the machine that produces them is down. We think the widgets are sampled far too often: for example, if the natural defect rate is 1 in 10 on average, the widgets are being inspected at a rate of 1 in 5.

I want to test my hypothesis, and to do so I am considering pymc3. My thought is:

  1. Build a pymc3 model with an uninformative prior for the defect rate (flat normal)
  2. Calculate the likelihood for defect rate using the daily data
  3. The posterior will show the defect rate distribution.
  4. We will set the inspection frequency such that the chance of the defect rate being higher than the inspection rate is <= 5%

I have daily data for each widget inspection and its outcome. Can anyone verify this thought process? Specifically, I am not sure how to populate the observed parameter in the likelihood. I think I will pass in the daily defect rate, but I am not sure whether it should be daily, weekly, or at some other frequency. Can anyone guide me here?
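
For concreteness, steps 1-3 might look something like this (a rough sketch with made-up daily counts; I used a flat Beta rather than a normal, since a rate lives in [0, 1]):

import numpy as np
import pymc3 as pm

# hypothetical daily data: widgets inspected and widgets found defective each day
n_inspected = np.array([40, 38, 41, 40, 39])
n_defective = np.array([5, 3, 6, 4, 4])

with pm.Model() as model:
    # step 1: uninformative (flat) prior on the defect rate
    defect_rate = pm.Beta('defect_rate', 1., 1.)
    # step 2: binomial likelihood of the observed daily counts
    pm.Binomial('defects', n=n_inspected, p=defect_rate, observed=n_defective)
    # step 3: the trace approximates the posterior over the defect rate
    trace = pm.sample(1000, tune=1000)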

Hi add53-

I suspect that you have a more detailed understanding of the process than what you’ve written down. Could you be a little more explicit?

For instance: you write that “the natural defect rate is 1 in 10 on average.” This, to me, means that any widget is independent of all others and has a 10% chance of being defective. In that case the inspection frequency has no bearing whatsoever on the observed defect rate; the number of defects seen in a window of length r will always be Binomial(N*r, 0.1), with N the number of widgets produced per unit time.
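
To see this concretely, a quick simulation with made-up numbers shows that the fraction of inspected widgets found defective does not depend on how often you inspect:

import numpy as np

np.random.seed(0)
N = 100000
defects = np.random.rand(N) < 0.10  # each widget independently defective w.p. 0.10

for k in [2, 5, 10, 50]:            # inspect every k-th widget
    inspected = defects[k-1::k]
    print(k, inspected.mean())      # observed defect rate ~0.10 regardless of k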

If I were to venture a guess, you’re probably looking for a change in the defect rate; i.e. you want to detect when the machine is producing too many defects, and capture it quickly enough to avoid incurring too much cost. Is this right?

So there is a whole aspect here of how to model the costs and the decision of whether to inspect a machine at a given moment, given the length of time since the last inspection and the result of that inspection. A common model used in these cases is a POMDP, and the general subject heading is Operations Research. My recommendation is to seek out expert advice, or at the very least obtain a textbook on the subject.

That said, there is the purely inferential problem from the daily inspection data. From the way you are talking about the problem, I infer that the production rate is low, and that once a machine produces a defective widget, it will continue to produce only defective widgets until it is repaired.

In this case, the quantity of interest is the number of successful widgets before the first failure – the geometric distribution. And (if my assumption is correct) then you have a very mildly censored version: If you’re testing every 10 widgets, and a machine fails on the 9th inspection, then the machine started failing somewhere between the 81st and 90th widget – so it’s a geometric with a ceiling function. Something like this should work:

import numpy as np
import pymc3 as pm
import theano.tensor as tt


class CeilingGeometric(pm.Discrete):
    """Inspection index of the first detected failure when only every k-th
    widget is inspected: a Geometric pushed through a ceiling function."""

    def __init__(self, p, k, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.k = k
        self.inner_dist = pm.Geometric.dist(p)
        self.inner_logp_ = self.inner_dist.logp
        self.inner_rand_ = self.inner_dist.random
        self.k_ = 1 + np.arange(k)  # the k widget offsets within one inspection window

    def logp(self, value):
        # sum the geometric probability over the k widget positions consistent
        # with a failure first detected at inspection index `value`
        K = tt.tile(self.k_, value.shape[0]).reshape((self.k, value.shape[0]))
        return tt.log(tt.sum(tt.exp(self.inner_logp_(self.k*(value-1) + K)), axis=0))

    def random(self, point=None, size=None):
        # draw the widget index of the first failure, then map it to the
        # inspection index at which it would be detected
        underlying = self.inner_rand_(point, size)
        return np.array((1 + np.floor(underlying/self.k)), dtype=np.int64)


test_every = 10  # inspect every 10th widget

with pm.Model() as mod:
    fail_rate = pm.Beta('fail_rate', 1., 1.)
    first_fail = CeilingGeometric('first_fail', fail_rate, test_every, testval=2,
                                  observed=np.array([2, 3, 3, 1, 2, 2, 10, 1, 3, 1, 5,
                                                     1, 4, 15, 3, 2, 7, 2, 1, 3, 2, 6,
                                                     1, 10]))
    tr_pr = pm.sample_prior_predictive(100)
    tr = pm.sample(tune=800, chains=3, cores=2)


Take a look at the random implementation, which should give you a sense of exactly what the distribution computes (and thus what observed= should be).

widget_fail_rate = 0.02
test_every = 10
# simulated observations: the inspection index at which each machine's failure
# is first detected
obs = np.array(1 + np.floor(np.random.geometric(widget_fail_rate, size=100)/test_every), dtype=int)

Edit: I should also point out that

tt.log(tt.sum(tt.exp(self.inner_logp_(self.k*value + K)), axis=0))

is numerically unstable and you should use

pm.math.logsumexp(self.inner_logp_(self.k*(value-1) + K), axis=0)

instead. The use of (value-1) in the second line fixes a bug present in the first version of the code.
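
Assembled, the stable version of the method would read (the same logp as above, with logsumexp swapped in):

    def logp(self, value):
        K = tt.tile(self.k_, value.shape[0]).reshape((self.k, value.shape[0]))
        # stable log-sum-exp over the k widget positions within the window
        return pm.math.logsumexp(self.inner_logp_(self.k*(value-1) + K), axis=0)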

Also, somehow the posterior isn’t matching my intuition for extreme (i.e. uninformative) values of k. For instance, if k=100 and we observe [1,1,1,1,…,1] (i.e. the failure always occurs somewhere in the first 100 widgets), then the posterior ought to down-weight values of p < 1/k; instead I’m finding that values of p ~ 1 also have low density in the posterior. I don’t think this should be happening:

import matplotlib.pyplot as plt

test_every = 100  # k = 100, per the example above

for j in [1, 10, 25, 50, 100]:
    with pm.Model() as mod:
        fail_rate = pm.Beta('fail_rate', 1., 1.)
        first_fail = CeilingGeometric('first_fail', fail_rate, test_every, testval=2,
                                      observed=np.ones(j))
        tr = pm.sample(tune=2500, draws=400, chains=6, cores=2,
                       nuts_kwargs={'target_accept': 0.90})
        pm.traceplot(tr)
        plt.figure()

This seems to be a general consequence of having k > 1/p, even when the observations take values other than 1; for instance

widget_fail_rate = 0.03
test_every = 50
obs = np.array(1 + np.floor(np.random.geometric(widget_fail_rate, size=1000)/test_every), dtype=int)

shows an obvious downward bias:

and I can’t figure out why this might be the case…

The CeilingGeometric implementation indexes inspections: obs=1 means that the first time the machine was inspected, it had already failed. Note that the first inspection corresponds to the kth produced widget, since we are inspecting every k widgets.

So [2, 3, 3...] means failure to pass the 2nd inspection, 3rd inspection, 3rd inspection. If k=10 this corresponds to the 20th widget, 30th widget, 30th widget.

Importantly, the machine may start failing on widget 22, but the failure is only detected at widget 30, since the machine won’t be inspected until that point.
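
To make the indexing concrete (widget_range is just a hypothetical helper for illustration, not part of the model):

def widget_range(obs, k):
    # range of widgets consistent with a failure first detected at
    # inspection index `obs`, when inspecting every k-th widget
    return (k * (obs - 1) + 1, k * obs)

print(widget_range(1, 10))  # (1, 10): failed somewhere in the first 10 widgets
print(widget_range(3, 10))  # (21, 30): may start failing on widget 22, detected at 30
print(widget_range(9, 10))  # (81, 90): the example from earlier in the thread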


“we can decide our inspection frequency such that it caters to most of the distribution (i.e. 90% or 95%).”

This is the point at which Bayesian modeling stops, and cost modeling starts. As a first pass, I’d consider a cost per defective widget c_d, and a cost per inspection c_i, and some (fixed) time window, say 5000 total widgets. Then what you care about is

d_0 = 0
b_i \mid p \sim \mathrm{Bernoulli}(p)
d_i \mid d_{i-1}, b_i = \mathrm{min}(1, d_{i-1} + b_i) \times \mathbf{1}[i \; \mathrm{mod} \; k \neq 0]
C_i \mid d_i, C_{i-1} = C_{i-1} + c_d \, d_i + c_i \, \mathbf{1}[i \; \mathrm{mod} \; k = 0]

where \mathbf{1}[\cdot] is the indicator function: the defect state d_i resets to zero at each inspection step, and the inspection cost is incurred only at those steps.

This is really easier to do outside of theano/pymc3:

c_d = 6      # cost per defective widget produced
c_i = 12     # cost per inspection
bigT = 1000  # time horizon, in widgets produced

import seaborn as sbn

p_draws = np.random.beta(20, 200, size=1000)  # this would be your trace
total_costs = dict()
for k in range(4, 20, 2):                     # candidate inspection intervals
    total_cost = np.zeros(1000, dtype=np.float32)
    d = np.zeros(1000)                        # indicator: machine currently defective
    for i in range(bigT):
        if (i+1) % k == 0:
            # inspection step: any defect is caught and repaired; pay c_i
            d = np.zeros(1000)
            total_cost = total_cost + c_i
        else:
            # production step: the machine may start failing, and a failing
            # machine produces a defective widget costing c_d
            b = np.random.binomial(1, p_draws)
            d = np.minimum(d + b, 1)
            total_cost = total_cost + c_d * d
    total_costs[k] = total_cost

for k in total_costs.keys():
    sbn.kdeplot(total_costs[k], label='k={}'.format(k))

[figure: KDE of projected total cost for each inspection interval k]

Now you can, say, minimize the expected cost, or minimize the 80th percentile of the projected cost, through your choice of inspection schedule…
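
For example, with the simulated total_costs above, either criterion is a one-liner (np is numpy, as in the earlier snippets):

# inspection interval minimizing the expected (mean) projected cost
best_k_mean = min(total_costs, key=lambda k: total_costs[k].mean())

# or minimizing the 80th percentile of the projected cost
best_k_p80 = min(total_costs, key=lambda k: np.percentile(total_costs[k], 80))

print(best_k_mean, best_k_p80)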

Nope. I’m not sure what you’re trying to minimize; but you’ve probably got a better sense of the issue than I do. Is the inferential problem solved, at least?

@chartl I guess my question is: can we infer the “natural” failure rate from the posterior and adjust the inspection frequency to cater to that rate?

Yes; but you’re assuming that setting the inspection rate to match the failure rate is “better.” In my mind this is an entirely unwarranted assumption. If the cost of an inspection is some multiple of the cost of a defective product (call this multiple “alpha”), then there is only a small regime where this assumption holds. For a failure rate of 5%, this corresponds to an alpha somewhere between 5 and 6:

[figure: projected cost by inspection interval at a 5% failure rate, for alpha between 5 and 6]

and even then there’s some ambivalence. Higher relative costs mean inspecting less frequently than the natural rate:

[figure: projected cost by inspection interval for a higher alpha]

and lower costs mean inspecting more frequently:

[figure: projected cost by inspection interval for a lower alpha]

So even though cost modeling is beyond the scope of the project, your stated solution makes implicit assumptions about the cost model, which is why I cannot say whether it makes sense: I don’t know enough to judge whether it is reasonable or not.
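
For what it’s worth, the regime claim can be checked by wrapping the cost simulation from earlier in a function of the cost ratio and scanning alpha; a rough sketch (the grids and sample sizes here are arbitrary):

def best_k(alpha, p_draws, bigT=1000, ks=range(2, 40, 2)):
    # cost-minimizing inspection interval when c_i = alpha * c_d (with c_d = 1)
    c_d, c_i = 1.0, float(alpha)
    best, best_cost = None, np.inf
    for k in ks:
        total_cost = np.zeros(len(p_draws))
        d = np.zeros(len(p_draws))
        for i in range(bigT):
            if (i + 1) % k == 0:
                d = np.zeros(len(p_draws))   # inspection: defect caught and repaired
                total_cost += c_i
            else:
                b = np.random.binomial(1, p_draws)
                d = np.minimum(d + b, 1)     # a failing machine keeps producing defects
                total_cost += c_d * d
        if total_cost.mean() < best_cost:
            best, best_cost = k, total_cost.mean()
    return best

p_draws = np.full(200, 0.05)  # a fixed 5% failure rate
for alpha in [2, 5, 6, 10]:
    # per the discussion above, the optimum should match 1/0.05 = 20
    # only for alpha somewhere between 5 and 6
    print(alpha, best_k(alpha, p_draws))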

I think you may mean something different from what you are saying. Cutting down the number of inspections necessarily increases the number of missed defects, and thus the rate of produced defects.

[figure: produced-defect rate as a function of inspection frequency]


@chartl when you look at the cost, C_d will include both the cost of the defective widgets we identified during inspection and the cost of unidentified defective widgets. For example, when you inspect every 2nd widget, the cost will be the cost of the identified defective widget (this comes from the trace) plus, with a certain probability, the cost of the unidentified defective widget (this could be the first widget). Similarly, if you inspect every 4th widget, the cost will be the cost of the identified defective widget (from the trace) plus, with a certain probability, the cost of the previous unidentified defective widget, plus, with a smaller probability, the cost of the one before that, and so on.
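
Under the geometric model from earlier in the thread, this can at least be checked empirically; a minimal Monte Carlo sketch (assuming, as above, that a machine keeps producing defective widgets until an inspection catches it; the values of p and k are made up):

p, k = 0.03, 10                                # failure rate; inspect every k-th widget
g = np.random.geometric(p, size=100000)        # widget index where failure starts
detected_at = k * np.ceil(g / k)               # widget index of the detecting inspection
unidentified = (detected_at - g).astype(int)   # defective widgets produced but never inspected
print(unidentified.mean())                     # expected number of unidentified defectives
print(np.bincount(unidentified, minlength=k)[:k] / len(unidentified))  # roughly uniform over 0..k-1 for small p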

How could the distribution of the unidentified defective widgets be modeled?

Can anyone help here?