How to fit a beta distribution from bin and count data?

I used a lab instrument to acquire millions of datapoints and the machine returns a summary file with the count data for each bin. I can’t generate n values for each bin based on the count data, since my laptop wouldn’t be able to hold this data in memory, and I haven’t been able to find a case where someone was able to use a library like Dask or Vaex to get around the memory issue.

So, I would like to know if it’s possible to use PyMC to fit distributions if I only have the bin and count data?

CC @ferrine I think he worked on something similar

Hi, you can use this reference implementation to build the logp for data observed as bins and counts.

You need a target distribution, let’s say

dist = pm.Beta.dist(a, b)

and your histogram data (bins + counts), which you pass to a potential:

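import pymc as pm

# histogram is assumed to hold the bin midpoints ("mid") and per-bin counts ("count")
with pm.Model() as model:
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)
    dist = pm.Beta.dist(a, b)
    # weight the log-density at each bin midpoint by that bin's count
    pm.Potential("obs", pm.logp(dist, histogram["mid"]) * histogram["count"])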
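To see why multiplying the logp by the counts works, here is a minimal sketch in plain Python (no PyMC needed). It builds a `histogram` with `"mid"` and `"count"` entries from bin edges and counts, then checks that the count-weighted log-likelihood matches the one you would get by expanding every bin into repeated values. The `edges` and `counts` numbers, the parameter values, and the helper names `beta_logpdf`, `weighted_loglik` and `expanded_loglik` are made up for illustration; your instrument's summary file will look different.

```python
import math

def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x)

# Hypothetical summary data: bin edges on [0, 1] and one count per bin.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
counts = [3, 10, 20, 12, 5]

# Each midpoint stands in for every observation that fell into its bin.
histogram = {
    "mid": [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])],
    "count": counts,
}

def weighted_loglik(a, b, hist):
    """Sum of count * logpdf(midpoint), the quantity passed to the potential."""
    return sum(c * beta_logpdf(m, a, b) for m, c in zip(hist["mid"], hist["count"]))

def expanded_loglik(a, b, hist):
    """Same likelihood the memory-hungry way: one value per observation."""
    values = [m for m, c in zip(hist["mid"], hist["count"]) for _ in range(c)]
    return sum(beta_logpdf(v, a, b) for v in values)

a, b = 2.0, 2.5  # arbitrary parameter values for the comparison
weighted = weighted_loglik(a, b, histogram)
expanded = expanded_loglik(a, b, histogram)
print(math.isclose(weighted, expanded))
```

Note that this treats every observation as if it sat exactly at its bin midpoint, so it is an approximation that gets better as the bins get narrower. The exact probability of a bin is the difference of the CDF at its two edges, which may be worth considering if your bins are wide.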