I used a lab instrument to acquire millions of data points, and the machine returns a summary file with the count for each bin. I can’t regenerate n individual values for each bin from the counts, because my laptop couldn’t hold that much data in memory, and I haven’t found a case where someone used a library like Dask or Vaex to get around the memory issue.
So, I would like to know if it’s possible to use PyMC to fit distributions if I only have the bin and count data?
Hi, you can use this reference implementation to build the logp for data observed as bins and counts.
You need a target distribution, let’s say
dist = pm.Beta.dist(a, b)
and your histogram data (bin midpoints + counts), which you pass to the potential:
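import pymc as pm

with pm.Model() as model:
    # Priors on the Beta shape parameters
    a = pm.Exponential("a", 1)
    b = pm.Exponential("b", 1)
    dist = pm.Beta.dist(a, b)
    # Log-density at each bin midpoint, weighted by that bin's count
    pm.Potential("obs", pm.logp(dist, histogram["mid"]) * histogram["count"])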
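To see why this avoids the memory problem, here is a minimal sketch in plain Python (no PyMC). The potential multiplies the log-density at each bin midpoint by that bin's count. That gives the same number as expanding every bin into repeated midpoint values and summing their log-densities, without ever materializing those values. The bin edges, counts, and function names below are made up for illustration, and the Beta log-density is written out by hand with math.lgamma. Note that this places every observation at its bin midpoint, so it is an approximation whose quality depends on how narrow the bins are.

```python
import math

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, valid for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x)

# Hypothetical histogram summary: bin edges on (0, 1) and one count per bin.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
counts = [3, 7, 12, 6, 2]
mids = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]

def weighted_loglik(a, b):
    """Count-weighted log-density at the midpoints (what the potential adds)."""
    return sum(n * beta_logpdf(m, a, b) for m, n in zip(mids, counts))

def expanded_loglik(a, b):
    """Same quantity computed the memory-hungry way, one value per observation."""
    expanded = [m for m, n in zip(mids, counts) for _ in range(n)]
    return sum(beta_logpdf(x, a, b) for x in expanded)

if __name__ == "__main__":
    print(weighted_loglik(2.0, 3.0))
    print(expanded_loglik(2.0, 3.0))
```

The two functions agree up to floating-point rounding, but weighted_loglik only touches one value per bin, so its cost scales with the number of bins rather than the number of data points.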