Extended event: gathering PyMC usage information


We have been thinking about gathering information about PyMC usage, and we’d like to have a PyMCon event to structure and advertise this effort. This initial topic is for introducing the idea and creating a space for everyone to join the discussion and help us define the initiative: what information should be gathered? What information do you struggle to share with devs?

Ideally we’d gather a corpus of models from anyone who is willing and able to share them. We’ll also need to analyze those models, so we will soon start working on a tool for that. Anyone will be able to run the analysis locally and share only the results with us (e.g. which distributions are used, which sampling methods, which of their defaults have been changed…).

Here are some of our initial ideas on information we’d like to have; please add more on the topic below!

  • Which distributions are being used and how often?
  • How big are the models? How many variables are sampled by MCMC? How many observations are there? How close are we to models that don’t fit in the RAM of common computers?
  • Which sampling functions are most common? Which defaults are most often modified?
  • Which operations are most common with PyMC’s outputs: plotting with ArviZ, saving to disk, converting to NumPy/Pandas objects…?
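
To make the first and third questions concrete, here is a minimal sketch of what a local analysis could look like. It is only an illustration, not the planned tool: it assumes the model lives in a Python script, that PyMC is called through one of the aliases `pm`, `pymc` or `pymc3`, and the function name `summarize_model_source` is hypothetical. It parses the source with the standard `ast` module, so the model is never executed and no data leaves the machine.

```python
import ast
from collections import Counter

# Assumption: PyMC is referenced through one of these names in user code.
PYMC_ALIASES = {"pm", "pymc", "pymc3"}


def summarize_model_source(source: str) -> dict:
    """Count pm.<name>(...) calls and the keyword arguments passed to pm.sample.

    Hypothetical helper: it only inspects the syntax tree, nothing is run.
    """
    calls = Counter()
    sample_kwargs = Counter()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match direct attribute calls only, e.g. pm.Normal(...) or pm.sample(...)
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id in PYMC_ALIASES
        ):
            calls[func.attr] += 1
            if func.attr == "sample":
                # kw.arg is None for **kwargs unpacking, so those are skipped
                sample_kwargs.update(kw.arg for kw in node.keywords if kw.arg)
    return {"calls": dict(calls), "sample_kwargs": dict(sample_kwargs)}


example = '''
import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu, sigma, observed=[0.1, -0.3, 0.7])
    idata = pm.sample(draws=500, target_accept=0.9)
'''

summary = summarize_model_source(example)
print(summary)
```

The result is a small dictionary of counts, which is the kind of summary someone could share instead of the model itself. Note that `calls` includes every `pm.` call (such as `Model` and `sample`), not only distributions, and that models built through helper functions or other aliases would need more work.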

Questions I have

  • How many users are using pymc3 vs pymc?
  • What are the most common packages imported alongside it?
    • How many folks use xarray operations, arviz, scipy, etc.?
  • Do people use the default sampler, specify their own, or change sampler arguments?
  • What are the most common prior parameters?
    • I know this depends on the context of the model, but I’m curious: does anything stick out for different distributions?
  • Which backend is being used?
  • Number of divergences?
  • Total sampling time?
  • Effective sample size (ESS)?
  • Scientific domains/industries where PyMC is being used
  • Types of data being studied (purely cross-sectional, purely time series, longitudinal, geospatial…)
  • Size of datasets being analyzed
  • Use of coords/named dims in models
  • Causal identification strategies (if any/applicable)
  • Repos associated with published papers?
  • Use of “basic” PyMC vs specialized sub-modules/associated projects: GP, BART, sunode, Bambi (others?)
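
The questions earlier in the thread about pymc3 vs pymc and about which packages are imported alongside it could be answered with the same kind of static analysis. The sketch below is an illustration under stated assumptions, not an agreed design: the function name `top_level_imports` is hypothetical, and it only sees imports written as plain `import` statements in the script.

```python
import ast


def top_level_imports(source: str) -> set:
    """Return the top-level package names imported anywhere in the source.

    Hypothetical helper: relative imports (from . import x) are ignored.
    """
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b.c" is counted as package "a"
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            packages.add(node.module.split(".")[0])
    return packages


example = '''
import pymc3 as pm
import arviz as az
import numpy as np
from scipy import stats
from xarray.core import dataarray
from . import helpers
'''

packages = top_level_imports(example)
# Which PyMC generation does the script use (possibly both, possibly neither)?
pymc_versions = packages & {"pymc", "pymc3"}
print(sorted(packages), sorted(pymc_versions))
```

Aggregating these sets over many shared scripts would give rough counts of pymc3 vs pymc usage and of co-imported packages. Associated projects such as Bambi would show up the same way, provided their import names are known.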