As most do these days, I run most simple problems by ChatGPT to code faster. It’s great for boilerplate stuff, but it produces dangerous hallucinations when you start asking questions that require knowledge of framework APIs.
So I created a custom GPT, which is instructed to only answer questions about PyMC. I’ve provided the GPT with ~200,000 lines of documentation (just wrote an ugly little depth-first recursive scraper and ran it on the example gallery, the “Learn” pages and API docs).
The PyMC GPT is here. Happy to get feedback. Been using it myself a little bit and it’s not perfect, but it definitely does a better job at answering PyMC questions than the default GPT4 model.
Cool! Have you considered adding the discourse to the training corpus? It’s quite easy to scrape: you can just add .json to the URLs and you get everything you need back. I think the back-and-forth nature of the dialogue would be a good fit for training an instruct model.
Is the intent here that you’ll definitely need to pay for chatGPT+ to use this?
Thanks! Didn’t know about the .json thing, will definitely do this!
Yes, I should have added that as a disclaimer. It’s based on GPT4, and I don’t think you can use that without a subscription.
OK spent 10 minutes on it and I can’t find a simple way to get a list of all topics. You know how to do that?
They’re just sequential numbers. Also the title name isn’t required. For example you can get to this thread by using https://discourse.pymc.io/t/13612. A loop over the numbers will never 404, so you can just check the errors field of the returned JSON and skip if it says “The requested URL or resource could not be found.” For example there’s no https://discourse.pymc.io/t/1.
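In case it helps, here is a rough sketch of that loop in Python. The function names (`fetch_topic_json`, `is_missing`, `collect_topics`) are made up for illustration, and the handling of HTTP error statuses is a guess on my part in case the server does not always answer with a plain 200; only the URL pattern and the errors field come from the description above.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://discourse.pymc.io/t"  # URL pattern taken from the thread


def fetch_topic_json(topic_id, base_url=BASE_URL):
    """Fetch <base_url>/<topic_id>.json and return the decoded JSON as a dict."""
    url = f"{base_url}/{topic_id}.json"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumption: if the server answers with an error status, the body
        # may still be the JSON error payload. Fall back to a synthetic
        # `errors` entry when the body is not valid JSON.
        try:
            return json.loads(err.read().decode("utf-8"))
        except ValueError:
            return {"errors": [str(err)]}


def is_missing(payload):
    """True if the payload has a non-empty `errors` field (no such topic)."""
    return bool(payload.get("errors"))


def collect_topics(topic_ids, fetch=fetch_topic_json):
    """Return {topic_id: payload} for every id that resolves to a real topic."""
    topics = {}
    for topic_id in topic_ids:
        payload = fetch(topic_id)
        if is_missing(payload):
            continue  # gap in the numbering, e.g. /t/1
        topics[topic_id] = payload
    return topics
```

Something like `collect_topics(range(13600, 13620))` would then grab a small batch. The `fetch` argument is only there so the loop can be tried offline with a stub; for a full crawl you would probably also want a pause between requests.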
Of course. Adding ArviZ docs now and this will be up soon too!
Very cool. This is trivial, but it’s funny to test: Regular ChatGPT 4 and your new GPT give very different answers when asked “what is bambi?”
@ulfaslak thanks a lot for creating this project! I think such an LLM-based assistant is very important to grow the adoption of PyMC and reduce the initial barrier. I think the project should be promoted on the PyMC website and in the docs.
I’ve had a conversation with PyMC GPT about hierarchical modelling in the battery cell manufacturing domain: https://chat.openai.com/share/c9444b2c-0923-4b4e-b29e-4b4e4a3ced3a. It looks decent to me, although the answers were often overly vague. The only blunder that I noticed is the suggestion to use Theano for vectorisation instead of PyTensor. But I’m unfamiliar with PyMC, so I’m sure there are more.
I’m glad you like it.
It’s not perfect though. I’m keeping the docs I have scraped for ingestion in another AI later and will of course update in this thread when I do that.
The main problem with GPT4 is that it doesn’t actually consume the provided knowledge to update its state; instead it relies on RAG (retrieval-augmented generation) to produce answers. Those retrievals happen when the UI displays a “Searching Knowledge” icon, and they are not really different from e.g. Bing searches.
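To make the retrieval step concrete, here is a toy sketch in Python. It is not how OpenAI implements it (real systems typically use embeddings rather than word overlap), and the sample chunks and function names are invented for illustration. The idea is only that the model never "learns" the docs; a few matching chunks get pasted into the prompt at question time.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Return up to k chunks sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(chunk)), i) for i, chunk in enumerate(chunks)]
    # Highest overlap first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in scored[:k] if score > 0]


def build_prompt(question, chunks, k=2):
    """Paste the retrieved chunks in front of the question."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Made-up stand-ins for scraped documentation chunks.
docs = [
    "PyMC uses PyTensor as its computational backend.",
    "ArviZ provides plots for posterior diagnostics.",
    "Bambi is a model-building interface built on PyMC.",
]
prompt = build_prompt("what is bambi?", docs, k=1)
```

If the retrieval misses the relevant chunk, the model answers from its general training data, which would explain things like the Theano suggestion mentioned earlier in the thread.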
Google Gemini has a 10M-token context window, which could actually fit all of these docs, but I think OpenAI will probably beef up their custom models quite soon, so stay tuned for updates.