I think the Kruschke approach/example is illustrative, but it should be avoided in practice (e.g., he never uses HMC/NUTS, which cannot sample discrete parameters such as the indicator directly). Instead, a typical approach is to marginalize out the indicator variable. With just 2 models, you can use a single continuous “mixing” parameter bounded to [0, 1] (e.g., with a Beta prior) and make your likelihood a weighted mixture of the 2 models. With more than 2 models, you’re looking at a Dirichlet-distributed set of mixing parameters.
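
To make the marginalization concrete, here is a rough sketch in plain Python of the log-likelihood you would hand to a sampler. It assumes the indicator picks one model for the whole data set, so the marginal likelihood is the weighted sum of the per-model likelihoods, computed on the log scale with log-sum-exp. The names `marginal_loglik`, `model_a`, `model_b`, the two normal models, and the data are all made up for illustration; in a real model the weights would be parameters with a Beta (2 models) or Dirichlet (more than 2) prior rather than fixed numbers.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_loglik(data, weights, model_logpdfs):
    """Log-likelihood with the model indicator summed out.

    Assumes `weights` are non-negative and sum to 1, and that the
    indicator selects a single model for the entire data set:
        log p(y) = log sum_k w_k * prod_i p_k(y_i)
    """
    terms = []
    for w, logpdf in zip(weights, model_logpdfs):
        if w == 0.0:
            continue  # a zero-weight model contributes nothing to the sum
        terms.append(math.log(w) + sum(logpdf(y) for y in data))
    return logsumexp(terms)


# Hypothetical data and candidate models, for illustration only.
data = [0.1, -0.4, 0.3, 1.2]
model_a = lambda y: normal_logpdf(y, 0.0, 1.0)
model_b = lambda y: normal_logpdf(y, 1.0, 2.0)

# Two models: one mixing weight w in [0, 1] (this is what gets the Beta prior).
w = 0.7
ll_two = marginal_loglik(data, [w, 1.0 - w], [model_a, model_b])

# More than two models: a weight vector on the simplex (Dirichlet prior).
model_c = lambda y: normal_logpdf(y, -1.0, 1.0)
ll_three = marginal_loglik(data, [0.5, 0.3, 0.2], [model_a, model_b, model_c])
```

Because the result is a smooth function of the weights and the model parameters, it is something HMC/NUTS can work with. If you instead want each observation to be able to come from a different model, apply the log-sum-exp per observation and sum the results; that is a different model, so pick whichever matches your question.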