Indeed, having two time series is a bit strange. I think you are overfitting the data. I only read the code quickly, but it looks like the ‘t’ index takes essentially one value per match for each team. The AR1() process has a high standard deviation of innovations, so it can essentially fit each match result individually and thus has no predictive power. You could either reduce the sigma or have the t variable cover a few matches at a time.
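To make the second option concrete, here is a minimal sketch of binning the time index so that each value of t covers several matches. The data frame, the column names (`team`, `goals`, `match_no`) and the block size are all made up for illustration; I am assuming your data has one row per team per match in chronological order.

```python
import pandas as pd

# Hypothetical data: one row per team per match, in chronological order.
df = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "goals": [1, 0, 2, 2, 0, 1, 3, 1],
})

matches_per_block = 2  # assumed block size, tune this for your data

# Number each team's matches 0, 1, 2, ... in order of appearance.
df["match_no"] = df.groupby("team").cumcount()

# Integer division maps consecutive matches onto the same time index,
# so the AR(1) state only changes once per block of matches.
df["t"] = df["match_no"] // matches_per_block
```

With a block size of 2, each team's first two matches share t = 0 and the next two share t = 1, which leaves the latent process fewer free values to fit to individual results.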
As an aside, you might want to look into pd.factorize(), which could be used in place of your custom codify() function.
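A quick sketch of what that looks like. I don't know exactly what your codify() returns, so I am assuming it maps labels to integer codes; the team names here are just placeholders.

```python
import pandas as pd

# Placeholder labels standing in for your team column.
teams = pd.Series(["Arsenal", "Chelsea", "Arsenal", "Leeds"])

# codes: one integer per row; uniques: the labels in order of first appearance.
codes, uniques = pd.factorize(teams)

# Map a code back to its label when you need to read the results.
first_team = uniques[codes[0]]
```

Codes are assigned in order of first appearance by default; pass sort=True if you would rather have them follow the sorted label order.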