Restrictions for dimension and variable names

While writing a pull request that adds dimension names as an alternative to numerical shapes, I wondered what restrictions we want to enforce for those names.
The relevant PR is here.
I also noticed that currently we do not have any restrictions for the names of variables. This is perfectly legal code:

import pymc3 as pm

with pm.Model():
    pm.Normal('')
    pm.Normal('\n/ ')
    pm.Normal(' ')
    # let's hope nobody has `text.usetex` enabled in matplotlib :-)
    pm.Normal(r'\immediate\write18{/usr/bin/rm -rf /}')
    trace = pm.sample()

I think this raises three questions:

  1. Can we add minimal restrictions for variable names without breaking user code?
  2. Should we have similar or stronger restrictions for dimension names when we merge that PR? Here we have no backward compatibility issues.
  3. What restrictions should we have in pymc4?

About 1:
We could restrict variable names in much the same way as Python identifiers. But since quite a few people seem to use spaces in variable names, we should at least allow spaces within the name. A regular expression like this might work well: r"^[^\d\W][\w ]*(?<=\w)\Z". You can test what it does with var = re.compile(r"^[^\d\W][\w ]*(?<=\w)\Z", re.UNICODE) and then var.match("some_var_name") is not None. That would mean: no space or digit as the first character; no special characters like \ / # $ \n \t anywhere in the name; and no whitespace at the end of the name. We could start by printing warnings for variable names that do not conform to that format, and throw an exception in a later release.
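To make the proposal concrete, here is a minimal sketch of the warn-first approach using the regular expression above. The helper name check_var_name and the choice of FutureWarning are my own assumptions for illustration, not anything that exists in PyMC.

```python
import re
import warnings

# The pattern proposed above: a non-digit word character first,
# then word characters or spaces, ending in a word character.
VAR_NAME = re.compile(r"^[^\d\W][\w ]*(?<=\w)\Z", re.UNICODE)


def check_var_name(name):
    """Return True if `name` conforms; otherwise warn and return False.

    Hypothetical helper: a later release could raise instead of warning.
    """
    ok = VAR_NAME.match(name) is not None
    if not ok:
        warnings.warn(
            "Variable name {!r} does not conform to the proposed naming "
            "rules and may be rejected in a future release.".format(name),
            FutureWarning,
            stacklevel=2,
        )
    return ok
```

With this, names such as "some_var_name" or "my var" pass, while the empty string, " ", "my var " and "1abc" trigger the warning.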

About 2:
We could just have the same restrictions as for variable names, but as this is new we could also disallow spaces entirely. That would mean that dimension names follow exactly the same rules as Python identifiers. @rpgoldman pointed out that in some cases the column names of a dataset will be obvious choices for dimension names (or even be imported half-automatically), and those often contain spaces. So maybe that is a good reason for also allowing spaces within dim names?
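The two options for dimension names can be sketched with the built-in str.isidentifier. The function is_valid_dim_name and its allow_inner_spaces flag are hypothetical names used only for this illustration, and str.isidentifier is not guaranteed to agree with the regular expression above on every Unicode character.

```python
def is_valid_dim_name(name, allow_inner_spaces=False):
    """Check a dimension name against Python identifier rules.

    Hypothetical helper. With allow_inner_spaces=True, spaces are
    accepted inside the name but not at its start or end.
    """
    if allow_inner_spaces:
        # Reject leading/trailing whitespace, then treat inner spaces
        # as if they were underscores for the identifier check.
        if name != name.strip():
            return False
        name = name.replace(" ", "_")
    return name.isidentifier()
```

Note that str.isidentifier also returns True for keywords such as "class"; whether those should be allowed as dimension names is a separate decision.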

About 3:
For pymc4 we have the opportunity to choose what we want without any regard for backward compatibility. I’d definitely disallow any special chars. That would make it possible to indicate hierarchy of parameters with # or /. But again: Do we allow spaces within names?


I am thinking about a serious response to this (and that “exploit” through matplotlib is quite creative), but first wanted to have some fun
image


Ok this is actually the future I wanted to live in😂


There is also this:
image
Unfortunately it seems to be a special character…

Also, I would check the naming restrictions and conventions in TensorFlow, as they have a name argument as well.

Oops, they don’t allow Unicode: https://stackoverflow.com/questions/49237889/what-characters-are-allowed-in-tensorflow-variable-names

Yep, just looking at the same doc… Well that’s a bummer.

Yuck. But at least we have time to think out a solution to this for PyMC4.

Suggestion: this means we might want ArviZ and other PyMC4 I/O pieces to support some kind of rewriting for pretty display, so that we can keep our beloved Greek letters.

For example, ArviZ could offer something like pretty_dim and pretty_coord that would accept Unicode.

Alternatively, we could support some kind of markup in our names that would support LaTeX, since most of us know that. We’d have to figure out an alternative to \, but we could probably do that.

Another alternative would be to keep pretty names outside and have a systematic way to name TensorFlow entities that would be predictable enough to help with debugging, but that would quarantine the TensorFlow naming restrictions.

I don’t think we have to set the name argument of tf to our variable names. That might make debugging a bit harder under some circumstances, but I don’t think it has to stop us from allowing unicode.

In that case, I suggest coming up with some predictable map from our variable names to a more limited name that is acceptable to tf, but “predictable” to help us with debugging. Also, storing the name of the corresponding TF construct in our variable structure would be useful.
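As a rough sketch of what such a predictable map could look like: keep ASCII letters and digits, and replace every other character (including the underscore, so the mapping stays unambiguous) with its code point in hex. The function name to_backend_name and the "pm_" prefix are made up for this example, and I am assuming the backend accepts ASCII letters, digits and underscores in names.

```python
import string

_PLAIN = set(string.ascii_letters + string.digits)


def to_backend_name(name, prefix="pm_"):
    """Map an arbitrary variable name to a restricted ASCII name.

    Hypothetical helper: characters outside [A-Za-z0-9] become
    "_<hex code point>_", so the result is predictable from the input.
    """
    parts = []
    for ch in name:
        if ch in _PLAIN:
            parts.append(ch)
        else:
            parts.append("_{:x}_".format(ord(ch)))
    # The prefix guarantees the result starts with an ASCII letter.
    return prefix + "".join(parts)
```

The pretty name and the mapped name could then both be stored on the variable, so that an error mentioning the backend name can be traced back to the user-facing one.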