Restrictions for dimension and variable names

aseyboldt · July 30, 2019, 6:16pm

While writing a pull request that adds dimension names as an alternative to numerical shapes, I wondered what restrictions we want to enforce for those names.
The relevant PR is here.
I also noticed that currently we do not have any restrictions for the names of variables. This is perfectly legal code:

import pymc3 as pm

with pm.Model():
    pm.Normal('')
    pm.Normal('\n/ ')
    pm.Normal(' ')
    # let's hope nobody has `text.usetex` enabled in matplotlib :-)
    pm.Normal(r'\immediate\write18{/usr/bin/rm -rf /}')
    trace = pm.sample()

I think this raises three questions:

1. Can we add minimal restrictions for variable names without breaking user code?
2. Should we have similar or stronger restrictions for dimension names when we merge that PR? Here we have no backward compatibility issues.
3. What restrictions should we have in pymc4?

About 1:
We could limit variable names in a similar way to Python identifiers. But since quite a few people seem to use spaces in variable names, we should at least allow spaces within the name. A regular expression like this might work well: r"^[^\d\W][\w ]*(?<=\w)\Z". You can test what it does with var = re.compile(r"^[^\d\W][\w ]*(?<=\w)\Z", re.UNICODE) and var.match("some_var_name") is not None. That would mean: no spaces or digits as the first character; no special characters like \ / # $ \n \t; no whitespace at the end of the variable name. We could start printing warnings for variable names that do not conform to that format, and throw an exception in a later release.
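As a rough sketch of how the warn-first, raise-later idea could look: the pattern is the one quoted above, while the helper names is_valid_name and check_name and the strict flag are made up for illustration and are not part of PyMC.

```python
import re
import warnings

# Pattern quoted in the post. In Python 3, str patterns are already
# Unicode-aware, so re.UNICODE is redundant but harmless.
VALID_NAME = re.compile(r"^[^\d\W][\w ]*(?<=\w)\Z", re.UNICODE)


def is_valid_name(name):
    """Return True if `name` matches the proposed pattern."""
    return VALID_NAME.match(name) is not None


def check_name(name, strict=False):
    """Hypothetical helper: warn on a nonconforming name, or raise if strict."""
    if is_valid_name(name):
        return name
    msg = f"Variable name {name!r} does not match the proposed naming rules."
    if strict:
        raise ValueError(msg)
    # Deprecation-style warning for now; a later release could flip `strict`.
    warnings.warn(msg, FutureWarning, stacklevel=2)
    return name


# Print which candidate names the pattern accepts.
for candidate in ["some_var_name", "group mean", "", " ", "\n/ ", "1st", "x "]:
    print(repr(candidate), is_valid_name(candidate))
```

Calling check_name(name) inside the variable constructor would keep existing models running while flagging names that a later release would reject.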

About 2:
We could just have the same restrictions as for variable names, but as this is new we could also disallow spaces entirely. That would mean that dimension names would follow the exact same rules as Python identifiers. @rpgoldman pointed out that in some cases the column names of a dataset will be obvious choices (or even semi-automatically imported) for dimension names, and they often contain spaces. So maybe that is a good reason for also allowing spaces within dim names?
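To make the two options concrete, here is a minimal sketch built on the standard str.isidentifier method. The function name is_valid_dim_name and the allow_spaces flag are hypothetical, chosen only for this illustration.

```python
def is_valid_dim_name(name, allow_spaces=False):
    """Hypothetical check for dimension names.

    Strict mode: the name must be a Python identifier.
    Relaxed mode: spaces inside the name are also accepted, but not
    leading or trailing whitespace.
    """
    if not allow_spaces:
        return name.isidentifier()
    # Reject leading/trailing whitespace, then treat spaces like underscores.
    return name == name.strip() and name.replace(" ", "_").isidentifier()


# Compare both modes on a few column-style names.
for candidate in ["county", "county name", " county", "2019", "obs_id"]:
    print(
        repr(candidate),
        is_valid_dim_name(candidate),
        is_valid_dim_name(candidate, allow_spaces=True),
    )
```

Note that str.isidentifier also returns True for keywords such as "class", so a stricter version might additionally consult keyword.iskeyword.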

About 3:
For pymc4 we have the opportunity to choose what we want without any regard for backward compatibility. I’d definitely disallow any special chars. That would make it possible to indicate hierarchy of parameters with # or /. But again: Do we allow spaces within names?

colcarroll · July 30, 2019, 6:30pm

I am thinking about a serious response to this (and that “exploit” through matplotlib is quite creative), but first wanted to have some fun

junpenglao · July 30, 2019, 7:32pm

Ok this is actually the future I wanted to live in😂

aseyboldt · July 30, 2019, 7:48pm

There is also this:

Unfortunately it seems to be a special character…

junpenglao · July 30, 2019, 8:08pm

Also, I would check the naming restrictions/conventions in TensorFlow, as they have a name argument as well.

aseyboldt · July 30, 2019, 8:12pm

Oops, they don’t allow unicode: https://stackoverflow.com/questions/49237889/what-characters-are-allowed-in-tensorflow-variable-names

junpenglao · July 30, 2019, 8:12pm

Yep, just looking at the same doc… Well that’s a bummer.

rpgoldman · July 31, 2019, 1:27pm

Yuck. But at least we have time to think out a solution to this for PyMC4.

Suggestion – this means we might want Arviz and other PyMC4 I/O pieces to support some kind of rewriting for pretty display, so that we can keep our beloved Greek letters.

For example, Arviz could support pretty_dim and pretty_coord that would support unicode.

Alternatively, we could support some kind of markup in our names that would support LaTeX, since most of us know that. We’d have to figure out an alternative to \, but we could probably do that.

Another alternative would be to keep pretty names outside and have a systematic way to name TensorFlow entities that would be predictable enough to help with debugging, but that would quarantine the TensorFlow naming restrictions.

aseyboldt · July 31, 2019, 1:49pm

I don’t think we have to set the name argument of tf to our variable names. That might make debugging a bit harder under some circumstances, but I don’t think it has to stop us from allowing unicode.

rpgoldman · July 31, 2019, 2:04pm

In that case, I suggest coming up with some predictable map from our variable names to a more limited name that is acceptable to tf, but “predictable” to help us with debugging. Also, storing the name of the corresponding TF construct in our variable structure would be useful.
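One possible shape for such a map, as a sketch only: every character outside ASCII letters and digits is replaced by an escape of the form _x<hex codepoint>_, which keeps the mapping reversible. The allowed-name pattern below is an assumption about what the backend accepts (along the lines of the Stack Overflow question linked above), and to_backend_name, from_backend_name, NamedVariable, and the "pm_" prefix are all hypothetical names.

```python
import re
import string
from dataclasses import dataclass

# ASSUMPTION: the backend accepts names matching this ASCII-only pattern.
# Verify against the TensorFlow version in use before relying on it.
ASSUMED_BACKEND_NAME = re.compile(r"^[A-Za-z0-9.][A-Za-z0-9_.\-/]*$")

PREFIX = "pm_"  # hypothetical fixed prefix so the result starts with a letter
ASCII_ALNUM = set(string.ascii_letters + string.digits)


def to_backend_name(name):
    """Map any string to a restricted ASCII name, reversibly."""
    parts = []
    for ch in name:
        if ch in ASCII_ALNUM:
            parts.append(ch)
        else:
            # Escape everything else, including "_" itself, so that every
            # underscore in the output body belongs to an escape sequence.
            parts.append(f"_x{ord(ch):x}_")
    return PREFIX + "".join(parts)


def from_backend_name(backend_name):
    """Invert to_backend_name."""
    if not backend_name.startswith(PREFIX):
        raise ValueError(f"{backend_name!r} was not produced by to_backend_name")
    body = backend_name[len(PREFIX):]
    return re.sub(r"_x([0-9a-f]+)_", lambda m: chr(int(m.group(1), 16)), body)


@dataclass(frozen=True)
class NamedVariable:
    """Hypothetical record keeping the display name and the backend name."""
    name: str
    backend_name: str


def make_variable(name):
    return NamedVariable(name=name, backend_name=to_backend_name(name))


# Show the mapping for a name with a space and an underscore.
var = make_variable("group mean_1")
print(var.backend_name, "->", repr(from_backend_name(var.backend_name)))
```

Because the escape is deterministic and invertible, a backend name seen in a debugger or error message can be translated back to the user-facing name without a lookup table.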
