
What to do when unescapable character(s) are escaped?

When designing a (mini)language in which certain characters must be escaped to lose their special meaning (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters that never have a special meaning) are escaped anyway? Should an error be raised, should the character be discarded, or should it appear in the output as if it had not been escaped?

Example: In a simple language where strings are delimited by double quotes ( " ), and any quotes inside a string are escaped with a backslash ( \ ): for the input "We \said, \"We want Moshiach Now\"", what should be done with the escaped letter s in said?

I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.

Assume that you initially decide to treat "backslash not-an-escape" as that literal pair of characters, and that "T" is not an escape today. Sometime later you decide to extend the language so that "\T" means something special, and you change the language accordingly.

You'll find an angry mob of programmers storming your design castle, because for them "\T" means "\" "T" (or just "T", depending on your default decision), and you just broke their code. You hang your head in shame, retract the decision, and then realize... oops, there are no more available escape characters!

This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.

If your language isn't going to be successful, you may not care as much.
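The strict approach argued for above can be sketched roughly as follows. This is only an illustration for the toy language in the question: the names `LexError`, `ESCAPES`, and `read_string` are hypothetical, and the escape table (quote and backslash only) is an assumption about what that language would define.

```python
class LexError(ValueError):
    """Raised when a string literal is malformed or uses an undefined escape."""


# Assumed escape table for the toy language: only the quote and the backslash.
ESCAPES = {'"': '"', "\\": "\\"}


def read_string(source: str) -> str:
    """Decode one double-quoted literal, rejecting any undefined escape."""
    if not source.startswith('"'):
        raise LexError("literal must start with a double quote")
    out = []
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == '"':
            if i != len(source) - 1:
                raise LexError(f"unexpected text after closing quote at offset {i}")
            return "".join(out)
        if ch == "\\":
            if i + 1 >= len(source):
                raise LexError("dangling backslash at end of input")
            nxt = source[i + 1]
            if nxt not in ESCAPES:
                # Anything not explicitly legal is rejected, so the sequence
                # stays free for a future language extension.
                raise LexError(f"undefined escape \\{nxt} at offset {i}")
            out.append(ESCAPES[nxt])
            i += 2
            continue
        out.append(ch)
        i += 1
    raise LexError("unterminated string literal")


print(read_string(r'"We said, \"We want Moshiach Now\""'))
try:
    read_string(r'"We \said, \"We want Moshiach Now\""')
except LexError as exc:
    print("rejected:", exc)
```

Because `\s` is rejected today, giving it a meaning later cannot change the behavior of any program that was previously accepted.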

Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:

>>> print("a\tb")
a   b
>>> print("a\tb\Rc")
a   b\Rc
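As a hedged aside, recent Python 3 releases still produce this value but also emit a warning for the unrecognized escape (a `DeprecationWarning` since 3.6, a `SyntaxWarning` since 3.12), and a future release may reject it outright. A minimal sketch that compiles the literal at run time so the warning can be captured; the variable names are illustrative only:

```python
import warnings

# Source text of a literal that contains the undefined escape \R.
literal_source = r'"a\tb\Rc"'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is filtered out
    value = eval(compile(literal_source, "<example>", "eval"))

# The backslash before R survives as an ordinary character.
print(repr(value))  # 'a\tb\\Rc'

# Which warning category appears depends on the interpreter version.
for w in caught:
    print(w.category.__name__, w.message)
```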

Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).

So on the one hand, you can accept only a limited set of escaped characters. In that sense, you have an "escape sequence" rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That is about the safest mechanism, and it's not really burdensome to write.

The other option is to ensure that you are "canonicalizing" everything you compare, through some ruleset. This typically means resolving all of the escape sequences properly up front and comparing only the final values rather than the literals.
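To make the canonicalization option concrete, here is a small sketch under the lenient "take the next character verbatim" rule. The names `canonicalize`, `BLOCKED`, `is_blocked_naive`, and `is_blocked` are hypothetical, and the block list is just an example of a comparison that goes wrong when literals are compared instead of values.

```python
def canonicalize(text: str) -> str:
    """Resolve every backslash escape: a backslash means
    'take the next character verbatim'."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # A trailing backslash has nothing to escape; keep it literally.
            ch = next(chars, "\\")
        out.append(ch)
    return "".join(out)


BLOCKED = {"admin"}  # example block list, for illustration only


def is_blocked_naive(name: str) -> bool:
    """Compares the literal text: an escaped spelling slips through."""
    return name in BLOCKED


def is_blocked(name: str) -> bool:
    """Compares the canonical value, so every spelling is treated alike."""
    return canonicalize(name) in BLOCKED


print(is_blocked_naive(r"adm\in"))  # False
print(is_blocked(r"adm\in"))        # True
```

The important design point is that the comparison and the eventual consumer of the string must apply the same unescaping rule; otherwise the check validates one value while the consumer uses another.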

Most systems interpret the backslash as Will Hartung says, except before alphanumerics, which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means whitespace in Perl 5 style regexes. JavaScript, which interprets it as 's' in one context (string literals) and as whitespace in another (regex literals), suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar'): the first matches "foo bar", while the second matches only "foosbar".


 