
Escape sequences vs predefined character classes (aka special regex characters) when encapsulated by double quotes

Perl, like Java and Python, has \s, the special regex character that matches whitespace, in addition to other special characters.

In Perl, the following would not be valid:

use feature 'say';

my $sentence = "The End";
my $subStr = "\s"; # Does NOT work: "\s" becomes plain "s"; needs to be "\\s" or '\s'

if ($sentence =~ /$subStr/)
{
    say "True";
}

In Java, this would be valid:

String s = "The End";

if (s.matches(".*\\s.*")) //same deal as with Perl ("\\s")
{
    System.out.println("True");
}

In Python, one could use either "\s" or '\s'.

Both Java and Perl seem to treat special regex characters encapsulated by "" the same. I looked up Predefined Character Classes (Java), and it simply said: "If you are using an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile."

Why do both Java and Perl treat escape sequences differently than special regex characters (when they're both encapsulated by ""), yet Python doesn't?

As in, why did the designers make the choice for escape sequences, like \n or \t, to require one backslash, but for predefined character classes, like \s, to require two (while in "")?

Is this a consequence of something else? Or does it in some way simplify some sort of interaction(s) or what have you?

I'm going to assume that it wasn't arbitrary. Python only requires \ either way, yet Perl and Java mandate \\ when dealing with "". Besides being a little confusing, it's just messy. So, I assume that there's a good reason for this decision. Anyone know why?

Java, Perl, and Python all use C-style backslashes for escapes. Regex also uses C-style backslashes for escapes. This leads to problems in all three languages—and, in fact, for many, many other languages.

For example, all three languages will convert '\\' into a single backslash, '\n' into a newline, etc., before the string can get to the regex compiler.
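Here is a minimal Python sketch of that two-stage process, with variable names that are purely illustrative. The string literal is processed first, and the regex compiler only ever sees what is left over:

```python
import re

# Stage 1: the string-literal parser handles backslash escapes.
one_backslash = '\\'   # two characters in the source, one in the string
newline = '\n'         # two characters in the source, one newline character

print(len(one_backslash))  # 1
print(len(newline))        # 1

# Stage 2: the regex compiler receives the already-processed string.
# '\\s' in the source is the two-character string \s,
# which the regex engine then reads as "any whitespace".
pattern = '\\s'
print(len(pattern))                         # 2
print(bool(re.search(pattern, 'The End')))  # True
```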

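The only difference is what happens to unknown escape sequences. In Python, one like '\s' resolves to itself (backslash included), while in Perl it resolves to just 's'. So, in Python, while you need '\\n' to get a backslash followed by n, you don't need '\\s', while in Perl you need to escape the backslash for both.

And there are languages that make the third choice, treating unknown escape sequences as errors. Java is actually one of them: an unrecognized escape in a string literal is a compile-time error (and since Java 15, "\s" is a recognized escape for a space), so there you have to escape the backslash as well.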
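To see Python's handling of an unknown escape without typing one directly into a script, here is a small sketch that evaluates literals given as source text. The helper name literal_value is made up for this example. Recent Python versions warn about unrecognized escapes, and the documentation says they will eventually become a syntax error, so the sketch silences warnings and should be read as describing current behavior only:

```python
import ast
import warnings

def literal_value(source: str) -> str:
    """Evaluate a string literal given as source text (hypothetical helper).

    Warnings are silenced because newer Python versions warn about
    unrecognized escape sequences such as \\s.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)

unknown = literal_value(r"'\s'")  # source text is: '\s'
known = literal_value(r"'\n'")    # source text is: '\n'

print(len(unknown))  # 2, the backslash survives
print(len(known))    # 1, collapsed to a newline
```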


So, if you have the list of known escapes memorized, you can sometimes get away with not escaping backslashes in Python. But you really shouldn't.

Why not? Because, even if you're absolutely sure you've memorized the escape sequences, do you really want to make that a requirement for anyone who wants to read (or maintain) your code? When I see "abc\\sdef" or r"abc\sdef", I immediately know exactly what it means. When I see unescaped "abc\sdef", I think I know, but I may be wrong, and I have to go look it up or try it in the interpreter to find out.


The right thing to do is to escape your backslashes, or use the appropriate raw-string or regex-literal syntax for your language.
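In Python, for instance, the two unambiguous spellings produce the same string. A short sketch, with illustrative variable names:

```python
import re

escaped = "abc\\sdef"  # backslash escaped in an ordinary string
raw = r"abc\sdef"      # raw string: backslashes are left alone

print(escaped == raw)                        # True
print(bool(re.fullmatch(raw, "abc def")))    # True, \s matches the space
print(bool(re.fullmatch(raw, "abc\\sdef")))  # False, a literal backslash is not whitespace
```

Either spelling tells the reader right away that the backslash is meant for the regex engine, not for the string parser.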


If you're wondering why Python made a different design choice for unknown escapes from Perl and Java… As far as I know, that's not covered in the official Design FAQ and hasn't been directly addressed by Guido. But I can guess. In general, Perl went with maximal compatibility with C (and Java with C++) as a high priority in many areas, where Python put more priority on what made more intuitive sense to a programming teacher. This is probably one of those areas. (I suspect if Python were re-designed from scratch today, or even way back when raw strings were added, it would go with the error.)
