
Backslashes and escaping chars in Python vs Perl regexes

The goal is to handle tokenization for NLP by porting a tokenizer script from Perl to Python.

The main issue is the erroneous backslashes that appear when we run the Python port of the tokenizer.

In Perl, we might escape the single quote and the ampersand like this:

my($text) = @_; # Reading a text from stdin

$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize English contractions like "don't" -> "do n't".

$text =~ s/\'/\&apos;/g;  # Escape the single quote so that it suits XML.

Porting the regex literally into Python:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n\&apos;t funny

Escaping the ampersand somehow left a literal backslash in the output =(

To resolve that, I could do:

>>> escape_singquote = r"\'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny

But seemingly, without escaping the single quote in the Python pattern, we get the desired result too:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny

Now that's puzzling...

Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Regexes in Perl and Python are not that equivalent, right?

In both Perl and Python, you have to escape the following regex metacharacters if you want to match them literally outside of a character class [1]:

{}[]()^$.|*+?\
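As a small illustration of the list above on the Python side, here is a sketch showing what happens with and without escaping, and how `re.escape` adds the backslashes for you. The sample string is made up for this example.

```python
import re

# Hypothetical sample text containing several metacharacters.
sample = "cost: $5.00 (approx.)"

# Unescaped, "$" anchors at the end of the string, so this pattern
# cannot match the literal text "$5.00".
unescaped_match = re.search(r"$5.00", sample)
assert unescaped_match is None

# Escaping the metacharacters by hand matches the literal text.
manual_match = re.search(r"\$5\.00", sample)
assert manual_match.group() == "$5.00"

# re.escape() builds an equivalent pattern automatically.
escaped = re.escape("$5.00")
auto_match = re.search(escaped, sample)
assert auto_match.group() == "$5.00"
```

`re.escape` is mostly useful when the literal text comes from user input or another variable rather than being typed into the pattern by hand.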

Inside a character class, you have to escape metacharacters according to these rules [2]:

     Perl                          Python
-------------------------------------------------------------
-    unless at beginning or end    unless at beginning or end
]    always                        unless at beginning
\    always                        always
^    only if at beginning          only if at beginning
$    always                        never
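The Python column of the table can be checked with a short script. This is only a sketch: the patterns and subject strings are invented for illustration, and it does not cover the Perl column.

```python
import re

# Each entry: (pattern, subject, expected matches). The subjects are made up.
cases = [
    (r"[-a]", "a-b", ["a", "-"]),    # "-" first: literal
    (r"[a-]", "a-b", ["a", "-"]),    # "-" last: literal
    (r"[]a]", "a]b", ["a", "]"]),    # "]" first: literal
    (r"[a^]", "a^b", ["a", "^"]),    # "^" not first: literal
    (r"[^a]", "a^b", ["^", "b"]),    # "^" first: negation
    (r"[$]", "a$b", ["$"]),          # "$" is never special here
    (r"[\\]", "a\\b", ["\\"]),       # "\" must always be escaped
]

results = [re.findall(pattern, subject) for pattern, subject, _ in cases]
for (pattern, subject, expected), found in zip(cases, results):
    assert found == expected, (pattern, found)
```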

Note that neither the single quote ' nor the ampersand & needs to be escaped, whether inside or outside a character class.

However, both Perl and Python will ignore the backslash if you use it to escape a punctuation character that isn't a metacharacter (e.g. \' is equivalent to ' inside a regex).
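A quick way to confirm this in Python is to compare the matches with and without the backslash. The subject strings here are just examples, and the check applies to the pattern only, not to the replacement argument of `re.sub`.

```python
import re

text = "this ain't funny"

# In a pattern, \' and ' both match a plain single quote.
with_backslash = re.findall(r"\'", text)
without_backslash = re.findall(r"'", text)
assert with_backslash == without_backslash == ["'"]

# The same holds for the ampersand.
amp_with_backslash = re.findall(r"\&", "R&D")
amp_without_backslash = re.findall(r"&", "R&D")
assert amp_with_backslash == amp_without_backslash == ["&"]
```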


You seem to be getting tripped up by Python's raw strings:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

r"\'" is the string \' (literal backslash, literal single quote), while r'\&apos;' is the string \&apos; (literal backslash, literal ampersand, etc.).

So this:

re.sub(r"\'", r'\&apos;', text)

replaces all single quotes with the literal text \&apos;.
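The following sketch makes the stray backslash visible. It assumes the XML entity `&apos;` as the intended replacement text and Python 3's `re` module, where an unknown escape such as `\&` in the replacement is left alone rather than stripped.

```python
import re

# A raw string keeps every backslash it contains.
pattern = r"\'"
replacement = r"\&apos;"
assert pattern == "\\'" and len(pattern) == 2
assert replacement == "\\&apos;"

# Without the r prefix, "\'" is just a one-character string.
assert "\'" == "'"

# The pattern matches a plain quote, but the replacement keeps its backslash.
result = re.sub(pattern, replacement, "this ain't funny")
assert result == "this ain\\&apos;t funny"
print(result)
```

So the backslash in the pattern is harmless, while the backslash in the replacement ends up in the output.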


Putting it all together, your Perl substitution is better written:

$text =~ s/'/&apos;/g;

And your Python substitution is better written:

re.sub(r"'", r'&apos;', text)
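For completeness, here is a minimal sketch of the two substitutions from the question with no unnecessary backslashes. The function name `tokenize_and_escape` is hypothetical and not part of either tokenizer script, and `&apos;` is assumed to be the intended XML entity.

```python
import re


def tokenize_and_escape(text):
    """Hypothetical helper: split "n't" contractions, then XML-escape quotes."""
    # Pad a space on the left of "n't" so "ain't" becomes "ai n't".
    text = re.sub(r"n't", " n't", text)
    # Neither the quote in the pattern nor the ampersand in the
    # replacement needs a backslash.
    text = re.sub(r"'", "&apos;", text)
    return text


output = tokenize_and_escape("this ain't funny")
assert output == "this ai n&apos;t funny"
print(output)
```

Since the second step replaces one fixed string with another, `text.replace("'", "&apos;")` would do the same job without a regex.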

  1. Python 2, Python 3, and current versions of Perl treat unescaped curly braces as literal curly braces if they aren't part of a quantifier. However, this will be a syntax error in future versions of Perl, and recent versions of Perl give a warning.

  2. See perlretut, perlre, and the Python docs for the re module.
