
Use of \r (carriage return) in Python regex

I'm trying to use a regex to match every character between a string and a \r character:

text = 'Some text\rText to find !\r other text\r'

I want to match 'Text to find !'. I already tried:

re.search(r'Some text\r(.*)\r', text).group(1)

But it gives me: 'Text to find !\r other text'

It's surprising because it works perfectly when replacing \r with \n:

re.search(r'Some text\n(.*)\n', 'Some text\nText to find !\n other text\n').group(1)

returns 'Text to find !'

Do you know why it behaves differently when we use \r and \n?

That is correct and expected behavior: by default, . in Python re excludes only LF (\n) chars, so it still matches CR (carriage return) chars.

See the re documentation:

.
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

You can easily check that with the following code:

import re
unicode_lbr = '\n\v\f\r\u0085\u2028\u2029'
print( re.findall(r'.+', f'abc{unicode_lbr}def') )
# => ['abc', '\x0b\x0c\r\x85\u2028\u2029def']

To match between two carriage return chars, you need to use a negated character class:

r'Some text\r([^\r]*)\r'
r'Some text\r([^\r]*)'   # if the trailing CR char does not have to exist
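
As a quick check, both patterns above can be run against the sample string from the question. This is only a minimal sketch; the variable names strict and loose are made up for illustration.

```python
import re

text = 'Some text\rText to find !\r other text\r'

# [^\r]* cannot consume a CR, so the capture ends at the first one.
strict = re.search(r'Some text\r([^\r]*)\r', text)
print(repr(strict.group(1)))  # 'Text to find !'

# Without the trailing \r, the pattern also matches input that ends
# before another CR appears.
loose = re.search(r'Some text\r([^\r]*)', 'Some text\rText to find !')
print(repr(loose.group(1)))  # 'Text to find !'
```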

If you want to match between the leftmost and rightmost \r chars (the outer CR chars), including any chars in between, you can use a plain .* with re.DOTALL:

re.search(r'(?s)Some text\r(.*)\r', text)
re.search(r'Some text\r(.*)\r', text, re.DOTALL)

where (?s) is an inline modifier equal to re.DOTALL / re.S.

.* is greedy by nature, so it finds the longest match available in:
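
The flag only changes the result when an LF sits between the outer CR chars. The sketch below uses a made-up sample string (not from the question) to show the difference; sample and the result names are assumptions for illustration.

```python
import re

# Hypothetical input: an LF appears between the two outer CR chars.
sample = 'Some text\rline one\nline two\r'
pattern = r'Some text\r(.*)\r'

# Default mode: . cannot cross the LF, so no match is found.
without_dotall = re.search(pattern, sample)
print(without_dotall)  # None

# With re.DOTALL, . also matches LF and the capture spans both lines.
with_dotall = re.search(pattern, sample, re.DOTALL)
print(repr(with_dotall.group(1)))  # 'line one\nline two'

# The inline form (?s) at the start of the pattern behaves the same way.
inline = re.search(r'(?s)' + pattern, sample)
print(inline.group(1) == with_dotall.group(1))  # True
```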

r'Some text\r(.*)\r'

Hence giving you:

re.findall(r'Some text\r(.*)\r', 'Some text\rText to find !\r other text\r')
['Text to find !\r other text']

However, if you change to a non-greedy quantifier, it gives the expected result:

re.findall(r'Some text\r(.*?)\r', 'Some text\rText to find !\r other text\r')
['Text to find !']

The reason re.findall(r'Some text\n(.*)\n', 'Some text\nText to find !\n other text\n') gives just ['Text to find !'] is that the dot matches any character except a line break, and in Python re only \n counts as a line break. If you enable DOTALL (or use [\s\S], which matches any character), it will again find the longest match:

>>> re.findall(r'Some text\n([\s\S]*)\n', 'Some text\nText to find !\n other text\n')
['Text to find !\n other text']

>>> re.findall(r'(?s)Some text\n(.*)\n', 'Some text\nText to find !\n other text\n')
['Text to find !\n other text']

This again changes when you use a non-greedy quantifier:

re.findall(r'(?s)Some text\n(.*?)\n', 'Some text\nText to find !\n other text\n')
['Text to find !']
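
Putting the pieces together, a small helper can hide the difference between \r and \n. This is a sketch only: first_between is a hypothetical name, not part of re, and it assumes you always want the text up to the first following delimiter.

```python
import re

def first_between(prefix, delimiter, text):
    """Return the text between prefix+delimiter and the next delimiter, or None."""
    # re.escape keeps prefix and delimiter literal. The lazy (.*?) combined
    # with re.DOTALL stops at the first delimiter, whichever character it is.
    pattern = re.escape(prefix) + re.escape(delimiter) + r'(.*?)' + re.escape(delimiter)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

cr_result = first_between('Some text', '\r', 'Some text\rText to find !\r other text\r')
lf_result = first_between('Some text', '\n', 'Some text\nText to find !\n other text\n')
print(repr(cr_result))  # 'Text to find !'
print(repr(lf_result))  # 'Text to find !'
```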
