Clarification on Python regexes and findall()

I came across this problem as I was working on the Python Challenge. Number 10 to be exact. I decided to try and solve it using regexes - pulling out the repeating sequences, counting their length, and building the next item in the sequence off of that.

So the regex I developed was: '(\d)\1*'

It worked well on the online regex tester, but when using it in my script it didn't perform the same:

import re

regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)

> ['1', '1', '1', '1', '2', '2', '2',...]

And so on and so forth. So I learned about the raw string notation used with the re module in Python. Which is my first question: can someone please explain what exactly this does? The docs describe it as reducing the need to escape backslashes, but it doesn't appear to be required for simpler regexes such as \d+, and I don't understand why.

So I changed my regex to r'(\d)\1*' and tried using findall() to make a list of the sequences. And I get:

> ['1', '2', '3']

Very confused again. I still don't understand this. Help please?

I decided to do this to get around it:

[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']

And got what I'd been looking for. Then, based off of this thread, I tried findall() with a group added around the whole regex -> r'((\d)\2*)'. I ended up getting:

> [('1111', '1'), ('2222', '2'), ('3333', '3')]

At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.

Also, this is my first time posting so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!

Since this is the challenge I won't give you a complete answer. You are on the right track, however.

The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
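To make the .group(0) versus .group(1) distinction concrete, here is a small sketch that reuses the pattern and sample text from the question. The variable names whole and first are only illustrative.

```python
import re

regex = re.compile(r'(\d)\1*')
text = '111122223333'

# group(0) is the entire match; group(1) is only what the first
# parenthesised group captured (the single leading digit).
whole = [m.group(0) for m in regex.finditer(text)]
first = [m.group(1) for m in regex.finditer(text)]

print(whole)  # ['1111', '2222', '3333']
print(first)  # ['1', '2', '3']
```

The second list is what findall() returned for the same pattern in the question.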

As for the \d escape: because that particular combination has no meaning as a Python string escape sequence, Python leaves it as a backslash followed by the letter d. It would indeed be better to use the r'' raw string literal format, as it prevents nasty surprises when you do want to use a regular expression escape that also happens to be an escape sequence Python does recognize; \1, for example, is read as an octal escape in a normal string, which is why your first attempt behaved differently. See the Python documentation on string literals for more information.
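A rough way to see the difference for yourself, using the text from the question. The non-raw pattern below spells the \d part as '\\d' only to avoid the invalid-escape warning that newer Python versions emit; it contains the same characters as the original '(\d)\1*'. The names non_raw and raw are just for illustration.

```python
import re

text = '111122223333'

# In a normal string literal, '\1' is an octal escape: a single
# character with code point 1. In a raw string it stays two characters.
print(len('\1'), len(r'\1'))  # 1 2

non_raw = '(\\d)\1*'  # the regex engine sees: (\d) followed by chr(1)*
raw = r'(\d)\1*'      # the regex engine sees: (\d) followed by backreference \1*

print(non_raw == '(\\d)\x01*')     # True
print(re.findall(non_raw, text))   # every digit on its own
print(re.findall(raw, text))       # ['1', '2', '3']
```

So the first attempt never contained a backreference at all; it matched one digit followed by zero or more chr(1) characters.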

Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match as you have 2 groups in your pattern: the outer group, matching the whole of (\d)\2*, and the inner group, matching \d. From the .findall() documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
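As a sketch of that rule, here is findall() on the same text with zero, one, and two groups. The \d{4} pattern is not from the question; it is only a stand-in for a pattern with no groups, chosen because the sample runs happen to be four digits long.

```python
import re

text = '111122223333'

# No groups: findall() returns the whole matches.
no_groups = re.findall(r'\d{4}', text)
# One group: findall() returns what that group captured.
one_group = re.findall(r'(\d)\1*', text)
# Two groups: findall() returns one tuple per match, one item per group.
two_groups = re.findall(r'((\d)\2*)', text)

print(no_groups)   # ['1111', '2222', '3333']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]

# The outer group is the first item of each tuple.
runs = [outer for outer, digit in two_groups]
print(runs)        # ['1111', '2222', '3333']
```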
