Clarifying Python regular expressions and findall()

Question

I came across this problem as I was working on the Python Challenge, number 10 to be exact. I decided to try and solve it using regexes: pulling out the repeating sequences, counting their length, and building the next item in the sequence from that.

So the regex I developed was: '(\d)\1*'

It worked well on the online regex tester, but when I used it in my script it didn't behave the same way:

import re

regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)

> ['1', '1', '1', '1', '2', '2', '2',...]

And so on and so forth. So I learned about the raw string notation mentioned in the docs for Python's re module. Which is my first question: can someone please explain what exactly it does? The docs describe it as reducing the need to escape backslashes, but it doesn't appear to be required for simpler regexes such as \d+, and I don't understand why.

So I changed my regex to r'(\d)\1*' and tried using findall() to make a list of the sequences. And I get:

> ['1', '2', '3']

Very confused again. I still don't understand this. Help please?

I decided to do this to get around it:

[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']

And got what I'd been looking for. Then, based on this thread, I tried findall() with a group around the whole regex, r'((\d)\2*)'. I end up getting:

> [('1111', '1'), ('2222', '2'), ('3333', '3')]

At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.

Also, this is my first time posting, so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!

Answer 1

Since this is a challenge, I won't give you a complete answer. You are on the right track, however.

The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
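To make that difference concrete, here is a small sketch using the pattern and text from the question. The variable names are only illustrative.

```python
import re

pattern = re.compile(r'(\d)\1*')
text = '111122223333'

# group(0) is the entire match; group(1) is only what the first
# parenthesised group captured (here, a single digit).
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(pairs)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]
```

The backreference \1 extends the match, but it does not extend what group 1 itself captured.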

As for the \d escape sequence: because that particular combination has no meaning as a Python string escape, Python leaves it alone as a backslash followed by the letter d. By contrast, \1 is an escape that Python does recognize (an octal escape for the character with code 1), which is why your non-raw pattern lost its backreference. It would indeed be better to use the r'' raw string literal format, as it prevents nasty surprises when you want to use a regular expression sequence that also happens to be an escape sequence Python recognizes. See the Python documentation on string literals for more information.
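The effect on the question's pattern can be checked directly. This is a minimal sketch: the name cooked is made up for the non-raw version, and it spells the digit class as '\\d' so that only the \1 part differs from the raw pattern.

```python
import re

raw = r'(\d)\1*'      # the regex engine sees backslash-1, a backreference
cooked = '(\\d)\1*'   # Python turns '\1' into the single character chr(1)

print(len(raw), len(cooked))  # 7 6
print('\1' == chr(1))         # True

text = '111122223333'
raw_matches = [m.group() for m in re.finditer(raw, text)]
cooked_matches = [m.group() for m in re.finditer(cooked, text)]
print(raw_matches)     # ['1111', '2222', '3333']
print(cooked_matches)  # twelve single-digit strings
```

Without the raw prefix, the pattern means "a digit followed by any number of chr(1) characters", so every digit matches on its own.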

Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match because you have 2 groups in your pattern: the outer group matching the whole of (\d)\2*, and the inner group matching \d. From the .findall() documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
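The three cases that sentence covers can be compared side by side. The following is a rough sketch on the question's text; the first pattern, r'\d\d', is an extra example added only to show the no-group case.

```python
import re

text = '111122223333'

# No groups: findall returns the whole matches.
no_groups = re.findall(r'\d\d', text)
# One group: findall returns only what that group captured.
one_group = re.findall(r'(\d)\1*', text)
# Two groups: findall returns one (group 1, group 2) tuple per match.
two_groups = re.findall(r'((\d)\2*)', text)

print(no_groups)   # ['11', '11', '22', '22', '33', '33']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]

# Keeping only the outer group recovers the full runs.
runs = [whole for whole, digit in two_groups]
print(runs)        # ['1111', '2222', '3333']
```

A non-capturing group would not help here, since the backreference needs a captured group to refer to. Picking the first tuple element, or using finditer with .group(), are the straightforward ways to get the full runs.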

Solution 1
1 accepted 2012-07-23 16:21:53
