Explanation of the re.findall() method in Python

Question

I wanted to strip a string of punctuation marks and I ended up using

re.findall(r"[\w]+|[^\s\w]", text)

It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library and it says:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

I am not sure if I get this, and the explanation sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?

Answer 1

To break it down, [] creates a character class. You'll often see things like [abc], which will match a, b or c. Conversely, you might also see [^abc], which will match any single character that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges and it will match any of a, b, c, A, B, C.
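The three kinds of character class described above can be tried out directly. This is only a small sketch; the sample string and variable names are made up for illustration.

```python
import re

sample = "abcABCxyz"  # arbitrary demo string

# [abc] matches one character that is a, b or c
in_class = re.findall(r"[abc]", sample)

# [^abc] matches one character that is NOT a, b or c
not_in_class = re.findall(r"[^abc]", sample)

# [a-cA-C] combines two ranges: a..c and A..C
in_ranges = re.findall(r"[a-cA-C]", sample)

print(in_class)      # ['a', 'b', 'c']
print(not_in_class)  # ['A', 'B', 'C', 'x', 'y', 'z']
print(in_ranges)     # ['a', 'b', 'c', 'A', 'B', 'C']
```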

In this case, your character class contains special tokens: \w and \s. \w matches any word character: letters, digits and the underscore. Exactly what counts depends on the flags in effect (re.LOCALE, re.UNICODE, re.ASCII) and on the Python version; in Python 3, str patterns match Unicode word characters by default. For plain ASCII text it is the same as [a-zA-Z0-9_], which matches anything in the ranges a-z, A-Z, 0-9, or _. \s is similar, but it matches anything that can be considered whitespace.
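A quick way to see what \w and \s pick up, and how re.ASCII narrows \w, is sketched below. It assumes Python 3, where str patterns are Unicode-aware by default; the sample strings are invented for the demo.

```python
import re

# \w: one word character (letter, digit or underscore)
word_chars = re.findall(r"\w", "a_1 !")

# \s: one whitespace character (space, tab, newline, ...)
spaces = re.findall(r"\s", "hi there\tok\n")

# In Python 3, \w on a str pattern also matches non-ASCII letters
unicode_default = re.findall(r"\w", "n\u00e9")

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_only = re.findall(r"\w", "n\u00e9", flags=re.ASCII)

print(word_chars)       # ['a', '_', '1']
print(spaces)           # [' ', '\t', '\n']
print(unicode_default)  # ['n', 'é']
print(ascii_only)       # ['n']
```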

The + means that the preceding item can repeat 1 or more times, so [a]+ will match the entire string aaaaaaaaaaa. In your case, [\w]+ matches a run of word characters that are next to each other.
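To see the effect of +, compare the same class with and without it. Again, this is just an illustrative sketch with made-up inputs.

```python
import re

# Without +, each match is a single character
single = re.findall(r"[a]", "aaaa")

# With +, adjacent matching characters are consumed as one match
run = re.findall(r"[a]+", "aaaa")

# [\w]+ therefore yields whole "words" (runs of word characters)
words = re.findall(r"[\w]+", "foo bar_1, baz")

print(single)  # ['a', 'a', 'a', 'a']
print(run)     # ['aaaa']
print(words)   # ['foo', 'bar_1', 'baz']
```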

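The | is basically like "or": match the stuff on the left, or match the stuff on the right if the left stuff doesn't match. So at each position your pattern matches either a run of word characters, or a single character that is neither whitespace nor a word character, such as a punctuation mark.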
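The "left first, then right" behaviour of | is easiest to see when the two alternatives overlap. The patterns below are toy examples chosen only to show the ordering.

```python
import re

# The left alternative is tried first at each position
left_first = re.findall(r"ab|a", "ab a")

# Swapping the order changes the result: 'a' wins before 'ab' is tried
right_shadowed = re.findall(r"a|ab", "ab a")

print(left_first)      # ['ab', 'a']
print(right_shadowed)  # ['a', 'a']
```

In the pattern from the question the two alternatives cannot match the same character, so their order does not matter there.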

Answer 2

\w means alphanumeric characters plus "_", and \s means whitespace characters: " \t\r\n\v\f" (the space character is included). So [\w]+|[^\s\w] matches either a run of one or more word characters, or a single character that is neither whitespace nor a word character (a punctuation mark, for example). Whitespace itself is never matched, so findall() returns the words and the punctuation marks as separate items.
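Putting it together, here is a rough sketch of what the pattern from the question returns on a sample sentence, and one possible way to drop the punctuation afterwards. The sample text and the filtering step are assumptions added for illustration, not part of the original question; re.fullmatch requires Python 3.4 or later.

```python
import re

text = "Hello, world! It's 2013."  # made-up sample input

# Runs of word characters, or single punctuation/symbol characters
tokens = re.findall(r"[\w]+|[^\s\w]", text)

# Keep only the tokens that consist entirely of word characters
words_only = [t for t in tokens if re.fullmatch(r"\w+", t)]

print(tokens)      # ['Hello', ',', 'world', '!', 'It', "'", 's', '2013', '.']
print(words_only)  # ['Hello', 'world', 'It', 's', '2013']
```

Note that the apostrophe splits "It's" into three tokens, because ' is neither whitespace nor a word character.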

Answer 1
1 accepted 2013-04-03 04:29:45

Answer 2
0 2013-04-03 04:29:53
