Clarifications on the re.findall() method in Python
I wanted to strip a string of punctuation marks and I ended up using
re.findall(r"[\w]+|[^\s\w]", text)
It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library and it says:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
I am not sure if I get this and the clarification sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?
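To make the quoted documentation a bit more concrete, here is a small sketch of how the result changes with the number of groups in the pattern. The sample string and the variable names are made up for illustration only.

```python
import re

# Hypothetical sample text, chosen only to show the effect of groups.
text = "key1=val1, key2=val2"

# No groups: each item is the whole match.
no_groups = re.findall(r"\w+=\w+", text)

# One group: each item is only the text captured by that group.
one_group = re.findall(r"(\w+)=\w+", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+)=(\w+)", text)

print(no_groups)   # ['key1=val1', 'key2=val2']
print(one_group)   # ['key1', 'key2']
print(two_groups)  # [('key1', 'val1'), ('key2', 'val2')]
```

The pattern from the question has no parentheses, so it falls into the first case and returns plain strings.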
To break it down, [] creates a character class. You'll often see things like [abc], which will match a, b or c. Conversely, you also might see [^abc], which will match anything that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges and it will match any of a, b, c, A, B, C.
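A quick way to check the three kinds of character class described above is to run each one over the same string. The sample string below is just an illustrative assumption.

```python
import re

# Hypothetical sample string for illustration.
sample = "abcdABCD"

# [abc] matches a single a, b or c.
in_class = re.findall(r"[abc]", sample)

# [^abc] matches any single character that is NOT a, b or c.
negated = re.findall(r"[^abc]", sample)

# [a-cA-C] combines two ranges: a to c and A to C.
ranges = re.findall(r"[a-cA-C]", sample)

print(in_class)  # ['a', 'b', 'c']
print(negated)   # ['d', 'A', 'B', 'C', 'D']
print(ranges)    # ['a', 'b', 'c', 'A', 'B', 'C']
```

Note that a character class always matches exactly one character per match, which is why each result is a list of single-character strings.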
In this case, your character class contains the special tokens \w and \s. \w matches anything letter-like. Exactly what it matches depends on the flags in effect: for str patterns in Python 3 it covers Unicode word characters by default, while with the re.ASCII flag it is the same thing as [a-zA-Z0-9_], which matches anything in the ranges a-z, A-Z, 0-9 or _. \s is similar, but it matches anything that can be considered whitespace.
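As a rough check of what \w and \s pick up, the sketch below runs both over a short string in Python 3. The sample string is invented for the example; it includes an accented letter to show the difference between the default behaviour for str patterns and the re.ASCII flag.

```python
import re

# Hypothetical sample: "café_1", a space, a tab, and an exclamation mark.
sample = "caf\u00e9_1 \t!"

# Default for str patterns in Python 3: \w also matches non-ASCII letters.
word_chars_unicode = re.findall(r"\w", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
word_chars_ascii = re.findall(r"\w", sample, flags=re.ASCII)

# \s matches whitespace characters such as space and tab.
whitespace = re.findall(r"\s", sample)

print(word_chars_unicode)  # ['c', 'a', 'f', 'é', '_', '1']
print(word_chars_ascii)    # ['c', 'a', 'f', '_', '1']
print(whitespace)          # [' ', '\t']
```

The exclamation mark appears in none of the three lists, which is exactly the kind of character that [^\s\w] is meant to catch.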
The + means that you can repeat the previous match 1 or more times, so [a]+ will match the entire string aaaaaaaaaaa. In your case, you're matching alphanumeric characters that are next to each other.
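To see what the + changes, compare the same character class with and without it. The input strings here are illustrative assumptions.

```python
import re

# Without +, the class matches one character at a time.
single = re.findall(r"[a]", "aaa")

# With +, consecutive matching characters are grouped into one match.
run = re.findall(r"[a]+", "aaa")

# The same idea with \w: each run of adjacent word characters is one match.
words = re.findall(r"[\w]+", "hello big_world 42")

print(single)  # ['a', 'a', 'a']
print(run)     # ['aaa']
print(words)   # ['hello', 'big_world', '42']
```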
The | is basically like "or": match the stuff on the left, or match the stuff on the right if the left stuff doesn't match.
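The "left first, then right" behaviour can be seen with a deliberately tiny example. The patterns and input below are made up for illustration and are not taken from the question.

```python
import re

# The left alternative is tried first at each position.
# Here "a" succeeds immediately, so "ab" is never tried, and the
# leftover "b" matches neither alternative.
left_wins = re.findall(r"a|ab", "ab")

# Swapping the order lets the longer alternative match first.
longer_first = re.findall(r"ab|a", "ab")

print(left_wins)     # ['a']
print(longer_first)  # ['ab']
```

In the pattern from the question the two alternatives cannot match the same character, so their order does not change the result there.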
\w means alphanumeric characters plus "_". And \s means whitespace characters, including "\t\r\n\v\f" and the space character " ". So [\w]+|[^\s\w] matches either a run of one or more word characters, or a single character that is neither whitespace nor a word character (for example, a punctuation mark).
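Putting the pieces together, here is a short sketch of what the full pattern from the question returns on a sample sentence. The sentence is an assumption made for the example. The second call is only a possible variation for the case where the punctuation should be dropped entirely rather than kept as separate items.

```python
import re

# Hypothetical sample sentence.
text = "Hello, world! How are you?"

# Pattern from the question: runs of word characters, or single
# characters that are neither whitespace nor word characters.
tokens = re.findall(r"[\w]+|[^\s\w]", text)

# Possible variation: keep only the runs of word characters.
words_only = re.findall(r"\w+", text)

print(tokens)      # ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']
print(words_only)  # ['Hello', 'world', 'How', 'are', 'you']
```

Whitespace matches neither alternative, so it is skipped and acts only as a separator between the returned items.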