
Clarifications on the re.findall() method in Python

I wanted to strip a string of punctuation marks and I ended up using:

re.findall(r"[\w]+|[^\s\w]", text)

It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library documentation and it says:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
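The part of that description about groups is easier to see with a small experiment. The following is only a sketch: the sample string and the three patterns are made up for illustration and are not from the question.

```python
import re

text = "x=1, y=22"  # made-up sample input

# No groups: each match is returned as a plain string.
no_groups = re.findall(r"\w=\d+", text)

# One group: only the text captured by the group is returned.
one_group = re.findall(r"\w=(\d+)", text)

# Two groups: each match becomes a tuple of the captured texts.
two_groups = re.findall(r"(\w)=(\d+)", text)

print(no_groups)   # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```

The pattern in the question uses square brackets, not parentheses, so it has no groups and findall() returns a flat list of strings.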

I am not sure if I get this and the clarification sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?

To break it down, [] creates a character class. You'll often see things like [abc], which will match a, b or c. Conversely, you might also see [^abc], which will match anything that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges and will match any of a, b, c, A, B, C.
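The three kinds of character class described above can be tried out directly. This is a minimal sketch; the sample string is an assumption chosen so that each class matches something different.

```python
import re

text = "abcABCxyz"  # made-up sample input

# [abc] matches any single one of a, b or c.
in_class = re.findall(r"[abc]", text)

# [^abc] matches any single character that is NOT a, b or c.
not_in_class = re.findall(r"[^abc]", text)

# [a-cA-C] combines two ranges: a to c and A to C.
in_ranges = re.findall(r"[a-cA-C]", text)

print(in_class)      # ['a', 'b', 'c']
print(not_in_class)  # ['A', 'B', 'C', 'x', 'y', 'z']
print(in_ranges)     # ['a', 'b', 'c', 'A', 'B', 'C']
```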

In this case, your character class contains special tokens: \w and \s. \w matches anything letter-like. Exactly what it matches depends on the flags in effect (Unicode, ASCII or locale; in Python 3, str patterns are Unicode-aware by default), but for ASCII text it is the same thing as [a-zA-Z0-9_], which matches anything in the ranges a-z, A-Z, 0-9, or _. \s is similar, but it matches anything that can be considered whitespace.
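To see what \w and \s pick out, here is a short sketch. The sample strings are invented, and the re.ASCII comparison assumes Python 3, where str patterns are Unicode-aware unless that flag is given.

```python
import re

text = "hi_there 42\t!"  # made-up sample input

# \w matches letters, digits and the underscore.
word_chars = re.findall(r"\w", text)

# \s matches whitespace such as the space and the tab.
space_chars = re.findall(r"\s", text)

# In Python 3, \w also matches non-ASCII letters by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
accented = "\u00e9"  # the single character e with an acute accent
unicode_match = re.findall(r"\w", accented)
ascii_match = re.findall(r"\w", accented, flags=re.ASCII)

print(word_chars)     # ['h', 'i', '_', 't', 'h', 'e', 'r', 'e', '4', '2']
print(space_chars)    # [' ', '\t']
print(unicode_match)  # ['é']
print(ascii_match)    # []
```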

The + means that the previous item can be repeated 1 or more times, so [a]+ will match the entire string aaaaaaaaaaa. In your case, you're matching runs of word characters that are next to each other.
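The effect of + is easiest to see by comparing a pattern with and without it. A minimal sketch, with made-up input strings:

```python
import re

# Without +, every single character is its own match.
single = re.findall(r"[a]", "aaa")

# With +, a whole run of adjacent matching characters is one match.
run = re.findall(r"[a]+", "aaaaaaaaaaa")

# The same idea with \w: each run of word characters becomes one item.
words = re.findall(r"[\w]+", "two words")

print(single)  # ['a', 'a', 'a']
print(run)     # ['aaaaaaaaaaa']
print(words)   # ['two', 'words']
```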

The | is basically like "or": match the stuff on the left, or match the stuff on the right if the left stuff doesn't match.
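Because the left alternative is tried first, the order of the alternatives can change the result. The sketch below uses two toy patterns that are not from the question, purely to illustrate that ordering.

```python
import re

# The left branch "a" succeeds first, so "ab" is never tried;
# the leftover "b" matches neither branch.
left_first = re.findall(r"a|ab", "ab")

# With the longer branch on the left, the whole "ab" is matched.
longer_first = re.findall(r"ab|a", "ab")

print(left_first)    # ['a']
print(longer_first)  # ['ab']
```

In the pattern from the question the two branches can never match at the same position (a character is either a word character or it is not), so their order does not matter there.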

\w means alphanumeric characters plus "_". And \s means whitespace characters, including "\t\r\n\v\f" and the space character " ". So [\w]+|[^\s\w] matches either a run of one or more word characters, or a single character that is neither whitespace nor a word character (for example, a punctuation mark).
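As a quick check of what the full pattern returns, here is a small sketch. The sample sentence is made up, and the second call (keeping only the \w+ runs) is just one possible way to drop punctuation, not necessarily what the asker did.

```python
import re

text = "Hello, world! It's 2 o'clock."  # made-up sample input

# Full pattern: runs of word characters, or single punctuation characters.
tokens = re.findall(r"[\w]+|[^\s\w]", text)

# Keeping only the word-character runs discards the punctuation entirely.
words_only = re.findall(r"\w+", text)

print(tokens)
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2', 'o', "'", 'clock', '.']
print(words_only)
# ['Hello', 'world', 'It', 's', '2', 'o', 'clock']
```

Note that the full pattern does not remove punctuation by itself: each punctuation mark comes back as its own one-character item, and whitespace is the only thing that is skipped.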
