What's the difference between ([])+ and []+?

>>> import re
>>> sentence = "Thomas Jefferson began building Monticello at the age of 26."
>>> tokens1 = re.split(r"([-\s.,;!?])+", sentence)
>>> tokens2 = re.split(r"[-\s.,;!?]+", sentence)
>>> tokens1
['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
>>> tokens2
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']

Can you explain the purpose of ( and )?

(..) in a regex denotes a capturing group (also known as "capturing parentheses"). Capturing groups are used when you want to extract values out of a pattern. In this case, you are using the re.split function, which behaves in a specific way when the pattern has capturing groups. According to the documentation:

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

So normally, the delimiters used to split the string are not present in the result, as in your second example. However, if you use (), the text captured by the groups will also appear in the result of the split. This is why you get a lot of ' ' entries in the first example: that is what your group ([-\s.,;!?]) captures.
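As a small illustration of that behavior, here is a sketch on a shorter string. The sample text and the variable names are made up for the example and are not from the question.

```python
import re

# Hypothetical sample string with several different delimiters.
text = "a, b;c"

# No capturing group: the delimiters are dropped from the result.
without_group = re.split(r"[\s,;]+", text)

# Capturing group: the text matched by the group is kept in the result.
with_group = re.split(r"([\s,;])+", text)

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', ' ', 'b', ';', 'c']
```

Note that the separator ", " between a and b shows up only as ' ' in the second list, because the group is repeated by the + and keeps only its last match.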

When the regex used to split a string contains a capturing group ( () ), split will include the captured parts in its result.

In your case, you are splitting on one or more characters of whitespace and/or punctuation, and capturing only the last of those characters to include in the split parts, which seems like a weird thing to do. I'd have expected that you might want to capture the whole separator, which would look like r"([-\s.,;!?]+)" (capturing one or more whitespace/punctuation characters, rather than matching one or more but capturing only the last).
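To make the difference between the two placements of + concrete, here is a minimal sketch. The shortened sentence and the variable names are assumptions chosen for the example.

```python
import re

# Hypothetical shortened sentence containing a two-character separator ", ".
sentence = "Monticello, at 26."

# + outside the group: the group repeats and keeps only the last character.
last_only = re.split(r"([-\s.,;!?])+", sentence)

# + inside the group: the group captures the entire separator.
whole_sep = re.split(r"([-\s.,;!?]+)", sentence)

print(last_only)  # ['Monticello', ' ', 'at', ' ', '26', '.', '']
print(whole_sep)  # ['Monticello', ', ', 'at', ' ', '26', '.', '']

# With the whole separator captured, joining the parts rebuilds the input.
print("".join(whole_sep) == sentence)  # True
print("".join(last_only) == sentence)  # False, the comma was lost
```

Capturing the whole separator is the form to reach for if you need to reconstruct the original string from the pieces later.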
