
How can I build a regular expression that captures words separated by single spaces?

I want to build a regular expression that captures

Fee fie foe foo!

but when there is more than one space between words:

Fee fie  foe foo!

only captures "Fee fie".

My regex looks something like this:

words_re = re.compile(r"\w[-\w .,!]*")

As you can see, this captures any sequence that starts with an alphanumeric character and is followed by any combination of alphanumerics, spaces, and a few chosen punctuation marks. I just want to limit it to one space at a time.

Alternatively, a variant of string.split() that also returns the separating whitespace spans would work for me.
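One existing way to get such a split is re.split with a capturing group: when the pattern contains a capture group, the text it matched is kept in the result list. A minimal sketch using the sample string from above (the variable names are only for illustration):

```python
import re

s = "Fee fie  foe foo!"

# The capture group makes re.split keep the whitespace runs it split on,
# so words and separators alternate in the result.
parts = re.split(r"(\s+)", s)
print(parts)
# ['Fee', ' ', 'fie', '  ', 'foe', ' ', 'foo!']
```

The separators sit at the odd indices, so a gap wider than one space is any odd-indexed item other than a single space.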

The closest I've gotten is this:

words_re = re.compile(r"\w[-\w.,!]*|\s+")
l = words_re.findall(s)

but I then need to search the returned list for sub-lists that contain only single-space separators, and rebuild the strings from those.

One thought I had was to take the result of the expression above and split it further with string.split("  ") to break it into sub-groups that were separated by two spaces, but then what about the three-space case, and so forth?
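The rebuilding step described above can be sketched as a single pass over the tokens, which handles two, three, or more spaces the same way. This is only an illustration under the assumption that any whitespace token other than one plain space ends the current group; the helper name single_space_runs is hypothetical.

```python
import re

words_re = re.compile(r"\w[-\w.,!]*|\s+")

def single_space_runs(s):
    """Group words separated by exactly one space (hypothetical helper)."""
    runs, current = [], []
    for token in words_re.findall(s):
        if token.isspace():
            # Any separator other than a single space closes the current run.
            if token != " " and current:
                runs.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        runs.append(" ".join(current))
    return runs

print(single_space_runs("Fee fie  foe foo!"))
# ['Fee fie', 'foe foo!']
```

Characters that match neither alternative of the pattern are dropped by findall, so this sketch treats the words on either side of them as adjacent.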

This matches the whole string, provided every gap between words is a single whitespace character:

^(\w+(?:\s[-.!\w]+)*(?:[-.!\w]*$))

Regex Demo

If you only want to match up to the first run of more than one space, you can use the following. (It matches only from the start of the string; you can remove the anchor if you want to capture every possibility.)

^(\w[-.!\w]*(?:\s[-.!\w]+)*)
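To compare the two patterns above, here is a small check against the two sample strings from the question. The variable names are made up for this sketch. Note that \s accepts any single whitespace character, not just a space.

```python
import re

# First pattern: anchored at both ends, so it matches all or nothing.
whole_string = re.compile(r"^(\w+(?:\s[-.!\w]+)*(?:[-.!\w]*$))")
# Second pattern: anchored at the start only, so it stops at the first wide gap.
prefix_only = re.compile(r"^(\w[-.!\w]*(?:\s[-.!\w]+)*)")

single = "Fee fie foe foo!"
double = "Fee fie  foe foo!"

print(whole_string.match(single).group(1))  # Fee fie foe foo!
print(whole_string.match(double))           # None
print(prefix_only.match(single).group(1))   # Fee fie foe foo!
print(prefix_only.match(double).group(1))   # Fee fie
```

Only the second pattern gives the "Fee fie" result the question asks for; the first one rejects the double-spaced string outright.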

Regex Demo

Try this:

^((?:\w+(?: |[^ ]$))+)

You can see it live here

  • We first match a word with \w+
  • Then we allow it to be followed by either one space, or by any single non-space character if that character ends the string: (?: |[^ ]$)
  • The outer + repeats this, so every word followed by one space is matched until a wider gap or the end of the string is reached
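A quick way to try the pattern above in Python, using the sample strings from the question. One detail worth noticing in this sketch: when the match stops at a double space, the captured group keeps the single trailing space, so it may need an rstrip().

```python
import re

pattern = re.compile(r"^((?:\w+(?: |[^ ]$))+)")

m1 = pattern.match("Fee fie foe foo!")
m2 = pattern.match("Fee fie  foe foo!")

print(repr(m1.group(1)))           # 'Fee fie foe foo!'
print(repr(m2.group(1)))           # 'Fee fie ' (note the trailing space)
print(repr(m2.group(1).rstrip()))  # 'Fee fie'
```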

Alternative solution without using a regex:

import itertools

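def up_to_double_space(text):
    # split(" ") yields an empty string between two adjacent spaces;
    # takewhile stops at that first empty (falsy) item.
    return " ".join(itertools.takewhile(lambda word: word, text.split(" ")))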

up_to_double_space("Fee fie foe foo!")
# 'Fee fie foe foo!'
up_to_double_space("Fee fie  foe foo!")
# 'Fee fie'

This is more of a comment than a solution, but I lack the rep to comment. There is a split-based solution that might work for you. split takes a single separator argument and splits on it. If you use a single space as the argument, an empty string is inserted in the list wherever two spaces are adjacent. The downside is that other whitespace (tabs, etc.) will not cause a split.

In [15]: x = 'fie fie  foo fum'

In [16]: x.split(' ')
Out[16]: ['fie', 'fie', '', 'foo', 'fum']

In [17]: x.split(' ')[:x.split(' ').index('')]
Out[17]: ['fie', 'fie']

It's also not selective about punctuation, which might be an issue.

In general I think a regex is the correct answer, but if this handles all your needs, it's a lot simpler to use and maintain.
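Two caveats with the session above: index('') raises ValueError when the string has no double space, and tabs are not treated as separators. A possible variant, offered only as a sketch, splits on every single whitespace character with re.split and stops at the first empty item; the function name words_before_gap is made up for illustration.

```python
import re
from itertools import takewhile

def words_before_gap(text):
    # re.split(r"\s", ...) splits at every single whitespace character, so two
    # adjacent whitespace characters leave an empty string between them.
    parts = re.split(r"\s", text)
    # takewhile stops at the first empty string and simply returns everything
    # when there is none, so no exception is raised.
    return list(takewhile(bool, parts))

print(words_before_gap('fie fie  foo fum'))  # ['fie', 'fie']
print(words_before_gap('fie fie foo fum'))   # ['fie', 'fie', 'foo', 'fum']
print(words_before_gap('fie\tfie \tfoo'))    # ['fie', 'fie']
```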
