How to match anything except two or more consecutive spaces in a regex?

I have a test string like

string = ' a      title of foo        b '

I would like to capture title of foo from string. In other words, the match starts after any number of spaces, consists of letters and spaces with never more than one consecutive space, and is followed again by any number of spaces.

My attempt (in Python):

import re
string = '      title of foo        '
match = re.match(r'\s*([^\s{2,}])*\s*', string)

This doesn't work, I think because square brackets expect a set of single characters rather than a quantified pattern.
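As a rough check of that suspicion: inside square brackets, \s{2,} is read as a set of single characters, so [^\s{2,}] excludes whitespace plus the literal characters {, 2, , and }. The variable names below are only illustrative.

```python
import re

# Inside [...], "\s{2,}" is a set of single characters, not "two or more spaces".
char_class = re.compile(r'[^\s{2,}]')

rejects_space = char_class.match(' ') is None    # whitespace is excluded
rejects_two = char_class.match('2') is None      # so is the literal digit 2
rejects_brace = char_class.match('{') is None    # and the literal brace
accepts_letter = char_class.match('t') is not None

# The attempted pattern does match, but a repeated capturing group keeps only
# its last iteration, which here is a single character.
match = re.match(r'\s*([^\s{2,}])*\s*', '      title of foo        ')
captured = match.group(1)
print(captured)
```

So the pattern stops at the first space inside the title and the group never holds more than one character.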

It would be easier to just use:

stripped_string = string.strip()

The strip() method removes whitespace from the start and end of a string, so this works when the title is surrounded only by whitespace, as in your second example.
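A small sketch of where strip() is enough and where it is not, using the two strings from the question (the variable names are made up for illustration):

```python
padded = '      title of foo        '
with_neighbours = ' a      title of foo        b '

# Only whitespace around the title: strip() is all that is needed.
stripped_padded = padded.strip()

# Other tokens around the title: strip() removes just the outermost spaces,
# so 'a' and 'b' and the wide gaps are still there.
stripped_neighbours = with_neighbours.strip()

print(repr(stripped_padded))
print(repr(stripped_neighbours))
```

For the first test string in the question, something that looks at the gaps between tokens is still required.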

You can use this lookahead based regex:

>>> import re
>>> string = ' a      title of foo        b '
>>> print(re.search(r'\S+(?:(?!\s{2}).)+', string).group())
title of foo
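The same pattern, wrapped as a sketch in a small helper that returns None instead of raising when nothing matches. The helper name first_single_spaced_run is hypothetical, not part of the answer above.

```python
import re

# \S+ starts at a non-space character; the group then consumes one character
# at a time, as long as the next two characters are not both whitespace.
SINGLE_SPACED = re.compile(r'\S+(?:(?!\s{2}).)+')

def first_single_spaced_run(text):
    """Return the first run matched by the pattern, or None if there is none."""
    found = SINGLE_SPACED.search(text)
    return found.group() if found else None

result = first_single_spaced_run(' a      title of foo        b ')
no_result = first_single_spaced_run('      ')
print(result)
```

Two things to keep in mind: the pattern needs at least two characters, which is why the lone a is skipped, and a single trailing space at the very end of the input is included in the match because the lookahead only rejects two whitespace characters in a row.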

When you want to match everything except X, it's often simpler to split by X instead. In other words: instead of inverting the regex, invert the operation.

In your case, just re.split by two or more spaces, i.e. \s{2,}, and keep what remains.

>>> text = '      title of foo       more text   and some more     '
>>> re.split(r'\s{2,}', text)
['', 'title of foo', 'more text', 'and some more', '']

This yields two additional empty strings at the very beginning and end of the list, but you can easily get rid of them, e.g. using filter or a list comprehension:

>>> list(filter(None, re.split(r'\s{2,}', text)))
['title of foo', 'more text', 'and some more']
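The list-comprehension variant mentioned above could look like this; it is a minimal sketch using the same text value, and parts is just an illustrative name.

```python
import re

text = '      title of foo       more text   and some more     '

# Split on runs of two or more whitespace characters, then drop the empty
# strings produced by the leading and trailing padding.
parts = [part for part in re.split(r'\s{2,}', text) if part]
print(parts)
```

If only the first chunk is wanted, parts[0] gives it, provided the list is not empty.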

In my opinion, this is much simpler and more concise than a complex regex using lots of lookaheads and stuff to actually match the part that's not two or more spaces.

I would go with

/(\b\w+(?: \w+\b)+)/

regex101

You can use the code generator on the left side of that page to produce this version:

import re
p = re.compile(r'(\b\w+(?: \w+\b)+)')
test_str = "string = ' a      title of foo        b '"

re.findall(p, test_str)  # ['title of foo']

Your match would then contain only title of foo, without any of the other strings that have more than a single space between words.

If your characters are not always \w word characters but can be anything other than whitespace, you can change \w to \S (and drop the \b anchors, which only match next to a word character) so it will match things like

rabbit :gold: !whisker?

as those contain only a single space between them.
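A sketch of that \S variant, under the assumption that the \b anchors are dropped along with \w (a word boundary only exists next to a word character, so it would cut off trailing punctuation). The sample string and names are invented for illustration.

```python
import re

# Assumed variant: \w swapped for \S, with the \b anchors removed.
NON_SPACE_RUN = re.compile(r'\S+(?: \S+)+')

sample = 'x   rabbit :gold: !whisker?   y'
runs = NON_SPACE_RUN.findall(sample)
print(runs)

# For comparison: the same swap with \b kept cannot end a match on the
# trailing '?', because there is no word boundary between '?' and a space.
with_boundaries = re.findall(r'(\b\S+(?: \S+\b)+)', sample)
print(with_boundaries)
```

Like the \w version, this only matches runs of at least two tokens, so the isolated x and y are ignored.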

I think this looks fairly clean but it does rely on double spaces existing either side of the text. I prefer anubhava's solution.

import re
string = ' a      title of foo        b '
regex = r'(?<=  )(\S.*?\S?)(?=  )'
output = re.findall(regex, string)[0]
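Since indexing [0] raises IndexError when there is no double space on both sides, a guarded version might look like the sketch below. The helper name between_double_spaces is hypothetical.

```python
import re

# Text that starts and ends with a non-space character and has two spaces
# immediately before and after it.
BETWEEN_DOUBLE_SPACES = re.compile(r'(?<=  )(\S.*?\S?)(?=  )')

def between_double_spaces(text):
    """Return the first such run, or None when the text has none."""
    found = BETWEEN_DOUBLE_SPACES.findall(text)
    return found[0] if found else None

hit = between_double_spaces(' a      title of foo        b ')
miss = between_double_spaces('title of foo')
print(hit, miss)
```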

