Python Regex match every other word

I've been searching for a way to match every other word in Python using regex. The string is comma-separated and of unknown length.

Say I have the following string:

"keep, ignore, keep_this_too, ignore, keep_this_also, ignore"

I would like to be able to keep all the matching words as a list.

I tried writing my regex as:

((?P<keep>.*),)*

then using

result = re.match(regex, string)
print(result.group('keep'))

in an attempt to print all the matching words, but instead I just get everything except the last word.

Thanks

Edit:

I cannot use any Python string operations. The goal is to support any data format provided by researchers; to do this, we store a regex in a database for each format. For example, they could provide a data format like the following, which would need its own regex:

"keep (ignore), keep (ignore), keep (ignore)"

.* matches greedily (it matches as much as possible), so .*, matches everything up to the last comma. To match non-greedily, use .*? instead.

Also, re.match returns only the first match, and it matches only at the beginning of the input string. (See search() vs match() in the re documentation.)
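
A small sketch of both points, using the string from the question (the variable names are only illustrative):

```python
import re

s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"

# Greedy: .* runs to the end of the string, then backtracks to the LAST comma.
greedy = re.match(r'(.*),', s).group(1)

# Non-greedy: .*? stops as soon as it can, at the FIRST comma.
lazy = re.match(r'(.*?),', s).group(1)

# re.match only tries position 0; re.search scans forward for the first match.
at_start = re.match(r'ignore', s)    # None, because the string starts with "keep"
anywhere = re.search(r'ignore', s)   # finds the first "ignore" further along

print(greedy)
print(lazy)
print(at_start, anywhere.start())
```

The greedy result is exactly the "everything but the last word" output described in the question.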

Using re.findall with the modified regular expression:

>>> s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
>>> re.findall(r'([^,\s]+)', s)
['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore']
>>> re.findall(r'([^,\s]+)', s)[::2]  # slice to keep every other match
['keep', 'keep_this_too', 'keep_this_also']

or:

>>> re.findall(r'([^,\s]+)(?:,\s*[^,\s]+)?', s)
['keep', 'keep_this_too', 'keep_this_also']
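
Since the question's edit mentions storing one regex per data format, here is a minimal sketch of how the single-pattern approach might be wired up. The FORMATS dict is only a stand-in for the database table, and the format names, the extract helper, and the second pattern (a guess at what the "keep (ignore)" format would need) are all hypothetical:

```python
import re

# Hypothetical stand-in for the database: format name -> stored regex.
# Each pattern captures the wanted value in group 1 and consumes the
# unwanted value without capturing it.
FORMATS = {
    "comma_alternating": r'([^,\s]+)(?:,\s*[^,\s]+)?',
    "keep_paren_ignore": r'([^,\s(]+)\s*\([^)]*\)',
}

def extract(format_name, text):
    """Return every group-1 match of the regex stored for format_name."""
    return re.findall(FORMATS[format_name], text)

print(extract("comma_alternating",
              "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"))
print(extract("keep_paren_ignore",
              "keep (ignore), keep (ignore), keep (ignore)"))
```

Because every stored pattern exposes the wanted value as its only capturing group, the calling code stays the same for each format and no string methods are needed.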

Could you store a .split() call in the database instead?

String="keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
String.split(",")[0::2]

Output:

['keep', ' keep_this_too', ' keep_this_also']

Regexes already define which characters can appear in a word: \w denotes that set. Hence:

In [1]: import re
   ...: re.findall(r'\w+', "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
   ...: 
Out[1]: ['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore']

If you want to ignore every other match simply use slicing:

In [2]: ['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore'][::2]
Out[2]: ['keep', 'keep_this_too', 'keep_this_also']

If you want to keep only strings that start with keep (or another substring), simply use the pattern keep\w* instead of \w+:

In [4]: re.findall(r'keep\w*', "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
Out[4]: ['keep', 'keep_this_too', 'keep_this_also']

If what you are trying to match is not really a word, i.e. it can contain characters such as spaces or punctuation, then you can replace \w with [^,] in the regexes above to match everything except the comma.
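
As a rough illustration of that substitution, the sketch below uses made-up values that contain spaces. The \s* in front of the group is an addition of mine so that the space after each comma is not captured:

```python
import re

# Made-up sample: the values themselves contain spaces, so \w+ would split them.
s = "New York, skip me, Los Angeles, skip me too"

# [^,]+ matches a run of anything except commas; the leading \s* sits
# outside the group and swallows the space that follows each comma.
items = re.findall(r'\s*([^,]+)', s)

# Keep every other item, starting with the first.
print(items[::2])
```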

You could use something like:

import re
re.findall(r"\s*([^,]*), [^,]+,?", "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")  # leading \s* keeps the space after each comma out of the captured group

But why not just use split and slice the result:

"keep, ignore, keep_this_too, ignore, keep_this_also, ignore".split(",")[0::2]

You need this:

s = ' keep, ignore,  keep_this_too  , ignore, keep_this_also, ignore '
print(s.replace(' ','').split(',')[0::2])

yields:

['keep', 'keep_this_too', 'keep_this_also']

How about this?

>>> s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
>>> import re
>>> re.findall(r'(\w+)\W+\w+', s)
['keep', 'keep_this_too', 'keep_this_also']
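
The same idea can be written with the named group from the question. This is only a sketch, reusing the group name keep from the original attempt: each match consumes a pair of words, and only the first word of the pair is captured.

```python
import re

s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"

# (?P<keep>\w+) captures the first word of each pair; \W+\w+ then consumes
# the separator and the second word, so the next match starts at a new pair.
pattern = re.compile(r'(?P<keep>\w+)\W+\w+')

kept = [m.group('keep') for m in pattern.finditer(s)]
print(kept)
```

Note that with an odd number of words the final, unpaired word is not returned, because the pattern requires a second word after each captured one.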
