I've been searching for a way to match every other word in Python using a regex. The string is comma-separated and of unknown length.
Say I have the following string:
"keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
I would like to be able to keep all the matching words as a list.
I tried writing my regex as:
((?P<keep>.*),)*
then using
result = re.match(regex, string)
print(result.group('keep'))
in an attempt to print all the matching words, but instead I just get everything but the last word.
Thanks
Edit:
I cannot use any Python string operations. The goal is to support any data format provided by researchers, so we store a regex in a database for each format. For example, they could provide a data format like the following, which would need its own regex:
"keep (ignore), keep (ignore), keep (ignore)"
.* matches greedily (it matches as much as possible), so .*, matches everything up to the last comma. To match non-greedily, use .*? instead.
Also, re.match returns only the first match, and it matches only at the beginning of the input string. (See search() vs match().)
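To make the difference concrete, here is a minimal sketch using the string from the question. It shows what the greedy and non-greedy patterns match, and why the repeated named group from the question ends up holding everything but the last word. The variable names are only illustrative.

```python
import re

s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"

# Greedy: .* runs to the end of the string, then backtracks to the last comma.
greedy = re.match(r'.*,', s).group()

# Non-greedy: .*? stops as soon as the first comma can be matched.
lazy = re.match(r'.*?,', s).group()

# The pattern from the question: the greedy .* swallows everything up to the
# last comma in a single repetition, so the group holds one long string.
original = re.match(r'((?P<keep>.*),)*', s).group('keep')

print(greedy)
print(lazy)
print(original)
```

Even with a non-greedy quantifier, a repeated group keeps only its last capture, which is why findall is the better fit for collecting every word.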
Using re.findall with the modified regular expression:
>>> s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
>>> re.findall(r'([^,\s]+)', s)
['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore']
>>> re.findall(r'([^,\s]+)', s)[::2] # slice to keep every other match
['keep', 'keep_this_too', 'keep_this_also']
or:
>>> re.findall(r'([^,\s]+)(?:,\s*[^,\s]+)?', s)
['keep', 'keep_this_too', 'keep_this_also']
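Because the single-pattern version needs no slicing, it fits the "one regex per format" setup described in the question. The sketch below is only an illustration of that idea: the FORMAT_PATTERNS dictionary stands in for the database, and the format names, the extract helper, and the pattern for the parenthesized format are all assumptions rather than anything from the original post.

```python
import re

# Hypothetical stand-in for the database: one findall pattern per data format.
FORMAT_PATTERNS = {
    # keep, ignore, keep, ignore, ...
    "comma_alternating": r'([^,\s]+)(?:,\s*[^,\s]+)?',
    # keep (ignore), keep (ignore), ...
    "parenthesized": r'(\w+)\s*\([^)]*\)',
}

def extract(format_name, text):
    """Return the kept words for a format, using only its stored regex."""
    return re.findall(FORMAT_PATTERNS[format_name], text)

a = extract("comma_alternating",
            "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
b = extract("parenthesized", "keep (ignore), keep (ignore), keep (ignore)")
print(a)
print(b)
```

Each pattern has exactly one capturing group, so findall returns a flat list of the kept words for either format.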
Could you store a .split() call in the database instead?
String="keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
String.split(",")[0::2]
Output:
['keep', ' keep_this_too', ' keep_this_also']
Regexes already define which characters can appear in a word: \w denotes that set. Hence:
In [1]: import re
...: re.findall(r'\w+', "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
...:
Out[1]: ['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore']
If you want to ignore every other match simply use slicing:
In [2]: ['keep', 'ignore', 'keep_this_too', 'ignore', 'keep_this_also', 'ignore'][::2]
Out[2]: ['keep', 'keep_this_too', 'keep_this_also']
If you want to keep only strings that start with keep (or another substring), simply use the pattern keep\w* instead of \w+:
In [4]: re.findall(r'keep\w*', "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
Out[4]: ['keep', 'keep_this_too', 'keep_this_also']
If what you are trying to match is not really a word, i.e. it can contain characters such as spaces or punctuation, then you can replace \w with [^,] in the regexes above to match everything except the comma.
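One thing to watch for with [^,] is that it also matches the spaces around each item. The sketch below is one possible way to trim them inside the regex itself, assuming items may contain internal spaces; the sample string and the pattern are illustrative, not taken from the question.

```python
import re

# Sample data with multi-word items (an assumption for illustration).
s = "new york, ignore, san francisco, ignore"

# Plain [^,]+ keeps the leading space of every item after the first.
untrimmed = re.findall(r'[^,]+', s)[::2]

# Skip surrounding whitespace outside the group; the lazy [^,]+? stops
# before trailing spaces that precede a comma or the end of the string.
trimmed = re.findall(r'\s*([^,]+?)\s*(?:,|$)', s)[::2]

print(untrimmed)
print(trimmed)
```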
You could use something like:
import re
re.findall("([^,]*), [^,]+[,]{0,1}", "keep, ignore, keep_this_too, ignore, keep_this_also, ignore")
But why not just use split and slice the result:
"keep, ignore, keep_this_too, ignore, keep_this_also, ignore".split(",")[0::2]
You need this:
s = ' keep, ignore, keep_this_too , ignore, keep_this_also, ignore '
print(s.replace(' ','').split(',')[0::2])
yields:
['keep', 'keep_this_too', 'keep_this_also']
How about this?
>>> s = "keep, ignore, keep_this_too, ignore, keep_this_also, ignore"
>>> import re
>>> re.findall(r'(\w+)\W+\w+', s)
['keep', 'keep_this_too', 'keep_this_also']
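Note that this pattern requires every kept word to be followed by an ignored one, so with an odd number of items the final word is dropped. A small sketch of the difference, assuming that case can occur in the data; making the ignored part optional is one way to handle it:

```python
import re

# Odd number of items: the last word should be kept but has no partner.
s = "keep, ignore, keep_this_too, ignore, last_one"

# Requires a following word, so 'last_one' is not matched.
strict = re.findall(r'(\w+)\W+\w+', s)

# Optional non-capturing group for the ignored word keeps the final item.
lenient = re.findall(r'(\w+)(?:\W+\w+)?', s)

print(strict)
print(lenient)
```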