I'm trying to parse through a file and search each line for keywords from a list of strings. For each occurrence, I need to return the 'n' characters before and after it. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:
    with open(file, 'r') as temp:
        for num, line in enumerate(temp, 1):
            for string in lookup:
                if string in line:
                    # Split the line in 2 substrings
                    tmp1 = line.split(string)[0]
                    tmp2 = line.split(string)[1]
                    # Truncate only 'n' characters before and after the keyword
                    tmp = tmp1[-n:] + string + tmp2[:n]
                    # Do something here...
This is what I have so far with regex:
    with open(file, 'r') as temp:
        for num, line in enumerate(temp, 1):
            for string in lookup:
                # Regex search, ignoring case
                searchObj = re.findall(string, line, re.M | re.I)
                if searchObj:
                    print("search --> : %s" % searchObj)
                    # Loop through searchObj and get n characters
From https://docs.python.org/2/library/re.html
start([group])
end([group])
Return the indices of the start and end of the substring matched by
group; group defaults to zero (meaning the whole matched substring).
Return -1 if group exists but did not contribute to the match. For a
match object m, and a group g that did contribute to the match, the
substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a
null string. For example, after m = re.search('b(c?)', 'cba'),
m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
2, and m.start(2) raises an IndexError exception.
Using re.finditer, you can generate an iterator of MatchObject instances and then call these methods on each one to get the start and end of the matched substring.
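As a minimal sketch of that idea (the sample line, the keyword, and the variable names are made up for illustration), each match object yielded by re.finditer reports where its occurrence starts and ends, which the plain list of strings returned by re.findall does not:

```python
import re

# Hypothetical sample data, not taken from the question
line = "The cat sat near another Cat."
keyword = "cat"

# One (start, end, text) tuple per case-insensitive occurrence
positions = [(m.start(), m.end(), m.group(0))
             for m in re.finditer(keyword, line, re.I)]
# positions == [(4, 7, 'cat'), (25, 28, 'Cat')]

# re.findall only returns the matched text, without any indices
texts = re.findall(keyword, line, re.I)
# texts == ['cat', 'Cat']
```

With the indices available, the surrounding characters can be taken directly from the line by slicing.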
I got it to work. Below is the code if anyone needs it:
    import re

    with open(file, 'r') as temp:
        for num, line in enumerate(temp, 1):
            for string in lookup:
                # finditer yields one match object per occurrence of the keyword
                for match in re.finditer(string, line, re.M | re.I):
                    # Start and end indices of the keyword
                    start, end = match.span()
                    # Keep only 'n' characters before and after the keyword;
                    # max() stops a negative start from wrapping around to the end of the line
                    tmp = line[max(0, start - n):end + n] + '\n'
                    print(tmp)
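Since the original goal was efficiency, one possible refinement is sketched below. It is not code from this thread: the helper names build_pattern and keyword_contexts and the sample data are hypothetical, and it assumes that every entry in lookup is literal text rather than a regex pattern (hence re.escape) and that lookup is not empty. Combining the keywords into one compiled alternation lets each line be scanned once instead of once per keyword.

```python
import re

def build_pattern(lookup):
    """Compile one case-insensitive pattern matching any keyword (hypothetical helper)."""
    # Assumption: keywords are literal text, so regex metacharacters are escaped.
    # Longer keywords go first so the alternation prefers them over their own prefixes.
    ordered = sorted(lookup, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered), re.I)

def keyword_contexts(line, pattern, n):
    """Return every match in line with up to n characters of context on each side."""
    # Clamp the start at 0: a negative slice start would count from the end of the line
    return [line[max(0, m.start() - n):m.end() + n]
            for m in pattern.finditer(line)]

# Hypothetical usage: build the pattern once, then reuse it for every line
pattern = build_pattern(["cat", "(or"])
snippets = keyword_contexts("a cat and a dog (or two)", pattern, 3)
# snippets == ['a cat an', 'og (or tw']
```

One trade-off to keep in mind: a single alternation reports non-overlapping matches only, so two keywords that overlap at the same position in a line yield one result rather than two, unlike the per-keyword loop.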