Python regex to get n characters before and after a keyword in a line of text
I'm trying to parse through a file and search each line for keywords from a list of strings. I need to return the 'n' characters before and after each occurrence. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:
with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:
                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]
                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]
                # Do something here...
This is my start with regex:
import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Case-insensitive regex search
            searchObj = re.findall(string, line, re.M | re.I)
            if searchObj:
                print("search --> : {}".format(searchObj))
                # Loop through searchObj and get n characters
From https://docs.python.org/2/library/re.html:
start([group])
end([group])
Return the indices of the start and end of the substring matched by
group; group defaults to zero (meaning the whole matched substring).
Return -1 if group exists but did not contribute to the match. For a
match object m, and a group g that did contribute to the match, the
substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a
null string. For example, after m = re.search('b(c?)', 'cba'),
m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
2, and m.start(2) raises an IndexError exception.
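The example in the quoted passage can be checked directly. This is only a small sketch of that one case; the variable names (whole, group1, matched, no_group_2) are made up for illustration.

```python
import re

# The example from the quoted documentation
m = re.search('b(c?)', 'cba')

# Indices of the whole match (group 0)
whole = (m.start(0), m.end(0))

# Group 1 matched the empty string, so its start and end are equal
group1 = (m.start(1), m.end(1))

# The matched substring, recovered from the indices
matched = m.string[m.start(0):m.end(0)]

# The pattern has no group 2, so asking for it raises IndexError
try:
    m.start(2)
    no_group_2 = False
except IndexError:
    no_group_2 = True

print(whole, group1, matched, no_group_2)
```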
Using re.finditer, you can generate an iterator of MatchObject instances and then call their start() and end() methods to get the start and end of each matched substring.
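As a rough illustration of that idea, the sketch below collects the start index, end index, and matched text of every occurrence. The sample line and the keyword "cat" are made up for the example.

```python
import re

# Hypothetical sample line; the keyword appears twice with different casing
line = "The cat sat near another Cat."

# re.finditer yields one match object per non-overlapping occurrence
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer("cat", line, re.I)]

for start, end, text in spans:
    print(start, end, text)
```

Unlike re.findall, which returns only the matched strings, each match object also carries the position of the match, which is what makes it possible to slice the surrounding characters.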
I got it to work. Below is the code, in case anyone needs it:
import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Case-insensitive search; finditer yields one match object per occurrence
            for match in re.finditer(string, line, re.M | re.I):
                # Start and end indices of the keyword
                start, end = match.span()
                # Keep only 'n' characters before and after the keyword;
                # max() stops a negative start from wrapping around to the end of the line
                tmp = line[max(0, start - n):end + n] + '\n'
                print(tmp)
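For anyone who wants to reuse this outside the file loop, here is one possible way to package it as a function. It is a sketch under two assumptions that are not stated in the question: the keywords are literal text rather than regex patterns (hence re.escape), and a keyword closer than n characters to the start of the line should simply yield a shorter snippet. The name keyword_context is made up.

```python
import re

def keyword_context(line, keyword, n):
    """Return each occurrence of keyword in line with up to n characters on each side."""
    # Assumption: keyword is literal text, so escape any regex metacharacters
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(line):
        # Clamp at 0 so a negative index does not wrap around to the end of the line
        start = max(0, match.start() - n)
        # Slicing past the end of a string is safe in Python, so no clamp is needed here
        end = match.end() + n
        snippets.append(line[start:end])
    return snippets

print(keyword_context("abcKEYdefkeyghi", "key", 2))
```

If the entries in lookup really are meant to be regex patterns, drop the re.escape call.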