Python regex: get n characters before and after a keyword in a line of text

Question

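I am trying to parse through a file and search each line for keywords from a list of strings. I need to return 'n' characters before and after each occurrence. I have it working without regex, but it is not very efficient. Any ideas on how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex: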

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:

                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]

                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]

                # Do something here...

This is my start with regex:

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search with Ignorecase
            searchObj = re.findall(string, line, re.M | re.I)

            if searchObj:
                print("search --> : ", searchObj)

                # Loop through searchObj and get n characters

Answer 1

From https://docs.python.org/2/library/re.html:

start([group])
end([group])
   Return the indices of the start and end of the substring matched by 
   group; group defaults to zero (meaning the whole matched substring). 
   Return -1 if group exists but did not contribute to the match. For a 
   match object m, and a group g that did contribute to the match, the 
   substring matched by group g (equivalent to m.group(g)) is


    m.string[m.start(g):m.end(g)]

    Note that m.start(group) will equal m.end(group) if group matched a 
    null string. For example, after m = re.search('b(c?)', 'cba'), 
    m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both    
    2, and m.start(2) raises an IndexError exception.

Using re.finditer you can get an iterator of MatchObject instances, and then use these methods to get the start and end of each matched substring.
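As a minimal sketch of that idea (the helper name keyword_context and the sample sentence are made up for illustration, and re.escape assumes the keyword is literal text rather than a regex pattern):

```python
import re

def keyword_context(line, keyword, n):
    """Return each occurrence of keyword in line with up to n characters on each side."""
    # re.escape treats the keyword as literal text (an assumption; drop it
    # if the keywords are meant to be regex patterns).
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    snippets = []
    for m in pattern.finditer(line):
        # Clamp the left edge at 0: a negative slice start would count
        # from the end of the string instead of the beginning.
        left = max(0, m.start() - n)
        # Slicing past the end of the string is safe, so no clamp is needed on the right.
        snippets.append(line[left:m.end() + n])
    return snippets

print(keyword_context("the quick brown fox jumps over the lazy dog", "fox", 4))
```

When the keyword sits closer than n characters to either end of the line, the snippet is simply shorter on that side.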

Answer 2

I got it working. Below is the code in case anyone needs it:

import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:

            # Case-insensitive regex search; finditer yields one match object per occurrence
            for match in re.finditer(string, line, re.M | re.I):

                # Start and end indices of the keyword
                start, end = match.span()

                # Keep only 'n' characters before and after the keyword;
                # max() stops a negative start index from wrapping around to the end of the line
                tmp = line[max(0, start - n):end + n] + '\n'
                print(tmp)
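Since the question was about efficiency, one possible variation is to compile all keywords into a single alternation so each line is scanned once instead of once per keyword. This is only a sketch under some assumptions: the keywords are literal text, lookup is not empty, and the names build_pattern, find_contexts and the sample lines are hypothetical.

```python
import re

def build_pattern(lookup):
    # Longest keywords first so that a longer keyword wins over a shorter
    # one that is its prefix (for example "foobar" over "foo").
    words = sorted(lookup, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)

def find_contexts(lines, lookup, n):
    """Yield (line_number, keyword_as_matched, snippet) for every occurrence."""
    pattern = build_pattern(lookup)
    for num, line in enumerate(lines, 1):
        for m in pattern.finditer(line):
            # Clamp at 0 so the slice never starts at a negative index.
            left = max(0, m.start() - n)
            yield num, m.group(0), line[left:m.end() + n]

sample = ["An error occurred here", "no match", "Warning: disk, then ERROR again"]
results = list(find_contexts(sample, ["error", "warning"], 3))
for item in results:
    print(item)
```

Any iterable of lines can be passed, including an open file object. Note that an alternation only reports non-overlapping matches, so if two keywords overlap in the text, only the first one found is returned.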

Answer 1 (score 1), 2015-10-26 01:57:44

Answer 2 (score -1, accepted), 2015-10-28 18:18:25