
Python regex to get n characters before and after a keyword in a line of text

I'm trying to parse through a file and search each line for keywords from a list of strings. I need to return the 'n' characters before and after each occurrence. I have it working without regex, but it's not very efficient. Any idea how to do the same with regex and findall? lookup is a list of strings. This is what I have without regex:

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            if string in line:

                # Split the line in 2 substrings
                tmp1 = line.split(string)[0]
                tmp2 = line.split(string)[1]

                # Truncate only 'n' characters before and after the keyword
                tmp = tmp1[-n:] + string + tmp2[:n]

                # Do something here...

This is my start with regex:

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:
            # Regex search with Ignorecase
            searchObj = re.findall(string, line, re.M | re.I)

            if searchObj:
                print "search --> : ", searchObj

                # Loop through searchObj and get n characters

From https://docs.python.org/2/library/re.html:

start([group])
end([group])
    Return the indices of the start and end of the substring matched by
    group; group defaults to zero (meaning the whole matched substring).
    Return -1 if group exists but did not contribute to the match. For a
    match object m, and a group g that did contribute to the match, the
    substring matched by group g (equivalent to m.group(g)) is

        m.string[m.start(g):m.end(g)]

    Note that m.start(group) will equal m.end(group) if group matched a
    null string. For example, after m = re.search('b(c?)', 'cba'),
    m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both
    2, and m.start(2) raises an IndexError exception.
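
The example in the documentation excerpt can be checked directly. This is only a small verification sketch (the variable names whole, empty and same are mine); it should behave the same under Python 2 and Python 3.

```python
import re

m = re.search('b(c?)', 'cba')

# Group 0 is the whole match: the 'b' at index 1.
whole = (m.start(0), m.end(0))

# Group 1 took part in the match but matched the empty string,
# so its start and end indices coincide.
empty = (m.start(1), m.end(1))

# Slicing m.string with start/end gives the same text as m.group().
same = m.string[m.start(0):m.end(0)] == m.group(0)

print(whole, empty, same)
```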

Using re.finditer you can generate an iterator of match objects (MatchObject) and then use these methods to get the start and end of each matched substring.
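
A minimal sketch of that suggestion, assuming a made-up sample line and keyword (both are illustrative and not taken from the question). Each match object yielded by re.finditer carries its own position, which the plain strings returned by re.findall do not.

```python
import re

line = "The quick brown fox jumps over the lazy fox."  # sample text (assumption)
keyword = "fox"                                         # sample keyword (assumption)

# finditer yields one match object per non-overlapping occurrence;
# re.escape makes the keyword match literally.
positions = [(m.start(), m.end(), m.group())
             for m in re.finditer(re.escape(keyword), line, re.I)]

for start, end, text in positions:
    print(start, end, text)
```

With the indices in hand, the surrounding characters can be taken by slicing the original line.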

I got it to work. Below is the code in case anyone needs it:

import re

with open(file, 'r') as temp:
    for num, line in enumerate(temp, 1):
        for string in lookup:

            # Case-insensitive search; finditer yields one match object per occurrence
            for match in re.finditer(string, line, re.M | re.I):

                # Start and end indices of the keyword within the line
                start, end = match.span()

                # Keep only 'n' characters before and after the keyword.
                # max() stops a negative start index from wrapping around to the end of the line.
                tmp = line[max(0, start - n):end + n] + '\n'
                print(tmp)
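
The same idea can be wrapped in a reusable function. This is a sketch under two assumptions: the keywords are meant to match literally (hence re.escape), and the helper name keyword_contexts is hypothetical, not part of the original post. The left bound is clamped at 0 because a negative slice start would count from the end of the line.

```python
import re

def keyword_contexts(line, keywords, n):
    """Hypothetical helper: return each keyword occurrence in `line`
    with up to n characters of context on either side."""
    contexts = []
    for keyword in keywords:
        # re.escape makes characters such as '.' or '+' match literally.
        pattern = re.compile(re.escape(keyword), re.I)
        for match in pattern.finditer(line):
            # Clamp at 0: a negative slice start would count from the end.
            start = max(0, match.start() - n)
            # A stop index past the end of the string is fine for slicing.
            contexts.append(line[start:match.end() + n])
    return contexts

print(keyword_contexts("abc KEY def key", ["key"], 2))
```

Compiling the pattern once per keyword avoids re-parsing it for every match. If the entries in lookup are meant to be regular expressions rather than literal keywords, the re.escape call should be dropped.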
