Search and return lines in a file with matching keyword in Python

I have used regex to isolate a particular keyword in a line taken from a file. I want to search the entire file and return groups of lines which have the same keyword.

I am a little confused about this, and I was wondering if there is a direct regex way to do this in Python.

For example, my file may look like this:

1  0001    1   UG  science,ee;YEAR=onefour;standard->2;district->9
2  0002    1   UG  science,cs;YEAR=onefive;standard->1;district->9
3  0012    2   UG  science,eng;YEAR=onefour;standard->3;district->4
4  0021    2   UG  science,ee;YEAR=onetwo;standard->2;district->9
5  0056    4   UG  science,cs;YEAR=onefive;standard->1;district->8
6  0145    3   UG  science,eng;YEAR=onetwo;standard->4;district->2

I used regex to extract "YEAR=****" and want to group the lines according to the value of "****".

The output should look like this:

1  0001    1   UG  science,ee;YEAR=onefour;standard->2;district->9
3  0012    2   UG  science,eng;YEAR=onefour;standard->3;district->4

2  0002    1   UG  science,cs;YEAR=onefive;standard->1;district->9
5  0056    4   UG  science,cs;YEAR=onefive;standard->1;district->8

4  0021    2   UG  science,ee;YEAR=onetwo;standard->2;district->9
6  0145    3   UG  science,eng;YEAR=onetwo;standard->4;district->2

I believe I can do it the long way: opening the file, storing lines in dictionaries, matching, and so on. But I would like to know if there is a short, concise way of doing this.

As requested, here is a bit of code I tried to write and run:

#!/usr/bin/python

import re

##open file and read each line of file

dfile = open("datafile.txt","r")

##regex to find YEAR in entry and return YEAR

regex_unique = re.compile(r'(?<=\bYEAR=)[^;]+')

list_Name =[]

for line in dfile:
    match1 = re.search(regex_unique,line)
    if match1:
        if match1.group(0) not in list_Name:
            list_Name.append(match1.group(0))


## print (list_Name)

for item in list_Name:
    for line in dfile:
        match2 = re.search(item,line)
        if match2:
            print (match2)

The last bit does not seem to work. I am assuming that if I give item to re.search, it should search for that word in the entire file. Now I think I might have to add some wildcard entries before and after the actual word to make it work.

I think I'm right in saying that regex only deals with matches within lines, not with how to aggregate matches, so you'll need to do that yourself. You can keep things simple by writing your own utility function and keeping it separate from your application code.

Grouping operations in general must pass over all of the items to assemble the groups. Your problem can't be solved without making one pass over all of the data to collect the groups, and then another pass to output the groups.
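
On the subject of passes: the second loop in the question most likely prints nothing because a file object is an iterator, and the first "for line in dfile" loop has already consumed it. No wildcards are needed, since re.search scans the whole line. The sketch below illustrates this with io.StringIO standing in for the real file (an assumption made so the example is self-contained); the variable names are illustrative only.

```python
import io
import re

# Stand-in for open("datafile.txt"); io.StringIO iterates like a text file.
dfile = io.StringIO(
    "1  0001    1   UG  science,ee;YEAR=onefour;standard->2;district->9\n"
    "2  0002    1   UG  science,cs;YEAR=onefive;standard->1;district->9\n"
    "3  0012    2   UG  science,eng;YEAR=onefour;standard->3;district->4\n"
)

first_pass = [line for line in dfile]    # consumes the iterator
second_pass = [line for line in dfile]   # nothing left to read

dfile.seek(0)                            # rewind to read the data again
lines = [line for line in dfile]         # or keep this list and reuse it

# Collect the distinct YEAR values in order of first appearance.
year_re = re.compile(r'(?<=\bYEAR=)[^;]+')
years = []
for line in lines:
    m = year_re.search(line)
    if m and m.group(0) not in years:
        years.append(m.group(0))

# Looping over the list (not the file) can be repeated as often as needed.
grouped = [[line for line in lines if "YEAR=" + year + ";" in line]
           for year in years]
```

Note that this rescans every line once per distinct value, which is why collecting the groups in a single pass is the better fit here. Also, printing match2 shows the match object; to see the line itself you would print line.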

A dictionary of lists is the natural data structure for collecting each line under a key (as you note). Doing this by hand is a little kludgey, since you frequently need to test whether a key exists to know whether you should append to an existing list or create a new one. Fortunately, Python provides defaultdict, which lets you do this:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> d[key].append(line)
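
To make the difference concrete, here is a small comparison of the manual key test and defaultdict. The pairs list is made-up data for illustration, not taken from the file above.

```python
from collections import defaultdict

# Hypothetical (key, line) pairs, for illustration only.
pairs = [("onefour", "line 1"), ("onefive", "line 2"), ("onefour", "line 3")]

# Plain dict: the key has to be tested for before every append.
manual = {}
for key, line in pairs:
    if key not in manual:
        manual[key] = []
    manual[key].append(line)

# defaultdict(list): a missing key gets a fresh empty list automatically.
auto = defaultdict(list)
for key, line in pairs:
    auto[key].append(line)
```

Both end up with the same contents; the defaultdict version just drops the existence check.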

So, you can do the following:

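import re
from collections import defaultdict

def groupLinesByMatch(filename, regex):
    """Group the lines of a file by the first capture group of regex."""
    regex = re.compile(regex)
    result = defaultdict(list)

    with open(filename) as f:
        for line in f:
            # search(), not match(): the keyword sits in the middle of the line
            matches = regex.search(line)
            if matches:
                result[matches.group(1)].append(line)

    return result.values()


# Example pattern for the data above; the capture group is the grouping key
for lines in groupLinesByMatch("datafile.txt", r'\bYEAR=([^;]+)'):
    for line in lines:
        print(line, end='')
    print()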
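
For reference, here is a self-contained sketch of the same idea run against the sample data from the question. It assumes Python 3.7+ (where dicts keep insertion order), and the names SAMPLE, YEAR_RE and group_lines_by_year are hypothetical, chosen for this example. It takes any iterable of lines rather than a filename so it can run without a file on disk.

```python
import re
from collections import defaultdict

# Sample data copied from the question.
SAMPLE = """\
1  0001    1   UG  science,ee;YEAR=onefour;standard->2;district->9
2  0002    1   UG  science,cs;YEAR=onefive;standard->1;district->9
3  0012    2   UG  science,eng;YEAR=onefour;standard->3;district->4
4  0021    2   UG  science,ee;YEAR=onetwo;standard->2;district->9
5  0056    4   UG  science,cs;YEAR=onefive;standard->1;district->8
6  0145    3   UG  science,eng;YEAR=onetwo;standard->4;district->2
"""

# The capture group holds the value after "YEAR=" up to the next ";".
YEAR_RE = re.compile(r'\bYEAR=([^;]+)')

def group_lines_by_year(lines):
    """Return a dict mapping each YEAR value to the lines containing it."""
    groups = defaultdict(list)
    for line in lines:
        m = YEAR_RE.search(line)
        if m:
            groups[m.group(1)].append(line)
    return groups

groups = group_lines_by_year(SAMPLE.splitlines(keepends=True))

# Second pass: print each group followed by a blank line.
for year, lines in groups.items():
    print("".join(lines))
```

Groups come out in the order each YEAR value first appears in the input, which matches the layout asked for in the question.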
