Using python (acora) to find lines containing keywords
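I'm writing a program that reads a directory of text files and finds a specific combination of strings that overlap (i.e. are shared among all the files). My current approach is to take one file from this directory, parse it, build a list of every string combination, and then search the other files for each combination. For example, if I had ten files, I'd read one file, parse it, store the keywords I need, and then search the other nine files for this combination. I'd repeat this for every file (making sure a file is never searched against itself). To do this, I'm trying to use Python's acora module.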
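The "each file against all the others" loop described above can be sketched as follows. This is only an illustration: the helper name `each_against_others` and the file names in the usage note are hypothetical and not part of the original program.

```python
def each_against_others(paths):
    """Yield (source, others) pairs so that no file is searched against itself."""
    paths = list(paths)
    for i, source in enumerate(paths):
        # Every path except the one at position i.
        others = paths[:i] + paths[i + 1:]
        yield source, others
```

With three hypothetical files, `list(each_against_others(["a.txt", "b.txt", "c.txt"]))` pairs `"a.txt"` with `["b.txt", "c.txt"]`, and so on. Each pair could then be passed to something like `find_overlaps(source, others, f_out)`.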
The code I have so far is:
from acora import AcoraBuilder

def match_lines(f, *keywords):
    """Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()
    line_start = 0
    matches = False
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield f[line_start:pos]
                matches = False
            line_start = pos + 1
        else:
            matches = True
    if matches:
        yield f[line_start:]
def find_overlaps(f_in, fl_in, f_out):
    """f_in: input file to extract string combo from & use to search other files.
    fl_in: list of other files to search against.
    f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
    """
    # Open the first file, read each line & build a list of tuples (string #1, string #2).
    # The "build_list" function isn't shown in my pasted code.
    string_list = build_list(f_in)
    # List to hold all the lines (and file names, from fl_in) that are found
    # to have the matching (string #1, string #2).
    found_lines = []
    for keywords in string_list:  # For each tuple (string #1, string #2) in the list of tuples
        for f in fl_in:  # For each file in the input file list
            for line in match_lines(f, *keywords):
                found_lines.append(line)
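As you can probably tell, I used the match_lines function from #3 of the "FAQs and Recipes" section of the web page. I also adapted it to parse files (using ac.filefind()), following the pattern shown on the same page.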
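The code seems to work, but it only gives me the names of the files that contain a matching string combination. What I want as output is the entire line, written out from each of the other files that contains my matching string combination (tuple).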
I don't see what would produce file names here, as you describe.
Anyway, to get the line numbers, you just need to count the lines as they pass by in match_lines():
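from acora import AcoraBuilder

def match_lines(f, *keywords):
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()
    line_start = 0
    line_number = 0  # 0-based; every '\r' or '\n' counts as one line break
    matches = False
    # Read the file contents so that lines are sliced from the text, not from the path.
    with open(f, 'r') as fh:
        text = fh.read()
    # Assumes the offsets reported by filefind() line up with indices into `text`.
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        yield line_number, text[line_start:]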
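Note that in the question's version `f` is a path, so `f[line_start:pos]` slices the path string rather than the file contents, which may be where the file name fragments came from. As a cross-check for the acora-based generator, here is a minimal stdlib-only sketch that yields the same kind of `(line_number, line)` pairs from text that has already been read. It swaps acora for plain substring tests, so it is a sanity check rather than a replacement. The name `match_lines_plain` and the `require_all` flag are hypothetical additions for illustration.

```python
def match_lines_plain(text, *keywords, require_all=False):
    """Yield (line_number, line) for lines of `text` that contain the keywords.

    Line numbers are 0-based. With require_all=False a line matches if it
    contains any keyword; with require_all=True it must contain all of them.
    """
    check = all if require_all else any
    # Note: splitlines() treats '\r\n' as a single line break.
    for line_number, line in enumerate(text.splitlines()):
        if check(kw in line for kw in keywords):
            yield line_number, line
```

Passing `require_all=True` restricts the output to lines that contain every string of a tuple, which seems closer to the "matching string combination" the question asks for.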