Match two strings exactly except at places where there is a particular string in Python
I have a master file containing some text, say:
file contains x
the image is of x type
the user is admin
the address is x
Then there are 200 other text files containing text like this:
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
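I need to match these files against the master. The result should be true if a file contains exactly the same text as the master file, except that x differs from file to file: the "x" in the master file can be anything in the other files and the result is still true. This is what I came up with: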
arr=master.split('\n')
for file in files:
    a=[]
    file1=file.split('\n')
    i=0
    for line in arr:
        line_list=line.split()
        indx=line_list.index('x')
        line_list1=line_list[:indx]+line_list[indx+1:]
        st1=' '.join(line_list1)
        file1_list=file1[i].split()
        file1_list1=file1_list[:indx]+file1_list[indx+1:]
        st2=' '.join(file1_list1)
        if st1!=st2:
            a.append(line)
        i+=1
This is very inefficient. Is there a way to map the files against the master file and generate the differences found in the other files?
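I know this is not a real solution, but you can check whether a file has the same format like this:
if "the image is of" in var:
    pass  # to do
By checking the remaining lines in the same way
"file contains"
"the user is"
"the address is"
you will be able to verify, to some extent, that the file being checked is valid.
You can check this link to learn more about this "substring" idea.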
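The substring check above can be sketched as follows. This is only an illustration: the names EXPECTED_PHRASES and has_expected_format are made up here, and it assumes the fixed phrases are known in advance and appear in the same line order as in the master file.

```python
# Hypothetical phrase list, copied by hand from the master file in the question.
EXPECTED_PHRASES = [
    "file contains",
    "the image is of",
    "the user is",
    "the address is",
]

def has_expected_format(text):
    """Return True if line i of text contains EXPECTED_PHRASES[i]."""
    lines = text.strip().split("\n")
    if len(lines) != len(EXPECTED_PHRASES):
        return False
    return all(phrase in line for phrase, line in zip(EXPECTED_PHRASES, lines))

sample = (
    "file contains xyz\n"
    "the image is of abc type\n"
    "the user is admin\n"
    "the address is pqrs"
)
print(has_expected_format(sample))
```

Note that this only checks the fixed phrases; anything may follow them on the line, so it validates the format rather than proving an exact match.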
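Is that "generic" placeholder unique on the line? For example, if the key is indeed x, are you guaranteed that x appears nowhere else in the line? Or could the master file have something like
excluding x records and x axis values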
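If you have a unique key ...
For each line of the master file, split the line on the key x. This gives you two pieces of the line, a front and a back. Then simply check whether the line from the other file startswith the front part and endswith the back part. Something like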
for line in arr:
    # assumes x_key occurs exactly once in the master line
    front, back = line.split(x_key)
    # grab next line in input file
    ...
    if line_list1.startswith(front) and line_list1.endswith(back):
        pass  # process matching line
    else:
        pass  # process non-matching line
See the documentation.
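A runnable sketch of the front/back idea, under a few assumptions that are not in the original answer: the key appears at most once per master line and never inside another word, lines without the key must match exactly, and both files have their lines in the same order. The function names line_matches and file_matches are invented for this example, and str.partition is used instead of split so that lines without the key do not raise an error.

```python
def line_matches(master_line, line, x_key="x"):
    """Check one candidate line against one master line."""
    front, sep, back = master_line.partition(x_key)
    if not sep:
        # no placeholder on this master line: require an exact match
        return line == master_line
    return (
        len(line) >= len(front) + len(back)  # front and back must not overlap
        and line.startswith(front)
        and line.endswith(back)
    )

def file_matches(master_text, text, x_key="x"):
    """Check every line of text against the master line in the same position."""
    master_lines = master_text.splitlines()
    lines = text.splitlines()
    return len(master_lines) == len(lines) and all(
        line_matches(m, l, x_key) for m, l in zip(master_lines, lines)
    )

master = "file contains x\nthe image is of x type\nthe user is admin\nthe address is x"
other = "file contains xyz\nthe image is of abc type\nthe user is admin\nthe address is pqrs"
print(file_matches(master, other))
```

With this approach the placeholder may stand for several words, or for nothing at all; if it must be exactly one word, a word-by-word comparison is a better fit.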
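Update, per the OP's comment
As long as x is unique within the line, you can easily adapt this. As you mention in your comment, you want something like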
if len(line) == len(line_list1):
    if all(line[i] == line_list1[i] for i in range(len(line))):
        pass  # Found matching lines
    else:
        pass  # Advance to the next line
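One way to adapt the element-by-element comparison so that it also reports the differences the question asks for is sketched below. It is an illustration only: differing_lines is a made-up name, the placeholder is assumed to stand for exactly one whitespace-separated word, and lines are assumed to be in the same order in both files.

```python
from itertools import zip_longest

def differing_lines(master_text, text, x_key="x"):
    """Return (line_number, master_line, line) for every line that does not match."""
    diffs = []
    pairs = zip_longest(master_text.splitlines(), text.splitlines(), fillvalue="")
    for n, (m_line, line) in enumerate(pairs, start=1):
        m_words, words = m_line.split(), line.split()
        # a master word equal to x_key matches any single word
        same = len(m_words) == len(words) and all(
            m == x_key or m == w for m, w in zip(m_words, words)
        )
        if not same:
            diffs.append((n, m_line, line))
    return diffs

master = "file contains x\nthe image is of x type\nthe user is admin\nthe address is x"
other = "file contains xyz\nthe image is of abc type\nthe user is root\nthe address is pqrs"
print(differing_lines(master, other))
# prints [(3, 'the user is admin', 'the user is root')]
```

An empty result means the file matches the master; missing or extra lines show up as differences against an empty string.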
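I think this is an approach that does what you are asking for. It also lets you specify whether only the same difference is allowed on each line (which would treat your second file example as a non-match):
Update: this accounts for lines in the master file and the other files not necessarily being in the same order.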
from itertools import zip_longest

def get_min_diff(master_lines, to_check):
    # returns (number of differing words, the differing words taken from
    # to_check, index of the closest master line);
    # (None, [], None) if master_lines is empty
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        diff = [c for m, c in zip_longest(ml, to_check) if m != c]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            best_diff = diff  # keep the diff that belongs to the best match
            match_line = ln
    return min_diff, best_diff, match_line

def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())
    matches = []
    for f in files:
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find the remaining master line closest to the current
                # line and the places where the two differ
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff is None:
                    # the file has more lines than the master: not a match
                    diff_sizes.add(float("inf"))
                    continue
                if min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                    # if it only differs in one place, keep track of what
                    # word was different for optional check later
                    if min_diff == 1:
                        diff_types.add(diff[0])
                diff_sizes.add(min_diff)
        # if you want any file where the max number of differences
        # per line was at most 1
        if diff_sizes and max(diff_sizes) <= 1:
            # consider a match if there is at most one difference per line
            matches.append(f)
        # if you instead want each file to only
        # be different by the same word on each line
        # if len(diff_types) == 1:
        #     matches.append(f)
    return matches
Based on the examples you provided, I made some test files to check against:
::::::::::::::
test1.txt
::::::::::::::
file contains y
the image is of y type
the user is admin
the address is y
::::::::::::::
test2.txt
::::::::::::::
file contains x
the image is of x type
the user is admin
the address is x
::::::::::::::
test3.txt
::::::::::::::
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
::::::::::::::
testmaster.txt
::::::::::::::
file contains m
the image is of m type
the user is admin
the address is m
::::::::::::::
test_nomatch.txt
::::::::::::::
file contains y and some other stuff
the image is of y type unlike the other
the user is bongo the clown
the address is redacted
::::::::::::::
test_scrambled.txt
::::::::::::::
the image is of y type
file contains y
the address is y
the user is admin
When run, the code above returns the correct files:
In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']