Counting the number of character differences between two files
I have two fairly large (~20 MB) txt files which are essentially just long strings of integers (only 0, 1, or 2). I would like to write a Python script which iterates through the files and compares them integer by integer. In the end I want the number of integers that are different and the total number of integers in the files (they should be exactly the same length). I have done some searching and it seems like difflib may be useful, but I am fairly new to Python and I am not sure whether anything in difflib will count the differences or the number of entries.
Any help would be greatly appreciated! What I am trying right now is the following, but it only looks at one entry and then terminates, and I don't understand why.
f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()
correct = 0
x = 0
total = 0
for i in fileOne:
    if i != fileTwo[x]:
        correct +=1
    x += 1
    total +=1
if total != 0:
    percent = (correct / total) * 100
    print "The file is %.1f %% correct!" % (percent)
    print "%i out of %i symbols were correct!" % (correct, total)
Not tested at all, but look at this as something a lot easier (and more Pythonic):
from itertools import izip
with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    data=[(1, x==y) for x, y in izip(f1.read(), f2.read())]
print sum(1.0 for t in data if t[1]) / len(data) * 100
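A likely reason the loop in the question stops after one entry is that readlines() returns one item per line, so a file that is a single long line gives a list with a single item. Reading the whole file as one string and pairing up characters avoids that. Here is a minimal sketch of the same idea in Python 3 (an assumption on my part, since the code in this thread is Python 2); the function names count_differences and compare_files are made up for illustration, and stripping a trailing newline is also an assumption about the files.

```python
def count_differences(a, b):
    """Return (differing, total) for two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length: %d vs %d" % (len(a), len(b)))
    # zip pairs the characters position by position
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing, len(a)


def compare_files(path_a, path_b):
    """Read both files whole (fine at ~20 MB) and compare their characters."""
    with open(path_a) as f1, open(path_b) as f2:
        # strip() drops surrounding whitespace such as a trailing newline,
        # so it is not counted as a symbol (assumption about the file layout)
        a, b = f1.read().strip(), f2.read().strip()
    return count_differences(a, b)


if __name__ == "__main__":
    differing, total = count_differences("0120120", "0121100")
    # prints: 2 of 7 symbols differ (28.6%)
    print("%d of %d symbols differ (%.1f%%)" % (differing, total, 100.0 * differing / total))
```

For the real files you would call compare_files("file1.txt", "file2.txt"). The length check raises an error rather than silently ignoring the extra characters, which is what zip would otherwise do.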
You can use enumerate to check the chars in your strings that don't match.
If all strings are guaranteed to be the same length:
with open("file1.txt","r") as f:
    l1 = f.readlines()
with open("file2.txt","r") as f:
    l2 = f.readlines()
non_matches = 0.
total = 0.
for i,j in enumerate(l1):
    non_matches += sum([1 for k,l in enumerate(j) if l2[i][k]!= l]) # add 1 for each non match
    total += len(j.split(","))
print non_matches,total*2
print non_matches / (total * 2) * 100. # if strings are all same length just mult total by 2
6 40
15.0
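If you also want to know where the differences are, the enumerate idea can be reduced to a small helper. This is only a sketch in Python 3 (an assumption, the code above is Python 2), the name mismatch_positions is made up for illustration, and it assumes both strings have the same length (a shorter second string would raise an IndexError).

```python
def mismatch_positions(a, b):
    """Return the indices at which equal-length strings a and b differ."""
    # enumerate yields (index, character) pairs for a; b is looked up by index
    return [i for i, ch in enumerate(a) if b[i] != ch]


positions = mismatch_positions("0120120", "0121100")
print(positions)        # [3, 5]
print(len(positions))   # 2, the number of differing symbols
```

Keeping the full list of positions costs memory in proportion to the number of mismatches, so if only the count is needed a plain running sum is lighter.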