Python: compare rows in big files

I need to compare two .csv files (each is over 65000 lines) and find the lines that are not in the second file. I am using difflib.ndiff:

import difflib
for line in difflib.ndiff(text1, text2):
    print(line)

But I get unexpected results. The function finds two identical strings and marks them as different:

+ Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
- Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
  1. What could be the problem?
  2. What would be a suitable way to find the differences?

2.

from itertools import zip_longest  # named izip_longest in Python 2

# Read both files and strip surrounding whitespace from every line
with open('test1.txt') as f1, open('test2.txt') as f2:
    l1 = [line.strip() for line in f1]
    l2 = [line.strip() for line in f2]

# Pair lines by position; the shorter file is padded with None
for a, b in zip_longest(l1, l2):
    print('%s %s %s' % (a or '', '==' if a == b else '!=', b or ''))

I tried to use the code above to compare the files, but I got the same unexpected result. Why is this so?

Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,!=Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,
Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,!=Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,
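
The two output lines above show the same two records in swapped order, which suggests the files hold the same rows at different positions. Both difflib.ndiff and zip_longest are order-sensitive, so a row that merely moved is reported as different. If the goal is only "which lines of the first file never occur in the second", a set lookup ignores order. The following is a minimal sketch under that assumption; the function name and the third sample row are hypothetical, made up for illustration.

```python
def lines_only_in_first(first_lines, second_lines):
    """Return the stripped lines of first_lines that never occur in second_lines.

    Order of first_lines is preserved; position in the files is ignored.
    """
    seen = {line.strip() for line in second_lines}
    return [line.strip() for line in first_lines if line.strip() not in seen]


# Hypothetical sample data: the same two rows in swapped order,
# plus one invented extra row that exists only in the first file.
file1 = [
    "Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,\n",
    "Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,\n",
    "Gr4,DQ_5Gb_1m_DR_926_23490,100,,,70,,\n",
]
file2 = [
    "Gr4,DQ_3Gb_1m_DR_926_23489,100,,,70,,\n",
    "Gr4,DQ_1Gb_1m_DR_926_23486,100,,,70,,\n",
]

missing = lines_only_in_first(file1, file2)
print(missing)
```

Open file objects can be passed directly, since they iterate line by line, for example `with open('test1.txt') as f1, open('test2.txt') as f2: lines_only_in_first(f1, f2)`. Only the second file's distinct lines are held in the set, which is modest for files of this size.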

This is easy when you use pandas. Since you haven't provided the dataset, I'll use my own.

Assume I have two CSV files.

[image]

The data looks like this:

[image]

Now print the line that is not present in the second file (the benz model is not present in the second file):

[image]
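
The code for this step is not available as text, so the following is only a sketch of one way to do it with pandas, not necessarily the approach used in the answer. It assumes both files share the same columns; the column names and rows are hypothetical stand-ins. A left merge with `indicator=True` labels each row of the first frame as `left_only` or `both`.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (names and values are assumptions)
csv1 = io.StringIO("brand,model\naudi,a4\nbmw,x5\nmercedes,benz\n")
csv2 = io.StringIO("brand,model\naudi,a4\nbmw,x5\n")

df1 = pd.read_csv(csv1)
df2 = pd.read_csv(csv2)

# Left-join on all shared columns. indicator=True adds a "_merge" column that
# says whether a row was found only in the left frame or in both frames.
# Duplicates are dropped from df2 so matching rows are not multiplied.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Keep rows that exist only in the first file, then drop the helper column
only_in_first = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_first)
```

With real files, the `io.StringIO` objects would be replaced by the file paths passed to `pd.read_csv`.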

