Check multiple TSV files and drop all the shared rows from each TSV in Python

I have three TSV files.

file 1:

1   Alice   24      
10  Bill    23
4   Ellen   24
9   Mike    30

file 2:

6  Julie   76
2  Bob     42
7  Tom     54
5  Frank   30
1  Alice   24

file 3:

3  Dave    68
8  Jerry   34
1  Alice   24
5  Frank   30
2  Bob     42

OUTPUT: My desired output is to drop every row whose first and second column values also appear in any other of those TSV files, and to keep the other rows as they are.

file 1:

10  Bill    23
9   Mike    30
4   Ellen   24

file 2:

6  Julie   76
7  Tom     54

file 3:

3  Dave    68
8  Jerry   34

My TSV files have no header row. I have tried the following code so far.

with open('file2.tsv') as check_file:
    check_set = set([row.split('\t')[0].strip().upper() for row in check_file])

with open('file1.tsv', 'r') as in_file, open('file3.tsv', 'w') as out_file:
    for line in in_file:
        if line.split('\t')[0].strip().upper() in check_set:
            out_file.write(line)

But I did not get my desired three output files with this code. Any help will be appreciated. Thanks in advance.

You first need to read all your TSV files and count each occurrence of the values in the first two columns. Python's Counter() can be used for this (it is a dictionary subclass).
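As a small illustration of the counting step, here is a sketch that uses in-memory rows instead of files. The sample rows and the helper name first_two are made up for this example and are not part of the solution itself.

```python
from collections import Counter

def first_two(row):
    """Return the first two tab-separated fields of a row as a tuple."""
    fields = [field.strip() for field in row.split('\t')]
    return tuple(fields[:2])

# Hypothetical rows; in the real script they come from the TSV files
rows = ["1\tAlice\t24\n", "10\tBill\t23\n", "1\tAlice\t24\n"]

# Each distinct (first, second) pair becomes a key, the value is how often it was seen
counts = Counter(first_two(row) for row in rows)
```

A key that was never seen simply counts as 0, so looking it up does not raise a KeyError.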

Whilst reading each row in, save it in a data dictionary where the keys are the filenames and the values are lists holding the first two column values along with the raw row. A defaultdict() is used so that a missing filename entry is created automatically the first time something is appended to it.
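To show what defaultdict(list) saves you, here is a minimal sketch with hand-written entries. The filenames and rows are only sample values taken from the question.

```python
from collections import defaultdict

data = defaultdict(list)

# No need to check whether 'file1.tsv' is already a key:
# the empty list is created on first access
data['file1.tsv'].append((('1', 'Alice'), '1\tAlice\t24\n'))
data['file1.tsv'].append((('10', 'Bill'), '10\tBill\t23\n'))
data['file2.tsv'].append((('1', 'Alice'), '1\tAlice\t24\n'))
```

With a plain dict, the first append for each filename would raise a KeyError unless the list were created beforehand.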

After reading everything in, counts can be used to check whether the key of any given row has been seen only once. Rows whose key appears more than once are skipped.

from collections import Counter, defaultdict

counts = Counter()      # hold counts of each first two value pairs
data = defaultdict(list)  # hold all data from all files

for tsv in ['file1.tsv', 'file2.tsv', 'file3.tsv']:
    with open(tsv) as f_tsv:
        for row in f_tsv:
            split = list(map(str.strip, row.split('\t')))
            key = tuple(split[:2])  # first and second column values
            counts[key] += 1
            data[tsv].append((key, row))

for tsv, key_rows in data.items():
    with open('x' + tsv, 'w') as f_tsv:
        for key, row in key_rows:
            if counts[key] == 1:
                f_tsv.write(row)

I would recommend you add print() statements to better understand what each of the variables holds, e.g. print(counts) and print(data).

Note: take out the 'x' + when ready. It was added so that the output files are written to slightly different filenames, which avoids overwriting the original files whilst testing.
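If you want to try the approach without touching your real files, one option is to wrap the same logic in a function and run it on the sample data from the question inside a temporary directory. This is only a sketch: the function name drop_shared_rows and its prefix parameter are assumptions made for this example.

```python
import os
import tempfile
from collections import Counter, defaultdict

def drop_shared_rows(paths, prefix='x'):
    """Write a filtered copy of each file, keeping only rows whose
    first two column values occur once across all the files."""
    counts = Counter()
    data = defaultdict(list)
    for path in paths:
        with open(path) as f_in:
            for row in f_in:
                key = tuple(field.strip() for field in row.split('\t')[:2])
                counts[key] += 1
                data[path].append((key, row))

    out_paths = []
    for path, key_rows in data.items():
        folder, name = os.path.split(path)
        out_path = os.path.join(folder, prefix + name)  # e.g. xfile1.tsv
        with open(out_path, 'w') as f_out:
            f_out.writelines(row for key, row in key_rows if counts[key] == 1)
        out_paths.append(out_path)
    return out_paths

# Sample data copied from the question
samples = {
    'file1.tsv': "1\tAlice\t24\n10\tBill\t23\n4\tEllen\t24\n9\tMike\t30\n",
    'file2.tsv': "6\tJulie\t76\n2\tBob\t42\n7\tTom\t54\n5\tFrank\t30\n1\tAlice\t24\n",
    'file3.tsv': "3\tDave\t68\n8\tJerry\t34\n1\tAlice\t24\n5\tFrank\t30\n2\tBob\t42\n",
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(text)
        paths.append(path)
    # Read the filtered copies back before the directory is removed
    for out_path in drop_shared_rows(paths):
        with open(out_path) as f:
            results[os.path.basename(out_path)] = f.read()
```

The results dictionary then maps each output filename to its filtered contents, which you can print or compare against the output you expect. Rows are kept in their original order within each file.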
