Finding repeats in multiple lists read from CSV File (Python)

The title may seem confusing, but let's say I'm working with the following CSV file ('names.csv'):

    name1,name2,name3
    Bob,Jane,Joe
    Megan,Tom,Jane
    Jane,Joe,Rob

My question is: how would I write code that returns any string that occurs at least 3 times? Here the output should be 'Jane', because that name occurs 3 times. I'm really confused here. Perhaps some sample code would help me understand better?

So far I have:

    import csv
    reader = csv.DictReader(open("names.csv"))

    for row in reader:
        names = [row['name1'], row['name2'], row['name3']]
        print names

This prints:

    ['Bob', 'Jane', 'Joe']
    ['Megan', 'Tom', 'Jane']
    ['Jane', 'Joe', 'Rob']

Where do I go from here? Or am I going about this the wrong way? I'm really new to Python (well, to programming altogether), so I have close to no clue what I'm doing.

Cheers

I'd do it like this:

    >>> from collections import defaultdict
    >>> d = defaultdict(int)
    >>> rows = [['Bob', 'Jane', 'Joe'],
    ... ['Megan', 'Tom', 'Jane'],
    ... ['Jane', 'Joe', 'Rob']]
    >>> for row in rows:
    ...     for name in row:
    ...         d[name] += 1
    ...
    >>> filter(lambda x: x[1] >= 3, d.iteritems())
    [('Jane', 3)]

It uses a dict with a default value of 0 to count how many times each name occurs in the file, and then filters the dict by the required condition (count >= 3).
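The session above is Python 2 (`iteritems`, and `filter` returning a list). As a rough Python 3 sketch of the same count-then-filter idea, `collections.Counter` can stand in for the `defaultdict(int)`; the `threshold` and `repeats` names are only illustrative and are not from the original answer.

```python
from collections import Counter

rows = [
    ["Bob", "Jane", "Joe"],
    ["Megan", "Tom", "Jane"],
    ["Jane", "Joe", "Rob"],
]

# Flatten the rows and count every name in a single pass.
counts = Counter(name for row in rows for name in row)

# Keep only the names seen at least `threshold` times (illustrative name).
threshold = 3
repeats = [(name, n) for name, n in counts.items() if n >= threshold]
print(repeats)  # [('Jane', 3)]
```

`Counter` is a dict subclass, so the filtering step is the same as with a plain dict of counts.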

Putting it all together (and showing proper csv.reader usage):

    import csv
    import collections

    d = collections.defaultdict(int)
    with open("names.csv", "rb") as f:  # Python 3.x: use newline="" instead of "rb"
        reader = csv.reader(f)
        next(reader)  # skip the heading row
        for row in reader:
            for name in row:
                name = name.strip()
                if name:  # ignore empty cells
                    d[name] += 1
    # Python 3.x: use d.items() and print(name, count)
    morethan3 = [(name, count) for name, count in d.iteritems() if count >= 3]
    morethan3.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    for name, count in morethan3:
        print name, count
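For readers on Python 3, the script above might be ported roughly as follows. This is a sketch, not the answerer's code: the function name `names_at_least` and its `threshold` parameter are made up for illustration, and wrapping the logic in a function is my own choice so that it can be called on any path.

```python
import csv
from collections import defaultdict


def names_at_least(path, threshold=3):
    """Return (name, count) pairs for names occurring at least `threshold` times."""
    d = defaultdict(int)
    # In Python 3 the csv docs recommend opening the file with newline="".
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the heading row; None avoids an error on an empty file
        for row in reader:
            for name in row:
                name = name.strip()
                if name:  # ignore empty cells
                    d[name] += 1
    result = [(name, count) for name, count in d.items() if count >= threshold]
    result.sort(key=lambda pair: pair[1], reverse=True)  # most frequent first
    return result
```

Calling `names_at_least("names.csv")` on the file from the question should give `[('Jane', 3)]`.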

Update in response to comment:

You need to read through the whole CSV file whether you use the DictReader approach or not. If you want to ignore, e.g., the 'name2' column (not row), then ignore it. You don't need to save all the data, as your use of the variable name "rows" suggests. Here is code for a more general approach that doesn't rely on the column headings being in a particular order and allows selection/rejection of particular columns.

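    # Replaces the csv.reader lines inside the `with` block;
    # DictReader consumes the heading row itself.
    reader = csv.DictReader(f)
    required_columns = ['name1', 'name3'] #### adjust this line as needed ####
    for row in reader:
        for col in required_columns:
            name = row[col].strip()
            if name:
                d[name] += 1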
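The fragment above only makes sense inside the earlier script, so here is a self-contained Python 3 sketch of the column-selection idea. It reads the question's sample data from an in-memory string instead of a file; `count_columns` and `SAMPLE` are hypothetical names, and using `Counter` rather than `defaultdict(int)` is a substitution of mine.

```python
import csv
import io
from collections import Counter

# The sample data from the question, held in memory for the demonstration.
SAMPLE = """name1,name2,name3
Bob,Jane,Joe
Megan,Tom,Jane
Jane,Joe,Rob
"""


def count_columns(f, required_columns):
    """Count non-empty names found in the given columns of an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(f):  # DictReader uses the first row as headings
        for col in required_columns:
            # A short row yields None for a missing cell, so fall back to "".
            name = (row.get(col) or "").strip()
            if name:
                counts[name] += 1
    return counts


counts = count_columns(io.StringIO(SAMPLE), ["name1", "name3"])
print(counts["Jane"])  # 2
```

With 'name2' excluded, 'Jane' is counted only twice, so she would no longer pass a threshold of 3. That is worth keeping in mind when choosing which columns to include.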
