Finding repeats in multiple lists read from CSV File (Python)

Question

Title seems confusing, but let's say I'm working with the following CSV file ('names.csv').

    name1,name2,name3
    Bob,Jane,Joe
    Megan,Tom,Jane
    Jane,Joe,Rob

My question is: how would I write code that returns every string that occurs at least 3 times? Here the output should be 'Jane', because it occurs 3 times. I'm really confused, so perhaps some sample code would help me understand.

So far I have:

    import csv
    reader = csv.DictReader(open("names.csv"))

    for row in reader:
        names = [row['name1'], row['name2'], row['name3']]
        print names

This returns:

    ['Bob', 'Jane', 'Joe']
    ['Megan', 'Tom', 'Jane']
    ['Jane', 'Joe', 'Rob']

Where do I go from here? Or am I going about this wrong? I'm really new to Python (well, programming altogether), so I have close to no clue what I'm doing.

Cheers

Answer 1

I'd do it like this:

    >>> from collections import defaultdict
    >>> d = defaultdict(int)
    >>> rows = [['Bob', 'Jane', 'Joe'],
    ...         ['Megan', 'Tom', 'Jane'],
    ...         ['Jane', 'Joe', 'Rob']]
    >>> for row in rows:
    ...     for name in row:
    ...         d[name] += 1
    ...
    >>> filter(lambda x: x[1] >= 3, d.iteritems())
    [('Jane', 3)]

It uses a dict with a default value of 0 to count how many times each name occurs in the file, then filters the dict on the condition count >= 3. (This is Python 2; in Python 3, use d.items() instead of d.iteritems() and wrap the filter call in list().)
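
On Python 2.7 or later, `collections.Counter` can do the same counting without the explicit loops. This is a minimal sketch in Python 3 syntax, using the same hard-coded `rows` as above; the name `repeated` is only illustrative and not part of the original answer.

```python
from collections import Counter
from itertools import chain

rows = [['Bob', 'Jane', 'Joe'],
        ['Megan', 'Tom', 'Jane'],
        ['Jane', 'Joe', 'Rob']]

# Flatten the rows into one stream of names and count each name.
counts = Counter(chain.from_iterable(rows))

# Keep only the names seen at least 3 times.
repeated = [(name, count) for name, count in counts.items() if count >= 3]
print(repeated)  # [('Jane', 3)]
```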

Answer 2

Putting it all together (and showing proper csv.reader usage):

    import csv
    import collections

    d = collections.defaultdict(int)
    with open("names.csv", "rb") as f:  # Python 3.x: use newline="" instead of "rb"
        reader = csv.reader(f)
        reader.next()  # ignore useless heading row (Python 3.x: next(reader))
        for row in reader:
            for name in row:
                name = name.strip()
                if name:
                    d[name] += 1

    # Keep names seen at least 3 times, most frequent first.
    morethan3 = [(name, count) for name, count in d.iteritems() if count >= 3]
    morethan3.sort(key=lambda x: x[1], reverse=True)
    for name, count in morethan3:
        print name, count

Update in response to comment:

You need to read through the whole CSV file whether you use the DictReader approach or not. If you want to ignore, e.g., the 'name2' column (not row), then ignore it. You don't need to save all the data, as your use of the variable name "rows" suggests. Here is code for a more general approach that doesn't rely on the column headings being in a particular order and allows selection or rejection of particular columns.

    reader = csv.DictReader(f)
    required_columns = ['name1', 'name3'] #### adjust this line as needed ####
    for row in reader:
        for col in required_columns:
            name = row[col].strip()
            if name:
                d[name] += 1
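
As a self-contained illustration of the column-selection idea, here is a Python 3 sketch. The function name `count_columns` is hypothetical, and the data is read from an in-memory string rather than from names.csv so that it runs as-is.

```python
import csv
import io
from collections import Counter

def count_columns(f, required_columns):
    """Count non-empty values in the chosen columns of a CSV file object."""
    reader = csv.DictReader(f)  # the first row supplies the column names
    counts = Counter()
    for row in reader:
        for col in required_columns:
            # DictReader fills missing trailing cells with None, so guard for it.
            name = (row[col] or "").strip()
            if name:
                counts[name] += 1
    return counts

sample = io.StringIO(
    "name1,name2,name3\n"
    "Bob,Jane,Joe\n"
    "Megan,Tom,Jane\n"
    "Jane,Joe,Rob\n"
)
# Ignore the 'name2' column, as in the fragment above.
counts = count_columns(sample, ['name1', 'name3'])
print(counts['Jane'])  # 2
```

With 'name2' excluded, 'Jane' is counted only twice, so no name reaches the threshold of 3 for this sample.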

solution1
0 2011-05-07 08:37:07

solution2
0 ACCEPTED 2011-05-07 11:15:26
