Multiple contingent iteration lists through csv file - Python

I'm on a Windows 7 x64 workstation running Python 2.7.3.

I have a CSV file containing rows of item IDs, with each row belonging to a group ID, like so:

GroupID ItemID
a   1
a   2
a   3
b   4
b   5
b   6
c   7
c   8
c   9
etc…    

What I need to do is generate a list of tuples, wherein each tuple holds the GroupID as a string and a list of the ItemIDs associated with that GroupID, like so:

[('a', [1, 2, 3]), ('b', [4, 5, 6]), ('c', [7, 8, 9])]

So far I've thought of using a function or a list-to-set conversion to de-duplicate the GroupID column, then looping through the rows a second time and comparing each row's GroupID with an if statement. Could anyone give me some advice, please? Thanks!

You are looking for itertools.groupby():

Make an iterator that returns consecutive keys and groups from the iterable. The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged. Generally, the iterable needs to already be sorted on the same key function.
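The last sentence of that quote matters in practice: groupby() only merges runs of adjacent equal keys. A minimal sketch of the difference, using a made-up list of letters rather than the CSV data:

```python
from itertools import groupby

# Hypothetical sample data: the two 'a' values are not adjacent.
letters = ['a', 'b', 'a']

# Without sorting, each run of equal values becomes its own group,
# so 'a' shows up twice.
unsorted_keys = [key for key, _ in groupby(letters)]

# After sorting, equal values are adjacent and form a single group.
sorted_keys = [key for key, _ in groupby(sorted(letters))]

print(unsorted_keys)  # ['a', 'b', 'a']
print(sorted_keys)    # ['a', 'b']
```

The sample in the question already has all rows of a group next to each other, so no sorting step should be needed there.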

For example:

import csv
from itertools import groupby
from operator import itemgetter

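with open("test.csv") as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    # Group consecutive rows by their first column (the GroupID)
    data = groupby(reader, itemgetter(0))
    # For each group, keep only the second column (the ItemID) of every row
    print([(key, [item for _, item in items]) for key, items in data])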

We combine this with an operator.itemgetter() to say we want to group by the first item in the row, then we use a nested list comprehension to extract the data we want.

Which gives us:

[('a', ['1', '2', '3']), ('b', ['4', '5', '6']), ('c', ['7', '8', '9'])]

Naturally, unless you need a list, it is better to use a generator expression here to do the operation lazily. (We use a list comprehension here only to get nice output.)
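As a rough sketch of the lazy variant, the outer list comprehension can become a generator expression. The lines list below is a hypothetical stand-in for the contents of test.csv (csv.reader accepts any iterable of strings), and the int() conversion is an addition to match the integer ItemIDs asked for in the question:

```python
import csv
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the lines of test.csv.
lines = ["GroupID,ItemID", "a,1", "a,2", "a,3", "b,4", "b,5", "b,6"]

reader = csv.reader(lines)
next(reader)  # Skip the header row

# Generator expression: no data rows are read until iteration starts.
# The inner list is still built eagerly, one group at a time, which
# consumes each group before groupby() advances to the next key.
groups = (
    (key, [int(item) for _, item in rows])
    for key, rows in groupby(reader, itemgetter(0))
)

result = []
for key, items in groups:
    result.append((key, items))

print(result)  # [('a', [1, 2, 3]), ('b', [4, 5, 6])]
```

When reading from a real file, the generator has to be consumed while the file is still open, i.e. inside the with block.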

Note that I assume your file is comma separated like you say, not as shown in your example. If it's tab separated, use csv.reader(file, dialect=csv.excel_tab) to parse it correctly.
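For the tab-separated case, a small sketch along the same lines might look like this. The tab_lines list is made-up sample data standing in for the file; with a real file you would pass the open file object instead:

```python
import csv
from itertools import groupby
from operator import itemgetter

# Hypothetical tab-separated lines standing in for the file contents.
tab_lines = ["GroupID\tItemID", "a\t1", "a\t2", "b\t4"]

# csv.excel_tab is the built-in dialect that uses a tab as the delimiter.
tab_reader = csv.reader(tab_lines, dialect=csv.excel_tab)
next(tab_reader)  # Skip the header row

tab_grouped = [(key, [item for _, item in rows])
               for key, rows in groupby(tab_reader, itemgetter(0))]

print(tab_grouped)  # [('a', ['1', '2']), ('b', ['4'])]
```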

If rows with the same grouping key are always adjacent (i.e. the data is already ordered by key), then something like:

from itertools import groupby
from operator import itemgetter

data = [('a', 1), ('a', 2), ('b', 3), ('b', 5)]

# list() keeps the result a list on Python 3 as well, where map() returns an iterator
grouped = [(k, list(map(itemgetter(1), g))) for k, g in groupby(data, itemgetter(0))]
# [('a', [1, 2]), ('b', [3, 5])]

Otherwise, use a collections.defaultdict.
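A minimal sketch of the defaultdict approach, assuming the rows can arrive in any order. The unsorted_data list is hypothetical, and the final sorted() call is only there to give a predictable order of groups:

```python
from collections import defaultdict

# Hypothetical sample data where equal keys are not adjacent.
unsorted_data = [('a', 1), ('b', 3), ('a', 2), ('b', 5)]

# Missing keys start out as an empty list, so append() always works.
groups_by_key = defaultdict(list)
for key, item in unsorted_data:
    groups_by_key[key].append(item)

# Convert to the list-of-tuples shape asked for in the question.
grouped_unsorted = sorted(groups_by_key.items())

print(grouped_unsorted)  # [('a', [1, 2]), ('b', [3, 5])]
```

This makes a single pass over the data and does not require sorting it first, at the cost of holding every group in memory at once.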
