How do I delete rows in one CSV based on rows in another CSV using specific columns to compare

Question

I have two CSV files, CSV1 and CSV2, each with multiple columns and rows. One of the columns is titled ID.

What I want to do is go through CSV1's rows and, if a row's ID matches an ID in CSV2's ID column, keep that row. If there's no match, I want to delete the row from CSV1.

Basically, the data in CSV1 is relevant to me, but only for the people who are in CSV2. The unique connection between the two is the ID column, so I need to check every row in CSV1 to see whether its ID appears in some row of CSV2.

Here's what I have so far.

import csv
smarteeCSV = open("Smartee.csv", "r")
aeriesCSV = open("aeriesEditable.csv", "r+")

aeries = csv.reader(aeriesCSV, delimiter=',')##CSV1
smartee = csv.reader(smarteeCSV, delimiter=',')##CSV2    

for row in aeries:
    for item in smartee
    if row[1] != item[1]##indexes for the columns with the ids

I can already tell I'm not on the right track, so can anyone assist?

Answer 1

You could extract all of the IDs from the second file into a set and check each row of the first file against that set.

For example:

# extract ID column from CSV file 2 into a set
Ids = { row[1] for row in smartee }

# pick only rows whose ID is in Ids 
filtered_rows = [item for item in aeries if item[1] in Ids]
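A self-contained version of the same idea might look like the sketch below. The function name filter_rows_by_id, its parameters, and the separate output file are hypothetical and only for illustration; the default column index 1 follows the question's code. It assumes Python 3. If both files start with a header row, that row survives because its ID cell matches the header cell collected from CSV2.

```python
import csv


def filter_rows_by_id(csv1_path, csv2_path, out_path, id_index=1):
    """Copy the rows of CSV1 whose ID also appears in CSV2 to out_path.

    Returns the number of rows written.
    """
    # Collect every ID from CSV2 into a set for fast membership tests.
    with open(csv2_path, newline='') as f2:
        ids = {row[id_index] for row in csv.reader(f2) if len(row) > id_index}

    kept = 0
    with open(csv1_path, newline='') as f1, \
            open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for row in csv.reader(f1):
            # Skip rows too short to have an ID cell, keep matching ones.
            if len(row) > id_index and row[id_index] in ids:
                writer.writerow(row)
                kept += 1
    return kept
```

Writing to a new file rather than editing CSV1 in place keeps the original intact until the result has been checked.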

Answer 2

Depending on what you plan on doing with the relevant rows of data, you might be able to use Python's built-in filter() function to do what you need:

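import csv

# First collect the IDs from CSV2.
# (Python 3: open CSV files in text mode with newline=''; Python 2 used 'rb'.)
with open('Smartee.csv', newline='') as smarteeCSV:  # CSV2
    ids = set(row['ID'] for row in csv.DictReader(smarteeCSV, delimiter=','))

with open('aeriesEditable.csv', newline='') as aeriesCSV:  # CSV1
    # In Python 3, filter() returns a lazy iterator rather than a list,
    # so list() forces the read while the file is still open.
    relevant = list(filter(lambda row: row['ID'] in ids,
                           csv.DictReader(aeriesCSV, delimiter=',')))

# relevant is now a list containing the desired rows from CSV1

If you want to process the rows iteratively instead, use a for loop directly over the result of filter() inside the with block (in Python 2, where filter() builds a list, itertools.ifilter() gives the same lazy behaviour).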
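One way to sketch that iterative variant is a small generator that yields matching rows one at a time. The name iter_relevant_rows is hypothetical, and the sketch assumes Python 3 and a header row that contains an ID column.

```python
import csv


def iter_relevant_rows(csv1_file, ids):
    """Yield rows (as dicts) from an open CSV1 file whose 'ID' is in ids."""
    for row in csv.DictReader(csv1_file):
        if row['ID'] in ids:
            yield row
```

The caller would open the file, loop over iter_relevant_rows(f, ids) inside the with block, and handle each row as it arrives, so the full list of rows never has to be held in memory.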

Answer 3

First, read CSV2 to make just a set of IDs:

import csv

# CSV1 and CSV2 are assumed to hold the two file paths as strings.
with open(CSV2) as f:
    r = csv.DictReader(f)
    theids = set(row['ID'] for row in r)

Then, while reading CSV1, just check if the ID is in the set:

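with open(CSV1) as f, open(CSV1 + '.new', 'w') as out:
    r = csv.DictReader(f)  # read from f, the open CSV1 file
    w = csv.DictWriter(out, r.fieldnames)
    w.writeheader()  # DictWriter does not write the header row on its own
    for row in r:
        if row['ID'] in theids:
            w.writerow(row)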

This assumes the CSV files are suitable for dict-based reading and writing (i.e. the first row is a list of column names), but it is easy to adapt if the column names come from somewhere else.
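As a rough illustration of that adaptation, the sketch below handles files with no header row by passing the column names in explicitly. The function name filter_headerless and its parameters are hypothetical; it assumes Python 3 and that every row has exactly as many cells as the supplied field list.

```python
import csv


def filter_headerless(csv1_file, csv2_file, out_file,
                      csv1_fields, csv2_fields, id_field='ID'):
    """Write rows of csv1_file whose ID appears in csv2_file to out_file.

    Column names are supplied by the caller because the files themselves
    have no header row.
    """
    # With fieldnames given, DictReader treats the first line as data.
    ids = {row[id_field]
           for row in csv.DictReader(csv2_file, fieldnames=csv2_fields)}

    writer = csv.DictWriter(out_file, fieldnames=csv1_fields)
    for row in csv.DictReader(csv1_file, fieldnames=csv1_fields):
        if row[id_field] in ids:
            writer.writerow(row)
```

No header is written here, so the output stays in the same headerless layout as the input.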

3 answers

solution1
3 ACCEPTED 2015-01-14 22:28:42

solution2
1 2015-01-14 23:50:04

solution3
0 2015-01-14 22:37:36
