
Merge 2 csv files with python

I have 2 CSV files as follows:

File1.csv:

Name, Email
Jon, jon@email.com
Roberto, roberto@email.com
Mona, mona@email.com
James, james@email.com

File2.csv:

Email
mona@email.com
james@email.com

What I want is File1.csv without the rows whose email appears in File2.csv, i.e. File3.csv (the output) should look as follows:

File3.csv:

Name, Email
Jon, jon@email.com
Roberto, roberto@email.com

What is the simplest way to code this in Python?

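dont_need_em = []
# Collect the emails to exclude, skipping the header line
with open("File2.csv", 'r') as fn:
    for line in fn:
        if not line.startswith("Email"):
            dont_need_em.append(line.rstrip())

# Copy every line of File1.csv whose email is not in the exclusion list
with open("File1.csv", 'r') as fn, open("File3.csv", 'w') as fw:
    for line in fn:
        if line.rstrip().split(", ")[1] not in dont_need_em:
            fw.write(line)  # keep the original newline so rows stay on separate lines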

This should do it, but I am sure there are much simpler solutions.

EDIT: Create the third file

Using Pandas you can do this:

import pandas as pd

# Read both files into data frames, using the first row as column names
file1 = pd.read_csv('File1.csv', header=0, skipinitialspace=True)
file2 = pd.read_csv('File2.csv', header=0, skipinitialspace=True)

# Keep only the rows of file 1 whose email is not contained in file 2
cleaned = file1[~file1["Email"].isin(file2["Email"])]

# Write the result to CSV with the original headers
cleaned.to_csv("File3.csv", index=False)

Here's a good way to do that (it's very similar to the above, but writes the remainder to a file rather than printing it):

removed = []
with open("File2.csv", 'r') as f2:
    for line in f2:
        if not line.startswith("Email"):
            removed.append(line.rstrip())


with open("File1.csv", 'r') as f1:
    with open("File3.csv", 'w') as f3:
        for line in f1:
            if line.rstrip().split(", ")[1] not in removed:
                f3.write(line)

How this works: The first block reads all the emails you want to filter out into a list. The second block then opens your original file and creates a new file for whatever is left. It reads each line of the first file and writes it to the third file only if its email isn't in the list of emails to filter out. The header line is copied as well, because "Email" is not in that list.
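The same two steps can also be sketched with the standard csv module, looking the email column up by its header name instead of by position and storing the excluded addresses in a set. This is only a minimal sketch: the function name remove_listed_emails and its parameters are made up for illustration, and it assumes both files are well formed and have a header row containing an "Email" column.

```python
import csv


def remove_listed_emails(source_path, exclude_path, output_path, column="Email"):
    """Copy source_path to output_path, skipping rows whose `column`
    value appears in exclude_path. Hypothetical helper for illustration."""
    # Step 1: collect the addresses to drop. A set makes each lookup cheap.
    with open(exclude_path, newline="") as exclude_file:
        reader = csv.DictReader(exclude_file, skipinitialspace=True)
        excluded = {(row.get(column) or "").strip() for row in reader}

    # Step 2: copy the header and every row whose address is not excluded.
    with open(source_path, newline="") as source_file, \
            open(output_path, "w", newline="") as output_file:
        reader = csv.DictReader(source_file, skipinitialspace=True)
        writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if (row.get(column) or "").strip() not in excluded:
                writer.writerow(row)
```

With the files from the question, it would be called as remove_listed_emails("File1.csv", "File2.csv", "File3.csv"). Because the header is parsed rather than compared as text, the column order in File1.csv does not matter.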

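If you are on a UNIX system, you can delegate the filtering to grep:

#! /usr/bin/env python
import subprocess

def filter(input_file, filter_file, out_file):
    # -v keeps only the lines that do NOT match; -F treats each pattern as a fixed string.
    # Caveat: the "Email" header in filter_file also matches the header of input_file,
    # so the header row is dropped from the output as well.
    subprocess.call("grep -vF -f '%s' '%s' > '%s' " % (filter_file, input_file, out_file), shell=True)

filter("File1.csv", "File2.csv", "File3.csv")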

The following should do what you are looking for. First read File2.csv into a set of email addresses to be skipped. Then read File1.csv row by row, writing only rows which are not in the skip list:

import csv

with open('File2.csv', 'r') as file2:
    skip_list = set(line.strip() for line in file2.readlines()[1:])

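# In Python 3 the csv module needs text-mode files opened with newline='' (Python 2 used 'rb'/'wb')
with open('File1.csv', 'r', newline='') as file1, open('File3.csv', 'w', newline='') as file3: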
    csv_file1 = csv.reader(file1, skipinitialspace=True)
    csv_file3 = csv.writer(file3)
    csv_file3.writerow(next(csv_file1))    # Write the header line

    for cols in csv_file1:
        if cols[1] not in skip_list:
            csv_file3.writerow(cols)

This would give you the following output in File3.csv:

Name,Email
Jon,jon@email.com
Roberto,roberto@email.com
