I have two CSV files, as follows:
File1.csv:
Name, Email
Jon, jon@email.com
Roberto, roberto@email.com
Mona, mona@email.com
James, james@email.com
File2.csv:
Email
mona@email.com
james@email.com
What I want is File1.csv without the rows whose email appears in File2.csv, i.e. File3.csv (the output) should look as follows:
File3.csv:
Name, Email
Jon, jon@email.com
Roberto, roberto@email.com
What is the simplest way to code this in Python?
dont_need_em = []
with open("file2.csv", 'r') as fn:
    for line in fn:
        # Skip the header row; collect every email to exclude
        if not line.startswith("Email"):
            dont_need_em.append(line.rstrip())
fw = open("file3.csv", 'w')
with open("file1.csv", 'r') as fn:
    for line in fn:
        if line.rstrip().split(", ")[1] not in dont_need_em:
            # Write the line unchanged so its trailing newline is kept
            fw.write(line)
fw.close()
This should do it, but I am sure there are much simpler solutions.
EDIT: Create the third file
Using Pandas you can do this:
import pandas as pd
#Read two files into data frame using column names from first row
file1=pd.read_csv('File1.csv',header=0,skipinitialspace=True)
file2=pd.read_csv('File2.csv',header=0,skipinitialspace=True)
#Only return lines in file 1 if the email is not contained in file 2
cleaned=file1[~file1["Email"].isin(file2["Email"])]
#Output file to CSV with original headers
cleaned.to_csv("File3.csv", index=False)
Here's a good way to do that (it's very similar to the above, but writes the remainder to a file rather than printing):
removed = []
with open("file2.csv", 'r') as f2:
    for line in f2:
        # Skip the header row; collect every email to filter out
        if not line.startswith("Email"):
            removed.append(line.rstrip())
with open("file1.csv", 'r') as f1:
    with open("file3.csv", 'w') as f3:
        for line in f1:
            if line.rstrip().split(", ")[1] not in removed:
                f3.write(line)
How this works: the first block reads all the emails you want to filter out into a list. The second block opens your original file and creates a new file for what's left. It reads each line from the first file and writes it to the third file only if its email isn't in the filter list.
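The two steps described above can also be sketched on in-memory data, which makes them easy to try without touching any files. This is only an illustration: the function name `filter_rows` and the sample lists are assumptions, not part of the original answer, and a set is swapped in for the list so that each membership check does not scan every email.

```python
def filter_rows(rows, unwanted_emails):
    """Return the header plus every row whose email is not in unwanted_emails.

    Assumes each row looks like "Name, Email" (comma followed by one space),
    as in the File1.csv sample from the question.
    """
    unwanted = set(unwanted_emails)  # set lookups avoid scanning a list each time
    header, *body = rows
    kept = [header]  # always keep the header row
    for line in body:
        email = line.rstrip().split(", ")[1]
        if email not in unwanted:
            kept.append(line)
    return kept


# Sample data mirroring File1.csv and File2.csv (without its header)
file1_lines = [
    "Name, Email\n",
    "Jon, jon@email.com\n",
    "Roberto, roberto@email.com\n",
    "Mona, mona@email.com\n",
    "James, james@email.com\n",
]
file2_emails = ["mona@email.com", "james@email.com"]

result = filter_rows(file1_lines, file2_emails)
print("".join(result), end="")
```

With real files, `rows` could be the result of `readlines()` on the first file, and the returned list could be passed to `writelines()` on the output file.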
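If you are on a UNIX system, you can hand the filtering to grep:
#! /usr/bin/env python
import subprocess
import sys

def filter(input_file, filter_file, out_file):
    # -v inverts the match, so lines matching filter_file are dropped rather than kept;
    # -F treats each pattern as a literal string instead of a regular expression.
    # Note: the "Email" header in filter_file also matches the input's header row.
    subprocess.call("grep -v -F -f '%s' '%s' > '%s' " % (filter_file, input_file, out_file), shell=True)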
The following should do what you are looking for. First read File2.csv into a set of email addresses to be skipped. Then read File1.csv row by row, writing only rows whose email is not in the skip list:
import csv

with open('File2.csv', 'r') as file2:
    # Skip the header line and collect the emails to exclude
    skip_list = set(line.strip() for line in file2.readlines()[1:])

# Python 3: open csv files in text mode with newline='' (Python 2 needs 'rb'/'wb' instead)
with open('File1.csv', 'r', newline='') as file1, open('File3.csv', 'w', newline='') as file3:
    csv_file1 = csv.reader(file1, skipinitialspace=True)
    csv_file3 = csv.writer(file3)
    csv_file3.writerow(next(csv_file1))  # Write the header line
    for cols in csv_file1:
        if cols[1] not in skip_list:
            csv_file3.writerow(cols)
This would give you the following output in File3.csv:
Name,Email
Jon,jon@email.com
Roberto,roberto@email.com