Merge two CSVs with unique columns in Python
I have two CSV files representing data from two different years. I know how to do the basic merging using the csv writer and dict keys, but here is the problem: while the CSVs share most column headers, each may also have columns of its own. If a species was caught in one year but not the other, its column is present only in that year's file. How can I merge the new data into the old data, creating the new columns and padding the old data with zeros in those columns?
File 1: "Date","Time","Species A","Species B","Species X"
File 2: "Date","Time","Species A","Species B","Species C"
I need the end result to be one CSV with this header: "Date","Time","Species A","Species B","Species C","Species X"
Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:
import pandas as pd
df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")
df = pd.concat([df1, df2]).fillna(0)
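# Put Date and Time first, then the species columns alphabetically (independent of concat's column order).
df = df[["Date", "Time"] + sorted(df.columns.difference(["Date", "Time"]))]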
df.to_csv("merged_fish.csv", index=False)
Explanation:
First, we read in the two files:
>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
Date Time Species A Species B Species X
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
>>> df2
Date Time Species A Species B Species C
0 16 17 18 19 20
1 21 22 23 24 25
2 26 27 28 29 30
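Then we simply concatenate them, which automatically fills the missing data with NaN. (The output below comes from an older pandas release, which sorted the columns alphabetically; newer releases keep the columns in order of first appearance unless you pass sort=True.)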
>>> df = pd.concat([df1, df2])
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 NaN 5 2
1 6 8 9 NaN 10 7
2 11 13 14 NaN 15 12
0 16 18 19 20 NaN 17
1 21 23 24 25 NaN 22
2 26 28 29 30 NaN 27
You want them filled with 0 instead, so:
>>> df = pd.concat([df1, df2]).fillna(0)
>>> df
Date Species A Species B Species C Species X Time
0 1 3 4 0 5 2
1 6 8 9 0 10 7
2 11 13 14 0 15 12
0 16 18 19 20 0 17
1 21 23 24 25 0 22
2 26 28 29 30 0 27
This order isn't quite the one you asked for, though. You wanted Date and Time first, so:
>>> df = df[["Date", "Time"] + sorted(df.columns.difference(["Date", "Time"]))]
>>> df
Date Time Species A Species B Species C Species X
0 1 2 3 4 0 5
1 6 7 8 9 0 10
2 11 12 13 14 0 15
0 16 17 18 19 20 0
1 21 22 23 24 25 0
2 26 27 28 29 30 0
And then we save it as a CSV file:
>>> df.to_csv("merged_fish.csv", index=False)
producing
Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
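The steps above can be wrapped into one function. This is only a sketch under a few assumptions: the in-memory CSV strings stand in for fish1.csv and fish2.csv, the helper name merge_catches is made up for illustration, and every species count is a whole number (so casting back to int is safe and the file shows 0 rather than 0.0).

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for fish1.csv and fish2.csv.
csv_a = "Date,Time,Species A,Species B,Species X\n1,2,3,4,5\n6,7,8,9,10\n"
csv_b = "Date,Time,Species A,Species B,Species C\n16,17,18,19,20\n21,22,23,24,25\n"


def merge_catches(frames, key_cols=("Date", "Time")):
    """Stack the frames, fill missing species with 0, and order the columns."""
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2.
    merged = pd.concat(frames, ignore_index=True).fillna(0)
    # Every column that is not a key column is treated as a species column.
    species = sorted(c for c in merged.columns if c not in key_cols)
    # NaN forced these columns to float; cast back (assumes whole-number counts).
    merged = merged.astype({c: int for c in species})
    # Key columns first, then the species columns alphabetically.
    return merged[list(key_cols) + species]


merged = merge_catches([pd.read_csv(io.StringIO(csv_a)),
                        pd.read_csv(io.StringIO(csv_b))])
csv_text = merged.to_csv(index=False)
print(csv_text)
```

Because the column order is computed from the names rather than from positions, the result should not depend on whether the installed pandas sorts columns during concat.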
Here's a csv module solution in Python 3:
import csv
# Generate some data...
csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''
csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''
with open('2012.csv','w') as f:
    f.write(csv1)
with open('2013.csv','w') as f:
    f.write(csv2)

# The actual program
years = ['2012.csv','2013.csv']
lines = []
headers = set()
for year in years:
    with open(year,'r',newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r)) # Merge lines from all files.
        headers = headers.union(r.fieldnames) # Collect unique column names.

# Sort the unique headers keeping Date,Time columns first.
new_headers = ['Date','Time'] + sorted(headers - set(['Date','Time']))

with open('result.csv','w',newline='') as f:
    # restval (the 3rd parameter) is the default if the key isn't present.
    w = csv.DictWriter(f,new_headers,restval=0)
    w.writeheader()
    w.writerows(lines)

# View the result
with open('result.csv') as f:
    print(f.read())
Output:
Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3
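If the yearly files are large, the same idea can be done in two passes so that the rows are never all held in memory at once. The following is a rough sketch, not a drop-in replacement: the function name merge_csvs, its parameters, and the sample data are made up for illustration.

```python
import csv
import os
import tempfile


def merge_csvs(paths, out_path, key_cols=("Date", "Time"), fill=0):
    """Merge CSV files with differing columns without loading all rows at once."""
    # Pass 1: read only the header row of each file to collect column names.
    extra = set()
    for path in paths:
        with open(path, newline="") as f:
            header = next(csv.reader(f), [])
            extra.update(h for h in header if h not in key_cols)
    fieldnames = list(key_cols) + sorted(extra)

    # Pass 2: stream rows straight into the output; restval fills the gaps.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames, restval=fill)
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f:
                writer.writerows(csv.DictReader(f))
    return fieldnames


# Demo with throwaway files in a temporary directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    p2012 = os.path.join(tmp, "2012.csv")
    p2013 = os.path.join(tmp, "2013.csv")
    result = os.path.join(tmp, "result.csv")
    with open(p2012, "w", newline="") as f:
        f.write("Date,Time,Species A,Species C\n04/01/2012,13:00,1,3\n")
    with open(p2013, "w", newline="") as f:
        f.write("Date,Time,Species A,Species X\n04/01/2013,13:00,1,3\n")
    merge_csvs([p2012, p2013], result)
    with open(result, newline="") as f:
        merged_rows = list(csv.reader(f))

print(merged_rows)
```

If the real headers contain a space after each comma, as in the question, passing skipinitialspace=True to both csv.reader and csv.DictReader should keep the column names from picking up stray spaces and quotes.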
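According to the documentation, it looks like you should be able to read both files, merge the keys from the two resulting dictionaries, and then use the fieldnames and restval parameters of csv.DictWriter to get the default value of 0.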
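To see the restval behaviour in isolation, here is a minimal sketch that writes to an in-memory buffer. The two row lists are invented and stand in for what csv.DictReader would yield from the two files.

```python
import csv
import io

# Hypothetical rows, as csv.DictReader would yield them from the two files.
rows_a = [{"Date": "1", "Time": "2", "Species X": "5"}]
rows_b = [{"Date": "16", "Time": "17", "Species C": "20"}]

# Union of the keys from both sets of rows, with Date and Time kept first.
keys = set().union(*(row.keys() for row in rows_a + rows_b))
fieldnames = ["Date", "Time"] + sorted(keys - {"Date", "Time"})

buffer = io.StringIO()
# restval is written for any field name a row dictionary does not contain.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval=0)
writer.writeheader()
writer.writerows(rows_a + rows_b)
output = buffer.getvalue()
print(output)
```

Each row ends up with a 0 in the species column it was missing, without the rows themselves being modified.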