Merging two CSV files with unique columns in Python

Question

I have two CSV files representing data from two different years. I know how to do the basic merging using csvwriter and dictkeys, but the problem lies here: while the CSVs have mostly shared column headers, each may have unique columns. If a species was caught in one year but not the other, its column is present only in that year's file. How can I merge the new data into the old data, creating new columns and padding the old data with zeros in those columns?

File 1: "Date","Time","Species A","Species B", "Species X"

File 2: "Date","Time", "Species A", "Species B", "Species C"

I need the end result to be one CSV with this header: "Date","Time","Species A","Species B", "Species C", "Species X"

Answer 1

Someone else will probably post a solution using the csv module, so I'll give a pandas solution for comparison purposes:

import pandas as pd

df1 = pd.read_csv("fish1.csv")
df2 = pd.read_csv("fish2.csv")

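# sort=True orders the combined columns alphabetically; missing cells become NaN, then 0.
df = pd.concat([df1, df2], sort=True).fillna(0)
# Put Date and Time first, followed by the species columns in sorted order.
df = df[["Date", "Time"] + sorted(df.columns.difference(["Date", "Time"]))]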
df.to_csv("merged_fish.csv", index=False)

Explanation:

First, we read in the two files:

>>> df1 = pd.read_csv("fish1.csv")
>>> df2 = pd.read_csv("fish2.csv")
>>> df1
   Date  Time  Species A  Species B  Species X
0     1     2          3          4          5
1     6     7          8          9         10
2    11    12         13         14         15
>>> df2
   Date  Time  Species A  Species B  Species C
0    16    17         18         19         20
1    21    22         23         24         25
2    26    27         28         29         30

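Then we simply concatenate them, which automatically fills the missing data with NaN. Passing sort=True orders the combined columns alphabetically; recent pandas versions otherwise keep the columns in order of first appearance:

>>> df = pd.concat([df1, df2], sort=True)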
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4        NaN          5     2
1     6          8          9        NaN         10     7
2    11         13         14        NaN         15    12
0    16         18         19         20        NaN    17
1    21         23         24         25        NaN    22
2    26         28         29         30        NaN    27

You want them filled with 0 instead, so:

>>> df = pd.concat([df1, df2], sort=True).fillna(0)
>>> df
   Date  Species A  Species B  Species C  Species X  Time
0     1          3          4          0          5     2
1     6          8          9          0         10     7
2    11         13         14          0         15    12
0    16         18         19         20          0    17
1    21         23         24         25          0    22
2    26         28         29         30          0    27

This column order isn't quite the one you asked for, though. You wanted Date and Time first, so we select them explicitly and append the remaining columns in sorted order:

>>> df = df[["Date", "Time"] + sorted(df.columns.difference(["Date", "Time"]))]
>>> df
   Date  Time  Species A  Species B  Species C  Species X
0     1     2          3          4          0          5
1     6     7          8          9          0         10
2    11    12         13         14          0         15
0    16    17         18         19         20          0
1    21    22         23         24         25          0
2    26    27         28         29         30          0

And then we save it as a CSV file:

>>> df.to_csv("merged_fish.csv", index=False)

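producing the file below. The Species C and Species X columns are written as floats (0.0, 20.0) because the NaN placeholders promoted them from integer to float before they were filled: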

Date,Time,Species A,Species B,Species C,Species X
1,2,3,4,0.0,5.0
6,7,8,9,0.0,10.0
11,12,13,14,0.0,15.0
16,17,18,19,20.0,0.0
21,22,23,24,25.0,0.0
26,27,28,29,30.0,0.0
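
If whole numbers are wanted in the output instead of 0.0 and 20.0, one option is to fill and cast only the species columns. The following is a minimal sketch, not part of the original answer: it assumes every species column holds integer counts, and the in-memory strings csv_a and csv_b are hypothetical stand-ins for fish1.csv and fish2.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for fish1.csv and fish2.csv.
csv_a = "Date,Time,Species A,Species B,Species X\n1,2,3,4,5\n6,7,8,9,10\n"
csv_b = "Date,Time,Species A,Species B,Species C\n16,17,18,19,20\n21,22,23,24,25\n"

df1 = pd.read_csv(io.StringIO(csv_a))
df2 = pd.read_csv(io.StringIO(csv_b))

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
df = pd.concat([df1, df2], ignore_index=True, sort=False)

key_cols = ["Date", "Time"]
species_cols = sorted(col for col in df.columns if col not in key_cols)

# Fill the gaps with 0, then cast the species columns back to integers
# (assumption: species columns only ever contain whole-number counts).
df = df.fillna({col: 0 for col in species_cols})
df = df.astype({col: int for col in species_cols})

# Date and Time first, then the species columns in sorted order.
df = df[key_cols + species_cols]

merged_text = df.to_csv(index=False)
print(merged_text)
```

Selecting the columns by name rather than by position keeps the reordering independent of whatever column order concat happens to produce.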

Answer 2

Here's a csv module solution in Python 3:

import csv

# Generate some data...

csv1 = '''\
Date,Time,Species A,Species B,Species C
04/01/2012,13:00,1,2,3
04/02/2012,13:00,1,2,3
04/03/2012,13:00,1,2,3
04/04/2012,13:00,1,2,3
'''

csv2 = '''\
Date,Time,Species A,Species B,Species X
04/01/2013,13:00,1,2,3
04/02/2013,13:00,1,2,3
04/03/2013,13:00,1,2,3
04/04/2013,13:00,1,2,3
'''

with open('2012.csv','w') as f:
    f.write(csv1)
with open('2013.csv','w') as f:
    f.write(csv2)

# The actual program

years = ['2012.csv','2013.csv']

lines = []
headers = set()
for year in years:
    with open(year,'r',newline='') as f:
        r = csv.DictReader(f)
        lines.extend(list(r))                 # Merge lines from all files.
        headers = headers.union(r.fieldnames) # Collect unique column names.

# Sort the unique headers, keeping the Date and Time columns first.
new_headers = ['Date','Time'] + sorted(headers - {'Date','Time'})

with open('result.csv','w',newline='') as f:
    # restval is the value written when a row has no entry for a column.
    w = csv.DictWriter(f,new_headers,restval=0)
    w.writeheader()
    w.writerows(lines)

# View the result

with open('result.csv') as f:
    print(f.read())

Output:

Date,Time,Species A,Species B,Species C,Species X
04/01/2012,13:00,1,2,3,0
04/02/2012,13:00,1,2,3,0
04/03/2012,13:00,1,2,3,0
04/04/2012,13:00,1,2,3,0
04/01/2013,13:00,1,2,0,3
04/02/2013,13:00,1,2,0,3
04/03/2013,13:00,1,2,0,3
04/04/2013,13:00,1,2,0,3
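
One caveat when applying this to the headers shown in the question: some of them have a space after the comma (for example `, "Species X"`). With the csv module's default settings that space and the quotes become part of the column name, so the same species could end up in two differently named columns. The sketch below only illustrates the effect on the File 1 header from the question; `skipinitialspace` is a standard csv formatting parameter that `csv.DictReader` also accepts.

```python
import csv
import io

# Header line of File 1 exactly as written in the question.
header_line = '"Date","Time","Species A","Species B", "Species X"\n'

# Default parsing: the space after the comma is kept, and the quotes that
# follow it are treated as ordinary characters.
default_fields = next(csv.reader(io.StringIO(header_line)))

# skipinitialspace=True drops whitespace right after the delimiter, so the
# quoted field is recognised normally.
clean_fields = next(csv.reader(io.StringIO(header_line), skipinitialspace=True))

print(repr(default_fields[-1]))
print(repr(clean_fields[-1]))
```

In the program above this would mean writing `csv.DictReader(f, skipinitialspace=True)`, assuming the real files contain the same stray spaces as the headers quoted in the question.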

Answer 3

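According to the documentation, it looks like you should be able to read both files, merge the keys from the two resulting dictionaries, and then use the fieldnames and restval parameters of csv.DictWriter to get a default value of 0.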
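
A minimal sketch of that approach, written as two passes so the rows never have to be held in memory all at once. The helper names `merged_fieldnames` and `merge_csv` are hypothetical, and in-memory strings stand in for the two year files; with real files the same calls would be made on opened file objects.

```python
import csv
import io


def merged_fieldnames(header_rows, first=("Date", "Time")):
    """Union of all column names: key columns first, the rest sorted."""
    names = set()
    for header in header_rows:
        names.update(header)
    return list(first) + sorted(names - set(first))


def merge_csv(sources, out, restval=0):
    """Write the rows of every CSV text in sources to out.

    Columns missing from a given source are padded with restval.
    """
    # First pass: read only the header line of each source.
    headers = [next(csv.reader(io.StringIO(text))) for text in sources]
    writer = csv.DictWriter(out, fieldnames=merged_fieldnames(headers),
                            restval=restval)
    writer.writeheader()
    # Second pass: stream the rows straight into the writer.
    for text in sources:
        writer.writerows(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data for two years.
year_2012 = "Date,Time,Species A,Species B,Species X\n04/01/2012,13:00,1,2,3\n"
year_2013 = "Date,Time,Species A,Species B,Species C\n04/01/2013,13:00,4,5,6\n"

result = io.StringIO(newline="")
merge_csv([year_2012, year_2013], result)
print(result.getvalue())
```

The restval argument is what supplies the padding: any key listed in fieldnames but absent from a row dictionary is written as that value.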

Answer 1: score 5, accepted, 2013-04-15 14:36:58
Answer 2: score 1, 2013-04-15 15:24:05
Answer 3: score 0, 2013-04-15 14:06:38
