How to merge multiple (more than 2) CSV files on a common column?

Question

I have 50 CSV files with the same columns, like the ones below:

gdp1950.csv

id,gdp
a,100
b,200
c,300

gdp1951.csv

id,gdp
a,400
b,500
c,600

...

gdp2000.csv

id,gdp
a,700
b,800
c,900

What I want to do is merge the CSV files above like this:

id,gdp1950,gdp1951,...,gdp2000
a,100,400,...,700
b,200,500,...,800
c,300,600,...,900

The task has to be done in Python, in a Jupyter notebook. Any ideas?

Answer 1

You can use a library called pandas, which is well suited to this task:

from functools import reduce
import pandas as pd

# Rename each file's "gdp" column to "gdp<year>" so the merged column names stay unique
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)
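To see what the reduce/merge chain produces without touching the file system, here is a minimal sketch on in-memory data. The `samples` dictionary is a hypothetical stand-in for three of the files from the question, and renaming `gdp` to `gdp<year>` is an assumption made so that the merged column names stay unique and match the desired header.

```python
from functools import reduce
from io import StringIO

import pandas as pd

# Hypothetical in-memory stand-ins for gdp1950.csv, gdp1951.csv and gdp2000.csv
samples = {
    1950: "id,gdp\na,100\nb,200\nc,300\n",
    1951: "id,gdp\na,400\nb,500\nc,600\n",
    2000: "id,gdp\na,700\nb,800\nc,900\n",
}

# Give each frame a unique value column (gdp1950, gdp1951, ...) before merging
dfs = [
    pd.read_csv(StringIO(text)).rename(columns={"gdp": f"gdp{year}"})
    for year, text in samples.items()
]

# Fold the list into one frame by repeatedly merging on the shared "id" column
merged = reduce(lambda left, right: pd.merge(left, right, on="id", how="inner"), dfs)

# index=False keeps the pandas row index out of the CSV text
print(merged.to_csv(index=False))
```

With how="inner", an id that is missing from any one file is dropped from the result; how="outer" would keep it and leave the gaps empty.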

Answer 2

You can solve it with plain Python, without any third-party libraries or modules:

outputDict = {"id" : []}
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline()    # We don't need that line
        for line in file:
            key, value = line.rstrip("\n").split(",")
            if key in outputDict:
                outputDict[key].append(value)
            else:
                outputDict[key] = [value]

with open("gdpTotal.csv", "w") as output:
    # Convert the dictionary of lists into a suitable string for file writing
    output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()))

The last line, "\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()), is roughly equivalent to the following loop (the rows written are the same, but the process differs):

for k, v in outputDict.items():
    output.write(f"{k},{','.join(v)}\n")
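The difference between the two ways of writing can be checked on a small example. The sketch below uses a hypothetical miniature `output_dict` and `io.StringIO` buffers in place of the real output file; it is an illustration, not part of the original answer.

```python
from io import StringIO

# Hypothetical miniature version of outputDict from the answer
output_dict = {
    "id": ["gdp1950", "gdp1951"],
    "a": ["100", "400"],
    "b": ["200", "500"],
}

# Variant 1: build the whole text first, then write it once
joined = StringIO()
joined.write("\n".join(",".join((k, *v)) for k, v in output_dict.items()))

# Variant 2: write one row per iteration, each ending in a newline
looped = StringIO()
for k, v in output_dict.items():
    looped.write(f"{k},{','.join(v)}\n")

# Same rows; the loop variant additionally terminates the last row with "\n"
assert joined.getvalue() + "\n" == looped.getvalue()
print(looped.getvalue(), end="")
```

The join variant holds the complete text in memory before writing, while the loop writes row by row and leaves a trailing newline at the end of the file.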

Also, you could use collections.defaultdict to remove the if statement. It is also slightly faster.

from collections import defaultdict

outputDict = defaultdict(list)    # Missing keys start out as empty lists
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline()    # Skip the header line
        for line in file:
            key, value = line.rstrip("\n").split(",")
            outputDict[key].append(value)

with open("gdpTotal.csv", "w") as output:
    output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()))
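As a small illustration of why the if statement becomes unnecessary, the sketch below groups the same hypothetical (id, gdp) pairs once with a plain dict and once with `defaultdict(list)`, then checks that both give the same mapping.

```python
from collections import defaultdict

# Hypothetical (id, gdp) pairs, as they would be read from two files
rows = [("a", "100"), ("b", "200"), ("a", "400"), ("b", "500")]

# Plain dict: the list has to be created explicitly the first time a key is seen
plain = {}
for key, value in rows:
    if key in plain:
        plain[key].append(value)
    else:
        plain[key] = [value]

# defaultdict(list): a missing key is created with an empty list on first access
grouped = defaultdict(list)
for key, value in rows:
    grouped[key].append(value)

assert dict(grouped) == plain
print(dict(grouped))  # {'a': ['100', '400'], 'b': ['200', '500']}
```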

Measured with timeit.timeit (with the parameter number=100), the first version takes 0.825195171 seconds and the second takes 0.8229198819999999 seconds. Compare that with the pandas version:

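from functools import reduce
import pandas as pd

# Rename each file's "gdp" column to "gdp<year>" so the merged column names stay unique
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)
df.to_csv("gdpTotal.csv", index=False)    # index=False keeps the row index out of the file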

This takes 32.095738075999996 seconds. It needs fewer lines, but it is much slower.
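For anyone who wants to repeat the measurement, here is a rough sketch of how a timing with `timeit.timeit` and `number=100` might be set up. The helpers `write_samples` and `merge_plain` and the generated file contents are hypothetical, and the sketch only times reading and merging in memory, so the numbers it prints will not match the ones quoted above and will vary by machine.

```python
import tempfile
import timeit
from collections import defaultdict
from pathlib import Path


def write_samples(directory, years):
    """Create small hypothetical gdp<year>.csv files for the benchmark."""
    for offset, year in enumerate(years):
        rows = [f"{name},{100 * (offset + n)}" for n, name in enumerate("abc", start=1)]
        (directory / f"gdp{year}.csv").write_text("id,gdp\n" + "\n".join(rows) + "\n")


def merge_plain(directory, years):
    """Merge the files with the defaultdict approach and return the CSV text."""
    columns = defaultdict(list)
    for year in years:
        columns["id"].append(f"gdp{year}")
        with open(directory / f"gdp{year}.csv") as file:
            file.readline()  # skip the "id,gdp" header
            for line in file:
                key, value = line.rstrip("\n").split(",")
                columns[key].append(value)
    return "\n".join(",".join((k, *v)) for k, v in columns.items())


years = range(1950, 2001)
with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    write_samples(directory, years)
    merged_text = merge_plain(directory, years)
    # number=100 runs the merge 100 times and returns the total elapsed seconds
    elapsed = timeit.timeit(lambda: merge_plain(directory, years), number=100)

print(f"100 runs: {elapsed:.3f} s")
```

Wrapping the pandas version in a function of the same shape and timing it the same way would give a like-for-like comparison on the same files.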

Answer 1: score 2, accepted, 2019-03-23 14:34:23

Answer 2: score 0, 2019-03-23 16:43:31
