How to merge multiple (more than 2) CSV files based on their common column?
I have 50 CSV files that all share the same columns, as shown below:
gdp1950.csv
id,gdp
a,100
b,200
c,300
gdp1951.csv
id,gdp
a,400
b,500
c,600
...
gdp2000.csv
id,gdp
a,700
b,800
c,900
What I want to do is merge the CSV files above like this:
id,gdp1950,gdp1951,...,gdp2000
a,100,400,...,700
b,200,500,...,800
c,300,600,...,900
The task needs to be done with Python in a Jupyter notebook. Any ideas?
You can use a library called pandas, which is well suited to this task:
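from functools import reduce
import pandas as pd
# Rename each file's "gdp" column to "gdp<year>" so the merged columns stay distinct
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)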
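As an alternative to chaining merges, the same wide table can probably be built in one step with pd.concat. The following is only a sketch: the three-file sample and the temporary folder are made up for the demo, and it assumes every file has exactly the columns id and gdp.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: three small files instead of the full 1950-2000 range.
sample = {1950: [100, 200, 300], 1951: [400, 500, 600], 1952: [700, 800, 900]}

with tempfile.TemporaryDirectory() as folder:
    for year, values in sample.items():
        pd.DataFrame({"id": ["a", "b", "c"], "gdp": values}).to_csv(
            os.path.join(folder, f"gdp{year}.csv"), index=False
        )

    # Read each file as a Series indexed by id and named after its year.
    columns = [
        pd.read_csv(os.path.join(folder, f"gdp{year}.csv"), index_col="id")["gdp"]
        .rename(f"gdp{year}")
        for year in sample
    ]

# Align all Series on id at once; join="inner" keeps only ids present in every file.
merged = pd.concat(columns, axis=1, join="inner").rename_axis("id").reset_index()
print(merged.to_csv(index=False))
```

Because each Series already carries its year in its name, no column suffixes are involved, and the result can be written with merged.to_csv("gdpTotal.csv", index=False).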
You can also solve it with vanilla Python, without any third-party libraries or modules:
outputDict = {"id" : []}
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline() # We don't need that line
        for line in file:
            key, value = line.rstrip("\n").split(",")
            if key in outputDict:
                outputDict[key].append(value)
            else:
                outputDict[key] = [value]
with open("gdpTotal.csv", "w") as output:
    output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items())) # Convert the dictionary of lists into a suitable string for file writing
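The expression in the last line, "\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()),
is equivalent to the following loop (the rows are the same, but they are written one at a time and the loop also ends the file with a newline):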
for k, v in outputDict.items():
    output.write(f"{k},{','.join(v)}\n")
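Splitting each line on "," works for the sample data, but it would break on quoted fields that contain commas. A minimal sketch using the standard library's csv module instead is shown below. The function name merge_gdp_files and the two-file demo are hypothetical, and like the code above it assumes every id appears in every file (otherwise the columns would shift).

```python
import csv
import os
import tempfile


def merge_gdp_files(folder, years, out_name="gdpTotal.csv"):
    """Merge the gdp<year>.csv files in `folder` into one wide CSV keyed by id."""
    years = list(years)
    rows = {}  # id -> list of values, one per year (dicts keep insertion order)
    for year in years:
        with open(os.path.join(folder, f"gdp{year}.csv"), newline="") as f:
            for record in csv.DictReader(f):  # the header row supplies the keys
                rows.setdefault(record["id"], []).append(record["gdp"])
    out_path = os.path.join(folder, out_name)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", *(f"gdp{year}" for year in years)])
        for key, values in rows.items():
            writer.writerow([key, *values])
    return out_path


# Hypothetical demo with two tiny files in a temporary folder.
years = [1950, 1951]
with tempfile.TemporaryDirectory() as folder:
    for year, values in zip(years, (["100", "200"], ["400", "500"])):
        with open(os.path.join(folder, f"gdp{year}.csv"), "w", newline="") as f:
            csv.writer(f).writerows([["id", "gdp"], ["a", values[0]], ["b", values[1]]])
    with open(merge_gdp_files(folder, years), newline="") as f:
        merged_rows = list(csv.reader(f))

print(merged_rows)
```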
In addition, you can use collections.defaultdict
to get rid of the if statement. It is also slightly faster.
from collections import defaultdict
outputDict = defaultdict(list)
for i in range(1950, 2001):
    outputDict["id"].append(f"gdp{i}")
    with open(f"gdp{i}.csv", "r") as file:
        file.readline() # Skip the header row
        for line in file:
            key, value = line.rstrip("\n").split(",")
            outputDict[key].append(value) # Missing keys start out as an empty list
with open("gdpTotal.csv", "w") as output:
    output.write("\n".join(",".join((k, *[i for i in v])) for k, v in outputDict.items()))
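To see why the if statement becomes unnecessary, here is a small self-contained comparison. The key/value pairs are made up for illustration; the point is only that defaultdict(list) creates the empty list on first access.

```python
from collections import defaultdict

pairs = [("a", "100"), ("b", "200"), ("a", "400")]  # hypothetical rows

plain = {}
auto = defaultdict(list)
for key, value in pairs:
    # Plain dict: the key has to be created by hand the first time it is seen.
    if key in plain:
        plain[key].append(value)
    else:
        plain[key] = [value]
    # defaultdict: a missing key is initialised with list() automatically.
    auto[key].append(value)

print(plain == dict(auto))
```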
Measured with timeit.timeit (with the argument number = 100), the first version takes 0.825195171 seconds and the second takes 0.8229198819999999 seconds. Using pandas instead:
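from functools import reduce
import pandas as pd
# Rename each file's "gdp" column to "gdp<year>" so the merged columns stay distinct
dfs = [pd.read_csv(f"gdp{i}.csv").rename(columns={"gdp": f"gdp{i}"}) for i in range(1950, 2001)]
df = reduce(lambda df1, df2: pd.merge(left=df1, right=df2, on=["id"], how="inner"), dfs)
df.to_csv("gdpTotal.csv", index=False) # index=False keeps the row numbers out of the file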
The pandas version takes 32.095738075999996 seconds. It may need fewer lines, but it is much slower.
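For anyone who wants to repeat this kind of measurement, a timing run with timeit.timeit might be set up roughly as follows. This is only a sketch: the function name merge_with_defaultdict, the three generated files and the temporary folder are assumptions for the demo, and the numbers it prints will differ from the ones quoted above.

```python
import os
import tempfile
import timeit
from collections import defaultdict


def merge_with_defaultdict(folder, years):
    """Collect the gdp values of every id across the yearly files."""
    table = defaultdict(list)
    for year in years:
        table["id"].append(f"gdp{year}")
        with open(os.path.join(folder, f"gdp{year}.csv")) as f:
            f.readline()  # skip the header row
            for line in f:
                key, value = line.rstrip("\n").split(",")
                table[key].append(value)
    return table


years = range(1950, 1953)  # hypothetical: three files instead of the full range
with tempfile.TemporaryDirectory() as folder:
    for offset, year in enumerate(years):
        with open(os.path.join(folder, f"gdp{year}.csv"), "w") as f:
            f.write("id,gdp\n")
            f.write(f"a,{100 + offset}\nb,{200 + offset}\n")
    # number=100 calls the function 100 times and returns the total time in seconds.
    elapsed = timeit.timeit(lambda: merge_with_defaultdict(folder, years), number=100)
    result = merge_with_defaultdict(folder, years)

print(f"100 runs took {elapsed:.4f} s")
```

Note that the value returned by timeit.timeit is the total for all 100 calls, not the time per call.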