
Loading multiple csv files of a folder into one dataframe

I have multiple csv files saved in one folder, all with the same column layout, and want to load them into Python as a single pandas dataframe.

The question is really similar to this thread.

I am using the following code:

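import glob
import pandas as pd

salesdata = pd.DataFrame()
# raw string so the backslashes in the Windows path are not treated as escapes
for f in glob.glob(r"TransactionData\Promorelevant\*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    # note: DataFrame.append was removed in pandas 2.0
    salesdata = salesdata.append(appenddata, ignore_index=True)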

Is there a better solution for this, perhaps with another package?

This is taking too much time.

Thanks

I suggest using a list comprehension with concat:

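import glob
import pandas as pd

# raw string; the separator before *.csv matches the folder used in the question
files = glob.glob(r"TransactionData\Promorelevant\*.csv")
# read each file into its own DataFrame
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

# concatenate once at the end
salesdata = pd.concat(dfs, ignore_index=True)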
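A self-contained sketch of the same pattern, for trying it out without the real data. The `load_folder` helper, the temporary folder, the file names and the sample rows are all made up for illustration; only `glob.glob`, `pd.read_csv` and `pd.concat` come from the answer above. Sorting the file list is an addition, since glob does not guarantee any particular order.

```python
import glob
import os
import tempfile

import pandas as pd


def load_folder(folder, pattern="*.csv", **read_csv_kwargs):
    """Read every file matching pattern in folder and concatenate once."""
    # sorted() makes the row order reproducible; glob gives no ordering guarantee
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r} in {folder!r}")
    frames = [pd.read_csv(f, **read_csv_kwargs) for f in files]
    # one concat at the end instead of growing a DataFrame file by file
    return pd.concat(frames, ignore_index=True)


# Demo with made-up data: three small semicolon-separated files without header
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"part{i}.csv"), "w") as fh:
            fh.write(f"{i};a\n{i};b\n")
    salesdata = load_folder(tmp, header=None, sep=";")

print(salesdata.shape)
```

Appending inside the loop copies the accumulated frame on every iteration, which is probably why a single concat at the end tends to be cheaper when there are many files.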

With help from the linked answer:

This seems to be the best one-liner:

import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))
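Note that the one-liner calls `pd.read_csv` with its defaults, so the first row of each file is taken as the header and the separator is a comma. For files like the ones in the question (no header, `;` as separator), one way to keep the `map` form is `functools.partial`. A sketch, where the `read_sales` name, the demo folder and the file contents are invented for illustration:

```python
import glob
import os
import tempfile
from functools import partial

import pandas as pd

# read_csv with the options used in the question baked in
read_sales = partial(pd.read_csv, header=None, sep=";")

with tempfile.TemporaryDirectory() as tmp:
    # made-up sample files
    for name, text in [("a.csv", "1;x\n2;y\n"), ("b.csv", "3;z\n")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(text)

    paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    df = pd.concat(map(read_sales, paths), ignore_index=True)

print(df)
```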

Maybe using bash will be faster:

head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

Or, if running from within a Jupyter notebook:

!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

The idea is that nothing needs to be parsed.

The first command copies the header of one of the files. tail then skips the header line of every file and appends the remaining rows to merged.csv. If your files don't have a header, skip the first command and use -n +1 instead of -n +2 so that no data row is dropped.
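If bash is not available (for example on Windows), the same idea of copying lines without parsing them can be sketched in plain Python. The `merge_csv_raw` helper and the demo files are hypothetical, and the sketch assumes every file ends with a newline, the same assumption the head/tail version makes:

```python
import glob
import os
import shutil
import tempfile


def merge_csv_raw(paths, out_path, has_header=True):
    """Concatenate CSV files byte-wise, keeping only the first header."""
    with open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with open(path, "rb") as src:
                if has_header and i > 0:
                    src.readline()  # drop the repeated header line
                shutil.copyfileobj(src, out)  # copy the rest unparsed


# Demo with two made-up files that share a header line
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        with open(os.path.join(tmp, f"{i}.csv"), "w", newline="") as fh:
            fh.write(f"id;name\n{i};row{i}\n")
    # collect the inputs before the merged file exists in the same folder
    paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    merged = os.path.join(tmp, "merged.csv")
    merge_csv_raw(paths, merged)
    with open(merged, newline="") as fh:
        content = fh.read()

print(content)
```

For files without a header, as in the question, passing `has_header=False` copies every file in full.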

Appending in Python is probably more expensive.

Of course, check with pandas that the merged file still parses correctly.

pd.read_csv("merged.csv")

Curious about your benchmark.

I checked all these approaches except the bash one with the time function (only one run; also note that the files are on a shared drive).

Here are the results:

My approach: 1220.49

List comprehension + concat: 1135.53

concat + map + join: 1116.31
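Since these numbers come from a single run on a shared drive, differences of a few percent may well be noise. A sketch of a slightly more robust timing, using `time.perf_counter` and the best of several runs; the `time_best` helper and the dummy workload are made up for illustration and are not what produced the numbers above:

```python
import time


def time_best(func, repeats=3):
    """Return the shortest wall-clock time of func() over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Placeholder workload; replace with a function that loads the csv files
def dummy_load():
    return sum(range(100_000))


best = time_best(dummy_load)
print(f"best of 3: {best:.4f} s")
```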

I will go for list comprehension + concat, which saves me a few minutes and which I feel quite familiar with.

Thanks for your ideas.

