How to concatenate thousands of Pandas DataFrames efficiently?

I have a folder /data/csvs which contains ~7000 CSV files, each with ~600 lines. Each CSV has a name which contains a timestamp that needs to be preserved, e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21, today's date).

I need to:

  1. Load each CSV.
  2. Add a column in which the timestamp can be saved so I know which file the data came from. The time increases by half a second each row, so this column also shows the hour/minute/second/microseconds.
  3. Combine the rows into one table which will span a month of data.
  4. Ideally I'd like the final product to be a DataFrame.

Currently this is what I'm doing:

    import os
    import pandas as pd

    csv_folder_path = '/data/csvs'

    # Collect the names of all CSV files in the folder
    files = os.listdir(csv_folder_path)
    csv_names = []
    for file_name in files:
        if file_name[-4:] == '.csv':
            csv_names.append(file_name)

    to_process = len(csv_names)
    for i, csv_name in enumerate(csv_names):
        # Load the CSV and add the timestamp column derived from its file name
        df = pd.read_csv(f'{csv_folder_path}/{csv_name}')
        df = timestamp(df, csv_name)

        to_process = to_process - 1

        if i == 0:
            concat_df = df
            concat_df.to_feather(path=processed_path)  # processed_path: feather output file
        else:
            concat_df = pd.concat([concat_df, df])

            # Every 100 files, merge with what is already on disk and rewrite the feather file
            if to_process % 100 == 0:
                saved_df = pd.read_feather(path=processed_path)
                concat_df = pd.concat([saved_df, concat_df])
                concat_df.reset_index(drop=True, inplace=True)
                concat_df.to_feather(path=processed_path)

I'm loading each CSV as a DataFrame, adding the timestamp column, and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), then saving each batch of 100 to a large DataFrame feather file. This is really slow and uses loads of memory.
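
(`timestamp` above is a small helper that attaches the per-row times; its real body isn't shown here. A minimal sketch of the idea is below; it assumes the file stem is a ddmmyy date and that each file's rows start at midnight. The midnight start is an assumption, only the half-second increment is stated above.)

    # Hypothetical sketch of the timestamp helper (the actual implementation
    # isn't shown above). Assumes the file stem is a ddmmyy date and that the
    # rows start at midnight, spaced half a second apart.
    from datetime import datetime, timedelta

    def timestamp(df, csv_name):
        start = datetime.strptime(csv_name[:-4], '%d%m%y')  # e.g. '261121.csv' -> 2021-11-26 00:00:00
        step = timedelta(milliseconds=500)                   # half a second per row
        df['timestamp'] = [start + i * step for i in range(len(df))]
        return df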

What is a more efficient way of doing this?

First, you could load your files more efficiently using glob. This saves you from iterating over all the files yourself and checking whether the file extension is ".csv".

    import glob
    import os

    import pandas as pd

    src = '/data/csvs'
    # Lazily iterate over the paths of all CSV files in the folder
    files = glob.iglob(os.path.join(src, "*.csv"))

Then, read each file into a DataFrame inside a generator expression, in the same step assigning the basename of the file (without its extension) to a column named timestamp:

    # Lazily build one DataFrame per file, tagged with its file name minus the extension
    df_from_each_file = (
        pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0])
        for f in files
    )

And finally concatenate the dfs into one:

    # A single concat over all frames avoids repeatedly copying a growing DataFrame
    csv_data = pd.concat(df_from_each_file, ignore_index=True)
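
If you also need the full per-row times (hour/minute/second/microseconds increasing by half a second, as in step 2 of the question) rather than just the file stem, a possible follow-up is sketched below. It assumes each file's rows start at midnight and reuses processed_path from the question as the output file; both are assumptions rather than part of the answer above.

    # Assumption-laden sketch: turn the ddmmyy file stem into a real datetime
    # and add half a second per row, assuming each file starts at midnight.
    csv_data['timestamp'] = (
        pd.to_datetime(csv_data['timestamp'], format='%d%m%y')
        + pd.to_timedelta(csv_data.groupby('timestamp').cumcount() * 500, unit='ms')
    )
    csv_data.to_feather(processed_path)  # processed_path: the feather path from the question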

Hope this helped. I have used a process like this for large amounts of data and found it efficient enough.
