How to concatenate thousands of Pandas DataFrames efficiently?

I have a folder /data/csvs which contains ~7000 CSV files, each with ~600 lines. Each CSV has a name which contains a timestamp that needs to be preserved, e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21, today's date).

I need to:

  1. Load each CSV.
  2. Add a column in which the timestamp can be saved so I know which file the data came from. The time increases by half a second each row, so this column also shows the hour/minute/second/microseconds.
  3. Combine the rows into one table which will span a month of data.
  4. Ideally I'd like the final product to be a DataFrame.

Currently this is what I'm doing:

    import os
    import pandas as pd

    csv_folder_path = '/data/csvs'

    # Collect the names of all CSV files in the folder
    files = os.listdir(csv_folder_path)
    csv_names = []
    for file_name in files:
        if file_name[-4:] == '.csv':
            csv_names.append(file_name)

    to_process = len(csv_names)
    for i, csv_name in enumerate(csv_names):
        # Load the CSV and add the timestamp column derived from its file name
        df = pd.read_csv(f'{csv_folder_path}/{csv_name}')
        df = timestamp(df, csv_name)

        to_process = to_process - 1

        if i == 0:
            concat_df = df
            concat_df.to_feather(path=processed_path)  # processed_path: feather output file
        else:
            concat_df = pd.concat([concat_df, df])

            # Every 100 files, merge with what is already on disk and rewrite the feather file
            if to_process % 100 == 0:
                saved_df = pd.read_feather(path=processed_path)
                concat_df = pd.concat([saved_df, concat_df])
                concat_df.reset_index(drop=True, inplace=True)
                concat_df.to_feather(path=processed_path)

I'm loading each CSV as a DataFrame, adding the timestamp column, and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), then saving each batch of 100 to a large DataFrame feather file. This is really slow and uses loads of memory.
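
(`timestamp` above is a small helper that attaches the per-row times; its real body isn't shown here. A minimal sketch of the idea is below; it assumes the file stem is a ddmmyy date and that each file's rows start at midnight. The midnight start is an assumption, only the half-second increment is stated above.)

    # Hypothetical sketch of the timestamp helper (the actual implementation
    # isn't shown above). Assumes the file stem is a ddmmyy date and that the
    # rows start at midnight, spaced half a second apart.
    from datetime import datetime, timedelta

    def timestamp(df, csv_name):
        start = datetime.strptime(csv_name[:-4], '%d%m%y')  # e.g. '261121.csv' -> 2021-11-26 00:00:00
        step = timedelta(milliseconds=500)                   # half a second per row
        df['timestamp'] = [start + i * step for i in range(len(df))]
        return df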

What is a more efficient way of doing this?

First, you could load your files more efficiently using glob. This saves you from iterating over all the files yourself and checking whether the file extension is ".csv".

    import glob
    import os

    import pandas as pd

    src = '/data/csvs'
    # Lazily iterate over the paths of all CSV files in the folder
    files = glob.iglob(os.path.join(src, "*.csv"))

Then, read each file into a DataFrame inside a generator expression, in the same step assigning the basename of the file (without its extension) to a column named timestamp:

    # Lazily build one DataFrame per file, tagged with its file name minus the extension
    df_from_each_file = (
        pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0])
        for f in files
    )

And finally concatenate the dfs into one:

    # A single concat over all frames avoids repeatedly copying a growing DataFrame
    csv_data = pd.concat(df_from_each_file, ignore_index=True)
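
If you also need the full per-row times (hour/minute/second/microseconds increasing by half a second, as in step 2 of the question) rather than just the file stem, a possible follow-up is sketched below. It assumes each file's rows start at midnight and reuses processed_path from the question as the output file; both are assumptions rather than part of the answer above.

    # Assumption-laden sketch: turn the ddmmyy file stem into a real datetime
    # and add half a second per row, assuming each file starts at midnight.
    csv_data['timestamp'] = (
        pd.to_datetime(csv_data['timestamp'], format='%d%m%y')
        + pd.to_timedelta(csv_data.groupby('timestamp').cumcount() * 500, unit='ms')
    )
    csv_data.to_feather(processed_path)  # processed_path: the feather path from the question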

Hope this helped. I have used a process like this for large amounts of data and found it efficient enough.
