How to concatenate thousands of Pandas DataFrames efficiently?
I have a folder /data/csvs which contains ~7000 CSV files, each with ~600 lines. Each CSV has a name containing a timestamp that needs to be preserved, e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21, today's date).
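As an aside, since the filename stems encode dates as DDMMYY, they can be parsed into proper datetimes with `pd.to_datetime` (a minimal sketch; this step is not part of the question's code and the variable names are illustrative):

```python
import pandas as pd

# A filename stem such as "261121" encodes the date as DDMMYY.
stem = "261121"

# Parse it into a proper Timestamp (26 Nov 2021).
ts = pd.to_datetime(stem, format="%d%m%y")
print(ts)  # 2021-11-26 00:00:00
```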
I need to read each CSV into a DataFrame, add a timestamp column derived from the filename, and concatenate everything into a single DataFrame. Currently this is what I'm doing:
import os
import pandas as pd

csv_folder_path = '/data/csvs'
# processed_path (the output feather file) and timestamp() (adds the
# timestamp column from the filename) are defined elsewhere.

files = os.listdir(csv_folder_path)
csv_names = []
for file_name in files:
    if file_name[-4:] == '.csv':
        csv_names.append(file_name)

to_process = len(csv_names)
for i, csv_name in enumerate(csv_names):
    df = pd.read_csv(f'{csv_folder_path}/{csv_name}')
    df = timestamp(df, csv_name)
    to_process = to_process - 1
    if i == 0:
        concat_df = df
        concat_df.to_feather(path=processed_path)
    else:
        concat_df = pd.concat([concat_df, df])
    if to_process % 100 == 0:
        saved_df = pd.read_feather(path=processed_path)
        concat_df = pd.concat([saved_df, concat_df])
        concat_df.reset_index(drop=True, inplace=True)
        concat_df.to_feather(path=processed_path)
I'm loading each CSV as a DataFrame, adding the timestamp column, and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), then appending each batch of 100 to a large DataFrame feather file. This is really slow and uses loads of memory.
What is a more efficient way of doing this?
First, you could load your files more efficiently using glob. This saves you iterating over all the files and checking whether the file extension is ".csv":
import glob
import os

src = '/data/csvs'
files = glob.iglob(os.path.join(src, "*.csv"))
Then, read each file into a DataFrame inside a generator expression, in the same step assigning the basename of the file to a column named timestamp:
df_from_each_file = (pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0]) for f in files)
And finally concatenate the dfs into one:
csv_data = pd.concat(df_from_each_file, ignore_index=True)
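Putting the three steps together, here is a minimal self-contained sketch of this approach. It writes a couple of synthetic two-row CSVs to a temporary folder purely for demonstration; the `/data/csvs` path from the question would take the place of `src`:

```python
import glob
import os
import tempfile

import pandas as pd

# Create a temporary folder with two small CSVs standing in for /data/csvs.
src = tempfile.mkdtemp()
for stem in ("261121", "261122"):
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(
        os.path.join(src, f"{stem}.csv"), index=False
    )

# Step 1: find the CSV files lazily with glob.
files = glob.iglob(os.path.join(src, "*.csv"))

# Step 2: a generator of DataFrames, each tagged with its filename stem.
df_from_each_file = (
    pd.read_csv(f).assign(timestamp=os.path.basename(f).split(".")[0])
    for f in files
)

# Step 3: one single concat instead of thousands of incremental ones.
csv_data = pd.concat(df_from_each_file, ignore_index=True)

print(csv_data.shape)  # (4, 3): 2 rows per file; columns a, b, timestamp
```

The key difference from the question's loop is that pd.concat is called exactly once, so no intermediate DataFrames are copied over and over.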
Hope this helped. I have used a process like this for large amounts of data and found it efficient enough.