What is the fastest way to read large data from multiple files and aggregate data in Python?
I have many files: 1.csv, 2.csv ... N.csv. I want to read them all and aggregate them into a DataFrame. But reading the files sequentially in one process will definitely be slow. So how can I improve it? Besides, I am using a Jupyter notebook.
Also, I am a little confused about the "cost of passing parameters or return values between Python processes".
I know the question may be a duplicate, but I found that most of the answers use multiprocessing to solve it. Multiprocessing does solve the GIL problem. But in my experience (which may be wrong): passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because the data needs serializing and de-serializing. And I am not sure about returning large values from the subprocess.
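A rough way to see the cost described above (not from the original post): `multiprocessing` sends arguments and return values through pickle, so dumping and loading a DataFrame approximates the per-transfer overhead. The frame size here is an arbitrary assumption.

```python
import pickle
import time

import numpy as np
import pandas as pd

# An arbitrary example frame; real workloads will differ.
df = pd.DataFrame(np.random.rand(100_000, 10))

t0 = time.perf_counter()
blob = pickle.dumps(df)        # roughly what happens when the frame is passed to a subprocess
restored = pickle.loads(blob)  # roughly what happens when it is returned
elapsed = time.perf_counter() - t0
```

`elapsed` is pure serialization overhead: it is paid on every transfer, on top of the actual work the subprocess does.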
Is it most efficient to use a Queue, or joblib, or Ray?
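For what it's worth, one way to sidestep the pickling cost entirely is a thread pool: threads share memory, and pandas' CSV parser spends much of its time in C code that releases the GIL, so reads can genuinely overlap. A minimal sketch (the `read_csvs_threaded` name, the glob pattern, and the worker count are my assumptions, not from the post):

```python
import glob
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_csvs_threaded(pattern, workers=8):
    # Collect the matching csv paths, then read them concurrently.
    # No DataFrame ever crosses a process boundary, so nothing is pickled.
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(pd.read_csv, paths))
    return pd.concat(frames, axis=0, ignore_index=True)
```

Whether this beats a plain loop depends on how much of the time is disk I/O versus parsing; it is worth benchmarking on your own files.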
Reading CSV is fast. I would read all the CSVs into a list and then concat the list into one DataFrame. Here is a bit of code from my use case. I find all .csv files in my path and save the csv file names in the variable "results". I then loop over the file names, read each csv, and store it in a list, which I later concat into one DataFrame.
import pandas as pd

data = []
for item in results:  # "results" holds the csv file names found earlier
    data.append(pd.read_csv(item))
main_df = pd.concat(data, axis=0)
I am not saying this is the best approach, but it works great for me :)
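A self-contained version of the loop above, with the file-discovery step made explicit (the answer did not show how "results" was built; `glob` and the `concat_csvs` name are my assumptions):

```python
import glob
import os

import pandas as pd

def concat_csvs(folder):
    # Find all .csv files in the folder and keep their names in
    # "results", matching the variable mentioned in the answer.
    results = sorted(glob.glob(os.path.join(folder, "*.csv")))
    data = []
    for item in results:
        data.append(pd.read_csv(item))
    return pd.concat(data, axis=0)
```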