What is the fastest way to read large data from multiple files and aggregate data in Python?
I have many files: 1.csv, 2.csv ... N.csv. I want to read them all and aggregate them into a DataFrame. But reading the files sequentially in one process will definitely be slow. So how can I improve it? Besides, I am using a Jupyter notebook.
Also, I am a little confused about the "cost of passing parameters or return values between Python processes".
I know the question may be a duplicate, but I found that most of the answers use multiprocessing to solve it. Multiprocessing does solve the GIL problem. But in my experience (which may be wrong): passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because the data needs serializing and de-serializing. And I am not sure about returning large values from the subprocess.
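A rough way to see the cost described above (not from the original post): `multiprocessing` sends arguments and return values through pickle, so dumping and loading a DataFrame approximates the per-transfer overhead. The frame size here is an arbitrary assumption.

```python
import pickle
import time

import numpy as np
import pandas as pd

# An arbitrary example frame; real workloads will differ.
df = pd.DataFrame(np.random.rand(100_000, 10))

t0 = time.perf_counter()
blob = pickle.dumps(df)        # roughly what happens when the frame is passed to a subprocess
restored = pickle.loads(blob)  # roughly what happens when it is returned
elapsed = time.perf_counter() - t0
```

`elapsed` is pure serialization overhead: it is paid on every transfer, on top of the actual work the subprocess does.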
Is it most efficient to use a Queue, or joblib, or Ray?
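For what it's worth, one way to sidestep the pickling cost entirely is a thread pool: threads share memory, and pandas' CSV parser spends much of its time in C code that releases the GIL, so reads can genuinely overlap. A minimal sketch (the `read_csvs_threaded` name, the glob pattern, and the worker count are my assumptions, not from the post):

```python
import glob
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_csvs_threaded(pattern, workers=8):
    # Collect the matching csv paths, then read them concurrently.
    # No DataFrame ever crosses a process boundary, so nothing is pickled.
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        frames = list(pool.map(pd.read_csv, paths))
    return pd.concat(frames, axis=0, ignore_index=True)
```

Whether this beats a plain loop depends on how much of the time is disk I/O versus parsing; it is worth benchmarking on your own files.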
Reading CSV is fast. I would read all the CSVs into a list and then concat the list into one DataFrame. Here is a bit of code from my use case. I find all .csv files in my path and save the csv file names in the variable "results". I then loop over the file names, read each csv, and store it in a list, which I later concat into one DataFrame.
import pandas as pd

data = []
for item in results:  # "results" holds the csv file names found earlier
    data.append(pd.read_csv(item))
main_df = pd.concat(data, axis=0)
I am not saying this is the best approach, but it works great for me :)
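A self-contained version of the loop above, with the file-discovery step made explicit (the answer did not show how "results" was built; `glob` and the `concat_csvs` name are my assumptions):

```python
import glob
import os

import pandas as pd

def concat_csvs(folder):
    # Find all .csv files in the folder and keep their names in
    # "results", matching the variable mentioned in the answer.
    results = sorted(glob.glob(os.path.join(folder, "*.csv")))
    data = []
    for item in results:
        data.append(pd.read_csv(item))
    return pd.concat(data, axis=0)
```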