
What is the fastest way to read large data from multiple files and aggregate the data in Python?

I have many files: 1.csv, 2.csv ... N.csv. I want to read them all and aggregate them into one DataFrame. But reading the files sequentially in a single process will definitely be slow, so how can I improve it? Also, everything runs in a Jupyter notebook.

I am also a little confused about the "cost of passing parameters or return values between Python processes".

I know this question may be a duplicate, but I found that most answers solve it with multiprocessing. Multiprocessing does get around the GIL, but in my experience (which may be wrong), passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because the data has to be serialized and deserialized. I am also unsure about the cost of returning large values from the subprocess.
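To make the concern concrete, here is a rough sketch of the overhead I mean (the DataFrame size below is made up): multiprocessing pickles arguments on the way into a worker and unpickles them in the child, and does the same again for return values.

import pickle
import time

import numpy as np
import pandas as pd

# a made-up DataFrame, roughly 80 MB of float64 data
df = pd.DataFrame(np.random.rand(1_000_000, 10))

start = time.perf_counter()
payload = pickle.dumps(df)        # multiprocessing serializes arguments like this
restored = pickle.loads(payload)  # and the child process deserializes them
print(f"pickle round trip took {time.perf_counter() - start:.2f}s")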

Would it be most efficient to use a Queue, joblib, or Ray?

Reading csv is fast. I would read all the csv files into a list and then concat the list into one dataframe. Here is a bit of code from my use case. I find all .csv files in my path and save the csv file names in the variable "results". I then loop over the file names, read each csv, and store it in a list, which I later concat into one dataframe.

import pandas as pd

data = []
for item in results:                # "results" holds the csv file names
    data.append(pd.read_csv(item))  # read each file into its own DataFrame
main_df = pd.concat(data, axis=0)   # stack all frames into one DataFrame
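The file-discovery step is not shown above; a minimal sketch of how "results" could be built, assuming the csv files sit in the current working directory:

import glob

# collect every .csv file in the working directory; adjust the pattern as needed
results = glob.glob("*.csv")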

I am not saying this is the best approach, but this works great for me :)
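If parsing turns out to be the bottleneck, a parallel variant is possible without shipping large DataFrames into the workers: each worker receives only a short path string, which is cheap to pickle, and the parsed DataFrame is serialized once on the way back. A minimal sketch using only the standard library and pandas (the glob pattern is an assumption; pd.read_csv is importable, which is what process pools require of the mapped function, including under Jupyter):

import glob
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def load_all(paths):
    # each worker receives a small string argument; the resulting DataFrame
    # is pickled once when it is sent back to the parent process
    with ProcessPoolExecutor() as pool:
        return list(pool.map(pd.read_csv, paths))

if __name__ == "__main__":
    paths = glob.glob("*.csv")  # assumed location of the csv files
    main_df = pd.concat(load_all(paths), axis=0)

Whether this beats the sequential loop depends on file sizes and disk speed; for small files, process startup and the one-time pickling of the results can cancel out the gain. Libraries like Ray avoid part of the return-value copy by keeping results in a shared-memory object store, which mainly matters when the returned frames are very large.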

