
Fastest way to read large text files into a Pandas DataFrame

I have several large files (> 4 GB each). Some of them are in a fixed-width format and some are pipe-delimited. The files contain both numeric and text data. Currently I am using the following approach:

import pandas as pd

# Fixed-width file. Note: read_fwf uses its own 'python-fwf' parser internally,
# so engine='c' is unlikely to have any effect here.
df1 = pd.read_fwf(fwFileName, widths=[2, 3, 5, 2, 16],
                  names=columnNames, dtype=columnTypes,
                  skiprows=1, engine='c',
                  keep_default_na=False)

# Pipe-delimited file (the keyword is 'usecols', not 'useCols').
df2 = pd.read_csv(pdFileName, sep='|', names=columnNames,
                  dtype=columnTypes, usecols=colNumbers,
                  skiprows=1, engine='c',
                  keep_default_na=False)

However, this seems to be slower than, for example, R's read_fwf (from readr) and fread (from data.table). Are there other methods I can use that would help speed up reading these files?

I am working on a big server with several cores, so memory is not an issue; I can safely load the whole files into memory. Maybe they amount to the same thing in this case, but my goal is to optimize for time, not for resources.

Update

Based on the comments so far, here are a few additional details about the data and my ultimate goal.

  • These files are compressed (the fixed-width files are zipped, the pipe-delimited ones are gzipped). Therefore, I am not sure whether things like Dask will add value for loading. Will they?
  • After loading these files, I plan to apply a computationally expensive function to groups of the data, so I need the whole dataset. The data is sorted by group, i.e. the first x rows are group 1, the next y rows are group 2, and so on. Would forming groups on the fly be more productive? Is there an efficient way of doing that, given that I don't know how many rows to expect in each group? (One chunk-based idea is sketched after this list.)
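To make the second point concrete, here is a minimal sketch of what I mean by forming groups on the fly: read the gzip-compressed, pipe-delimited file in chunks and yield each contiguous group as soon as it is complete. The group column name group_id, the chunk size, and expensive_function are placeholders for illustration, not my real names; pandas decompresses gzip transparently via compression='gzip'.

import pandas as pd

def iter_groups(path, group_col="group_id", chunksize=1_000_000):
    # Yield (key, frame) for each contiguous group, holding only one chunk
    # plus the current partial group in memory at a time.
    buffered = None
    reader = pd.read_csv(path, sep="|", compression="gzip",
                         chunksize=chunksize, keep_default_na=False)
    for chunk in reader:
        if buffered is not None:
            chunk = pd.concat([buffered, chunk], ignore_index=True)
        last_key = chunk[group_col].iloc[-1]
        # The file is sorted by group, so every group in this chunk except
        # the one containing the last row is guaranteed to be complete.
        complete = chunk[chunk[group_col] != last_key]
        buffered = chunk[chunk[group_col] == last_key]
        for key, grp in complete.groupby(group_col, sort=False):
            yield key, grp
    if buffered is not None and len(buffered):
        yield buffered[group_col].iloc[0], buffered

# for key, grp in iter_groups(pdFileName):
#     result = expensive_function(grp)  # the costly per-group computation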

Since we are taking time as the metric here, your memory size is not the main factor we should be looking at. On the contrary, methods that use lazy loading (less memory, only loading objects when needed) can be much faster than loading all of the data into memory at once. You can check out dask, as it provides such a lazy read function: https://dask.org/

import time
import dask.dataframe

start_time = time.time()
# Builds a lazy dask dataframe; only a small sample is read to infer dtypes.
data = dask.dataframe.read_csv('rg.csv')
duration = time.time() - start_time
print(f"Time taken {duration} seconds")  # less than a second

But as I said, this won't load the data into memory; it will only load portions of the data when needed. You can, however, load it in full using:

data.compute()  # materializes the full dataset in memory as a pandas DataFrame
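For the pipe-delimited file from the question specifically, the dask call could look roughly like the sketch below. This is only a sketch under a few assumptions: pdFileName, columnNames and columnTypes are the same placeholders used in the question, and because gzip files cannot be split, blocksize=None is passed so the whole file becomes a single partition.

import dask.dataframe as dd

# gzip is not splittable, so blocksize must be None (one partition per file).
ddf = dd.read_csv(pdFileName, sep='|', compression='gzip',
                  blocksize=None, names=columnNames, dtype=columnTypes,
                  skiprows=1, keep_default_na=False)
df = ddf.compute()  # materialize as a pandas DataFrame only when needed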

If you want to load things faster into memory, then you need good computing capabilities on your server. A good candidate that can benefit from such capabilities is ParaText: https://github.com/wiseio/paratext . You can benchmark ParaText against read_csv using the following code:

import time

import pandas as pd
import paratext

# Benchmark ParaText
start_time = time.time()
df = paratext.load_csv_to_pandas("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")

# Benchmark pandas.read_csv
start_time = time.time()
df = pd.read_csv("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")

Please note that results may be worse if you don't have enough compute power to support ParaText. You can check out benchmarks for ParaText loading large files here: https://deads.gitbooks.io/paratext-bench/content/results_csv_throughput.html
