Fastest way to read large text files into a Pandas DataFrame
I have several large files (> 4 GB each). Some of them are in a fixed-width format and some are pipe-delimited. The files contain both numeric and text data. Currently I am using the following approach:
import pandas as pd

# note: read_fwf always uses the Python parsing engine internally
# ('python-fwf'), so an engine='c' argument is silently ignored here
df1 = pd.read_fwf(fwFileName, widths=[2, 3, 5, 2, 16],
                  names=columnNames, dtype=columnTypes,
                  skiprows=1, keep_default_na=False)

# note: the keyword is 'usecols' (all lowercase), not 'useCols'
df2 = pd.read_csv(pdFileName, sep='|', names=columnNames,
                  dtype=columnTypes, usecols=colNumbers,
                  skiprows=1, engine='c', keep_default_na=False)
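(For reference, parsing a fixed-width record just means slicing each line at the cumulative offsets implied by the widths. A minimal pure-Python sketch of that idea, using the same widths as the read_fwf call above and a hypothetical sample line:)

```python
from itertools import accumulate

widths = [2, 3, 5, 2, 16]
# cumulative slice boundaries: [0, 2, 5, 10, 12, 28]
bounds = [0] + list(accumulate(widths))

def parse_fixed_width(line):
    """Slice one record into fields at the fixed offsets."""
    return [line[s:e].strip() for s, e in zip(bounds, bounds[1:])]

sample = "AB123VALUEXYfree text field    "
print(parse_fixed_width(sample))  # ['AB', '123', 'VALUE', 'XY', 'free text field']
```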
However, this seems to be slower than, for example, R's read_fwf (from readr) and fread (from data.table). Can I use some other methods that will help speed up reading these files?

I am working on a big server with several cores, so memory is not an issue; I can safely load the whole files into memory. Maybe they are the same thing in this case, but my goal is to optimize for time, not resources.
Update
Based on the comments so far, here are a few additional details about the data and my ultimate goal.
Since we are taking time as the metric here, your memory size is not the main factor to look at. On the contrary, all methods that use lazy loading (less memory, loading objects only when needed) are much, much faster than loading all the data into memory at once. You can check out dask, as it provides such a lazy read function: https://dask.org/
import time
import dask.dataframe

start_time = time.time()
data = dask.dataframe.read_csv('rg.csv')
duration = time.time() - start_time
print(f"Time taken {duration} seconds")  # less than a second
But as I said, this won't load the data into memory; rather, it loads only portions of the data when needed. You can, however, load it in full using:
data.compute()
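To illustrate why the lazy approach is fast up front, here is a stdlib-only sketch of the same idea (not dask itself): a generator parses rows only when they are consumed, so creating the reader costs almost nothing. The file contents and column names are hypothetical.

```python
import csv
import io

# a small in-memory stand-in for a large pipe-delimited file
raw = "a|b\n1|x\n2|y\n3|z\n"

def lazy_rows(handle):
    """Yield parsed rows one at a time -- nothing is loaded up front."""
    reader = csv.reader(handle, delimiter="|")
    header = next(reader)
    for row in reader:
        yield dict(zip(header, row))

rows = lazy_rows(io.StringIO(raw))  # cheap: no parsing has happened yet
first = next(rows)                  # work happens only when a row is consumed
print(first)                        # {'a': '1', 'b': 'x'}
```

Calling data.compute() in dask is the analogue of draining this generator: that is when the full cost of parsing is actually paid.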
If you want to load things faster in memory, you need good computing capabilities on your server. A good candidate that could benefit from such capabilities is ParaText ( https://github.com/wiseio/paratext ). You can benchmark ParaText against read_csv using the following code:
# with ParaText
import time
import paratext

start_time = time.time()
df = paratext.load_csv_to_pandas("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")

# with plain pandas
import time
import pandas as pd

start_time = time.time()
df = pd.read_csv("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")
Please note that results may be worse if you don't have enough compute power to support ParaText. You can check out benchmarks of ParaText loading large files here: https://deads.gitbooks.io/paratext-bench/content/results_csv_throughput.html