
How do I work with large, >30 GiB datasets that are formatted as SAS7BDAT files?

I have these 30 GiB SAS7BDAT files which correspond to a year's worth of data. When I try importing the file using pd.read_sas() I get a memory-related error. Upon research, I hear mentions of using Dask, segmenting the files into smaller chunks, or SQL. These answers sound pretty broad, and since I'm new, I don't really know where to begin. Would appreciate it if someone could share some details with me. Thanks.

I am not aware of a partitioned loader for this sort of data in dask. However, the pandas API apparently allows you to stream the data by chunks, so you could write these chunks to other files in any convenient format, and then process those either serially or with dask. The best value of chunksize will depend on your data and available memory.

The following should work, but I don't have any of this sort of data to try it on.

import pandas as pd

# Stream the SAS file in 100,000-row chunks and write each chunk out as Parquet
with pd.read_sas(..., chunksize=100000) as file_reader:
    for i, df in enumerate(file_reader):
        df.to_parquet(f"{i}.parq")

Then you can load the parts (in parallel) with:

import dask.dataframe as dd
ddf = dd.read_parquet("*.parq")
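
As a rough usage sketch only: the column names "region" and "sales" below are made-up placeholders, not columns from your data. The point is that once the parts are in Parquet, dask evaluates lazily and should only pull the partitions and columns the computation needs, so the work stays within memory even though the original file does not fit in RAM.

import dask.dataframe as dd

# Lazily combine all the Parquet parts into one logical dataframe
ddf = dd.read_parquet("*.parq")

# Hypothetical example: "region" and "sales" are placeholder column names
result = ddf.groupby("region")["sales"].mean().compute()
print(result)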
