
Is there any way to split a SAS file of around 16GB into multiple files/dataframes in Python?

I have a raw SAS file that is around 16GB, and even after keeping only the columns relevant to my problem, the file is still around 8GB. It looks something like this:

CUST_ID   FIELD_1   FIELD_2   FIELD_3 ... FIELD_7
1          65         786      ABC          Y
2          87         785      GHI          N
3          88         877      YUI          Y
...
9999999    92         767      XYS          Y

When I tried to import it into Python with df=pd.read_sas(path,format='SAS7BDAT'), my screen turned black, and after multiple attempts I finally got a MemoryError. Since I need the entire set of CUST_IDs for my problem, selecting only a sample and deleting the other rows is out of the question.

I thought maybe I could split this file into multiple sub-files, carry out the required calculations on each, and then finally reunite them into a single large file once all the work is done.

Is there any way to solve this issue? I really appreciate all the help that I can get!

Edit:

I've tried this

chunk_list = []
for chunk in df_chunk:          # df_chunk is assumed to come from pd.read_sas(path, format='SAS7BDAT', chunksize=...)
    chunk_filter = chunk        # no per-chunk reduction applied here
    chunk_list.append(chunk_filter)

df_concat = pd.concat(chunk_list)

But I'm still getting a MemoryError. Any help?

read_sas has a chunksize parameter, which should allow you to break the large file into smaller pieces so that you can read it. chunksize is the number of records read at a time.

Set the iterator flag to true and split the file in a loop before doing your processing.

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html
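For example, a minimal sketch of a chunked read; the file path, chunk size, and column list are placeholders to adapt, not part of your actual setup:

import pandas as pd

path = "data.sas7bdat"  # placeholder path to the SAS file

# chunksize is the number of rows returned per iteration; 100_000 is illustrative.
reader = pd.read_sas(path, format="SAS7BDAT", chunksize=100_000)

for chunk in reader:
    # Keep only the columns you need, then do your per-chunk processing here.
    chunk = chunk[["CUST_ID", "FIELD_1", "FIELD_2", "FIELD_3", "FIELD_7"]]
    # ... process / write out the chunk ...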

Alternatively, split the file in SAS before doing the output.


I think what you are trying to do is the following:

CHUNK = 10  # rows per chunk; use a much larger value for a 16GB file
new_file = 'new_file.csv.gz'

df = pd.read_sas(path, format='SAS7BDAT', chunksize=CHUNK)

for i, chunk in enumerate(df):
    # perform compression / reduction on the chunk here, then
    # write it out of memory onto disk, appending to one CSV
    chunk.to_csv(new_file,
                 mode='a',            # append mode
                 header=(i == 0),     # write the header only for the first chunk
                 index=False,
                 compression='gzip')  # saves disk space, maybe not needed

df = pd.read_csv(new_file)

You could try to compress the data inside the loop, because otherwise it will fail again when merging (see the sketch below the list):

  1. Dropping columns
  2. Lower-range numerical dtype
  3. Categoricals
  4. Sparse columns

Ref: https://pythonspeed.com/articles/pandas-load-less-data/
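A rough sketch of what those reductions might look like applied per chunk; the path, column names, dtypes, and chunk size are assumptions based on the sample data in the question, not a definitive recipe:

import pandas as pd

path = "data.sas7bdat"  # placeholder path to the large SAS file

def shrink(chunk):
    # 1. Drop columns that are not needed (names taken from the sample above).
    chunk = chunk[["CUST_ID", "FIELD_1", "FIELD_2", "FIELD_3", "FIELD_7"]].copy()
    # 2. Downcast numerics: whole-number floats shrink to the smallest int dtype.
    chunk["FIELD_1"] = pd.to_numeric(chunk["FIELD_1"], downcast="integer")
    chunk["FIELD_2"] = pd.to_numeric(chunk["FIELD_2"], downcast="integer")
    # 3. Low-cardinality strings become categoricals.
    #    (They only stay categorical after concat if the categories match across chunks.)
    chunk["FIELD_3"] = chunk["FIELD_3"].astype("category")
    chunk["FIELD_7"] = chunk["FIELD_7"].astype("category")
    # 4. Sparse dtypes (pd.SparseDtype) can help when a column is mostly one value;
    #    not shown here since the sample has no obviously sparse field.
    return chunk

reader = pd.read_sas(path, format="SAS7BDAT", chunksize=100_000)
df_small = pd.concat([shrink(chunk) for chunk in reader], ignore_index=True)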
