I have a raw SAS file that is around 16 GB, and even after keeping only the columns relevant to my problem, it is still around 8 GB. It looks something like this:
CUST_ID  FIELD_1  FIELD_2  FIELD_3  ...  FIELD_7
1        65       786      ABC           Y
2        87       785      GHI           N
3        88       877      YUI           Y
...
9999999  92       767      XYS           Y
When I tried to import it into Python using df = pd.read_sas(path, format='SAS7BDAT'), my screen turned black, and after multiple attempts I finally got a MemoryError. Since I need the entire set of CUST_ID values for my problem, selecting only a sample and deleting the other rows is out of the question.
I thought maybe I could split the file into multiple sub-files, carry out all the required calculations on each of them, and then finally reunite them into a single large file once all the work is done.
Is there any way to solve this issue? I really appreciate all the help that I can get!
Edit:
I've tried this:
# df_chunk comes from pd.read_sas(path, format='SAS7BDAT', chunksize=...)
chunk_list = []
for chunk in df_chunk:
    chunk_filter = chunk              # placeholder, no real filtering yet
    chunk_list.append(chunk_filter)
df_concat = pd.concat(chunk_list)     # concatenate everything back into one DataFrame
But I'm still getting a MemoryError. Any help?
read_sas has a chunksize parameter, which should allow you to split the large file into smaller pieces so that you can read it. chunksize is the number of records to read at a time.
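A minimal sketch of how that can look, assuming path points at your .sas7bdat file; the chunk size of 100,000 and the per-chunk aggregation on FIELD_7 are just placeholders for your own processing:

import pandas as pd

reader = pd.read_sas(path, format='SAS7BDAT', chunksize=100_000)

results = []
for chunk in reader:
    # keep only a small per-chunk result instead of storing the whole chunk;
    # collecting full chunks and concatenating them later hits the same MemoryError
    results.append(chunk.groupby('FIELD_7').size())

# combine the small per-chunk results into one final answer
summary = pd.concat(results).groupby(level=0).sum()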
Set the iterator flag to True and read the file in pieces in a loop before doing your processing.
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html
Or split the file in SAS before writing it out.
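For example, a sketch along these lines; the 100,000-row read size is arbitrary, and process() stands in for whatever per-chunk work you need to do:

import pandas as pd

reader = pd.read_sas(path, format='SAS7BDAT', iterator=True)
while True:
    chunk = reader.read(100_000)        # pull the next 100,000 rows
    if chunk is None or chunk.empty:    # end of file (exact behavior differs across pandas versions)
        break
    process(chunk)                      # process() is a hypothetical function of yours
reader.close()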
I think what you are trying to do is the following:
CHUNK = 10  # rows per chunk; in practice use something much larger, e.g. 100000
df_iter = pd.read_sas(path, format='SAS7BDAT', chunksize=CHUNK)
for i, chunk in enumerate(df_iter):
    # perform your compression / per-chunk processing here, then
    # write it out of your memory onto disk:
    chunk.to_csv('new_file.csv.gz',
                 mode='a',            # append mode
                 header=(i == 0),     # write the header only for the first chunk
                 index=False,
                 compression='gzip')  # mainly to save space on disk, maybe not needed
df = pd.read_csv('new_file.csv.gz')
You could try to compress the data inside the loop, because otherwise it will fail with the same MemoryError again when merging. This article has some ideas for loading less data:
Ref: https://pythonspeed.com/articles/pandas-load-less-data/
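As a sketch of what shrinking each piece could look like, using the sample columns from the question; the dtype choices are assumptions about your data, e.g. that FIELD_1/FIELD_2 are small integers and FIELD_3/FIELD_7 have only a few distinct values:

import pandas as pd

def shrink(chunk):
    # downcast 64-bit numbers to the smallest integer type that fits
    for col in ['FIELD_1', 'FIELD_2']:
        chunk[col] = pd.to_numeric(chunk[col], downcast='integer')
    # low-cardinality strings use far less memory as categoricals
    for col in ['FIELD_3', 'FIELD_7']:
        chunk[col] = chunk[col].astype('category')
    return chunk

Calling shrink(chunk) at the top of the loop keeps every piece small before it is written out or merged.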