I use the following code to read a large CSV file (6–10 GB), replace the header row, and then export it to CSV again:

import csv
import pandas as pd

df = pd.read_csv('read file')
df.columns = ['list of headers']
df.to_csv('outfile', index=False, quoting=csv.QUOTE_NONNUMERIC)

But this approach is extremely slow and I run out of memory. Any suggestions?
Sorry, I don't have enough reputation to comment, so I'll leave an answer. First, would you try adding the low_memory parameter when you read the file? ( https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html )

df = pd.read_csv('read file', low_memory=False)
Second, how about checking the memory usage with info()?

df = pd.read_csv('read file')
df.columns = ['list of headers']
df.info()
Third, based on Mohit's suggestion, read the file in chunks:

# set a chunk size so the big file is read into memory one piece at a time
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # process each chunk of your file content here
    ...
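Putting the chunking idea together with the header replacement, a minimal sketch could look like the following (the function name, file names, and column names are hypothetical placeholders, not from the question):

```python
import csv
import pandas as pd

def replace_header(infile, outfile, new_columns, chunksize=10 ** 6):
    """Stream `infile` through pandas in chunks, writing `outfile`
    with `new_columns` as the header row, so only one chunk is ever
    held in memory at a time."""
    with open(outfile, 'w', newline='') as out:
        for i, chunk in enumerate(pd.read_csv(infile, chunksize=chunksize)):
            chunk.columns = new_columns
            # write the new header only once, before the first chunk
            chunk.to_csv(out, index=False, header=(i == 0),
                         quoting=csv.QUOTE_NONNUMERIC)
```

Note that passing an already-open file handle to to_csv keeps appending to the same file, and header=(i == 0) ensures the replacement header is emitted exactly once.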
Rather than reading the whole 6 GB file into pandas, could you not just write the new header to a new file and then stream the rest in? Something like this:

import fileinput

with open('outfile.csv', 'w') as outfile:
    outfile.write(','.join(['list of headers']) + '\n')
    with fileinput.input(files=('infile.csv',)) as f:
        for i, line in enumerate(f):
            # skip the original header line
            if i == 0:
                continue
            outfile.write(line)
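For a multi-gigabyte file, copying line by line in a Python loop can be slow. A sketch of the same idea using shutil.copyfileobj from the standard library, which streams the body in large blocks (the function and file names here are hypothetical, for illustration only):

```python
import shutil

def rewrite_header(infile, outfile, new_columns):
    """Write `new_columns` as the header of `outfile`, then stream the
    body of `infile` after it without loading the file into memory."""
    with open(infile, 'r') as src, open(outfile, 'w') as dst:
        src.readline()                           # skip the original header row
        dst.write(','.join(new_columns) + '\n')  # write the replacement header
        shutil.copyfileobj(src, dst)             # copy the rest in large blocks
```

Since the body is copied verbatim, this also preserves the original quoting of every data row, which the pandas round trip does not.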