I have 12 CSV files with a total size of 8.45 GB. I would like to read all of the CSV files into a pandas DataFrame with read_csv.
I tried this code:

# Example with 3 of the files
import pandas as pd

list = ['file-01.csv',
        'file-02.csv',
        'file-03.csv']

li = []
for filename in list:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

concat_df = pd.concat(li, axis=0, ignore_index=True)
Then it showed:
MemoryError: Unable to allocate 784. MiB for an array with shape (1, 102804250) and data type int64
How can I solve this issue?
Thanks,
This may not be possible at all if the combined data simply exceeds your RAM (the 784. MiB in the error is just one int64 array: 102,804,250 values × 8 bytes). However, you will get a much more memory-efficient process if you use a generator rather than appending to a list and concatenating: the list keeps every intermediate DataFrame alive while pd.concat builds the result, so it requires about twice as much memory as the generator approach.
Try this:
import pandas as pd

# Please don't use `list` as a variable name; it shadows the built-in.
file_list = [
    'file-01.csv',
    'file-02.csv',
    'file-03.csv',
]

def yield_dfs(file_list):
    """Generator function that yields one DataFrame at a time."""
    for file_name in file_list:
        df = pd.read_csv(file_name)
        # You may be able to reduce the memory requirements by doing some
        # pre-processing of the DataFrame here, e.g. converting string
        # columns to booleans or categories to save memory.
        yield df

df = pd.concat(yield_dfs(file_list))
I didn't run that code to check for syntax errors, and the specifics may vary a little depending on your paths. If you have enough system memory for the combined DataFrame, this is fairly likely to work. However, you are talking about a very big DataFrame, and how much memory it needs depends a lot on the datatypes you are working with.
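To illustrate that last point, here is a minimal sketch of the kind of pre-processing mentioned in the comment above. The column names ('count', 'price', 'label') are hypothetical; adjust them to your actual data:

import pandas as pd

def shrink(df):
    # Downcast the default int64/float64 columns to the smallest dtype that fits.
    df['count'] = pd.to_numeric(df['count'], downcast='integer')
    df['price'] = pd.to_numeric(df['price'], downcast='float')
    # Repetitive string columns are far smaller as categoricals.
    df['label'] = df['label'].astype('category')
    return df

# Compare memory before and after to see what the conversions save.
df = pd.read_csv('file-01.csv')
print(df.memory_usage(deep=True).sum())
print(shrink(df).memory_usage(deep=True).sum())

You could call something like shrink inside yield_dfs before yielding each frame. pandas can also apply many of these conversions at parse time via read_csv's dtype= mapping (e.g. dtype={'label': 'category'}), which avoids ever allocating the full-width int64 columns.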
import pandas as pd

files = ['file-01.csv',
         'file-02.csv',
         'file-03.csv']

# Start from the first file and concatenate each remaining file
# onto the running result one at a time.
df = pd.read_csv('file-01.csv', index_col=0)
for file in files[1:]:
    df_i = pd.read_csv(file, index_col=0)
    df = pd.concat((df, df_i), axis=0)

df.reset_index(drop=True, inplace=True)
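If the concatenated DataFrame just doesn't fit in RAM however you build it, a common pandas pattern is to stream each file in pieces with read_csv's chunksize parameter and keep only the rows and columns you actually need. A minimal sketch; keep_columns, the column names, and the filter are hypothetical:

import pandas as pd

files = ['file-01.csv', 'file-02.csv', 'file-03.csv']
keep_columns = ['id', 'value']  # hypothetical: only the columns you need

pieces = []
for filename in files:
    # chunksize makes read_csv yield DataFrames of at most 1_000_000 rows
    # instead of parsing the whole file into memory at once.
    for chunk in pd.read_csv(filename, usecols=keep_columns, chunksize=1_000_000):
        pieces.append(chunk[chunk['value'] > 0])  # hypothetical row filter

df = pd.concat(pieces, ignore_index=True)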