Efficient way to import large data file, Python

I'm working on a project that gets data from two NetCDF files, each of which is 521.8 MB. Admittedly, these are fairly large files. I am working on a MacBook Pro with 4 GB of memory, but the computer is approximately 4 years old. The code is written in Python.

Each file contains a year's worth of weather data across the Earth, stored as a 4D array over time (length 1460), altitude (length 17), latitude (length 73), and longitude (length 144). I only need certain portions of that information at a time. Specifically, I need all of the time steps, but only one altitude level and only a particular region of latitude and longitude (20x44).

I had code that gathered all of this data from both files, kept only the data I needed, performed calculations, and wrote the results to a text file. Once done with that year, it looped through 63 years of data, which is 126 files of equivalent size. Now the code reports that it runs out of memory right at the beginning of the process. The relevant code seems to be:

from mpl_toolkits.basemap.pupynere import NetCDFFile

#Create the file name for the input data.
ufile="Flow/uwnd."+str(time)+".nc"
vfile="Flow/vwnd."+str(time)+".nc"

#Get the data from that particular file.
uu=NetCDFFile(ufile)
vv=NetCDFFile(vfile)

#Save the values into an array (will be 4-dimensional)
uwnd_short=uu.variables['uwnd'][:]
vwnd_short=vv.variables['vwnd'][:]

So, the first section creates the names of the NetCDF files. The second section gets all the data from the NetCDF files. The third section takes the imported data and places it into 4D arrays. (This may not technically be an array because of how Python works with the data, but I have thought of it as such due to my C++ background. Apologies for the lack of proper vocabulary.) Later on, I separate out the specific data I need from the 4D array and perform the necessary calculations. The trouble is that this used to work, but now my computer runs out of memory while working on the vv=NetCDFFile(vfile) line.

Is there a possible memory leak somewhere? Is there a way to get only the specific range of data I need, so I'm not bringing in the entire file? Is there a more efficient way to go from bringing the data in, to sorting out the section of data I need, to performing calculations with it?
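For the second question, one option is to index the variable object before forcing it into a full array, so that only the needed hyperslab is materialised. Below is a minimal sketch, assuming the dimension order (time, level, lat, lon) described above, with a hypothetical year and hypothetical index bounds for the 20x44 region:

from mpl_toolkits.basemap.pupynere import NetCDFFile

#Open one of the files (a hypothetical year, for illustration).
uu = NetCDFFile("Flow/uwnd.1948.nc")

#Hypothetical index bounds: one altitude level and a 20x44 lat/lon window.
level_idx = 0
lat0, lat1 = 30, 50    #20 latitude points
lon0, lon1 = 60, 104   #44 longitude points

#Slice before building a full array: the result has shape (1460, 20, 44)
#instead of (1460, 17, 73, 144), so far less data is held in memory at once.
uwnd_region = uu.variables['uwnd'][:, level_idx, lat0:lat1, lon0:lon1]

Whether the whole variable still gets read from disk depends on how the reader maps the file, but at least the full 4D array is never built in memory at once.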

What you probably need to do is rechunk the files using nccopy and then process the chunks, since some of the variables seem too large to fit in memory. That, or get more memory (or virtual memory).

The nccopy docs are here: http://www.unidata.ucar.edu/software/netcdf/docs/guide_nccopy.html
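As a rough illustration of that suggestion (a sketch, not the answerer's exact command: the dimension names, chunk sizes, and file names are assumptions based on the array shape described in the question), nccopy can also be driven from the Python script itself:

import subprocess

#Rewrite one input file as a chunked netCDF-4 file ("-k 3" selects netCDF-4,
#which is required for chunking). The chunk sizes follow the access pattern
#above: all time steps, one level, and a 20x44 lat/lon window. Check the real
#dimension names with "ncdump -h" first.
subprocess.check_call([
    "nccopy", "-k", "3",
    "-c", "time/1460,level/1,lat/20,lon/44",
    "Flow/uwnd.1948.nc", "Flow/uwnd.1948.chunked.nc",
])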

For what it's worth, I did wind up having too much data on my computer and was running out of memory. I got my external hard drive working and removed a bunch of files. Then I figured out how to use ncgen, ncdump, etc. I was able to extract from each large file only the data I needed and create a new file containing just that data. This reduced my NetCDF files from 500 MB to 5 MB, and it made the code much quicker to run as well.
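For reference, the same kind of reduction can be sketched directly from Python rather than with the command-line tools mentioned above. The example below uses the netCDF4 package (an assumption, not the tool actually used here) and hypothetical index bounds for the region of interest:

from netCDF4 import Dataset

#Hypothetical input file; write the reduced output in the classic format.
src = Dataset("Flow/uwnd.1948.nc")
dst = Dataset("Flow/uwnd.1948.small.nc", "w", format="NETCDF3_CLASSIC")

#Hypothetical bounds: one altitude level and a 20x44 lat/lon window, all times.
level_idx = 0
lat0, lat1 = 30, 50
lon0, lon1 = 60, 104

#Read just the needed hyperslab from the source file.
data = src.variables["uwnd"][:, level_idx, lat0:lat1, lon0:lon1]

#Create only the dimensions needed downstream (coordinate variables such as
#the lat, lon, and time values are omitted here for brevity).
dst.createDimension("time", data.shape[0])
dst.createDimension("lat", data.shape[1])
dst.createDimension("lon", data.shape[2])

#Write the subset to the new, much smaller file.
var = dst.createVariable("uwnd", "f4", ("time", "lat", "lon"))
var[:] = data

dst.close()
src.close()

Writing the output in the classic format should keep it readable by the same pupynere reader used in the question, so the rest of the processing code would not need to change.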
