
Improving performance when looping in a big data set

I am doing some spatio-temporal analysis (with MATLAB) on a quite big data set, and I am not sure what the best strategy is, performance-wise, for my script.

Actually, the data set is split into 10 yearly arrays of dimension (latitude, longitude, time) = (50, 60, 8760).

The general structure of my analysis is:

 for iteration = 1:BigNumber

   1. Select a specific site of spatial reference (i,j).
   2. Do some calculations on the whole time series of site (i,j).
   3. Store the result in an archive array.

 end

My question is:

Is it better (in terms of general performance) to have:

1) All data kept in big yearly (50,60,8760) arrays, loaded once as global variables. At each iteration the script then has to extract one particular site (i,j,:) from those arrays for processing.

2) 50*60 distinct files stored in a folder, each containing the time series of one particular site (a vector of dimension (total time range, 1)). At each iteration the script then has to open a specific file from the folder, process its data, and close it again. (Both access patterns are sketched below.)
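For concreteness, a minimal sketch of what the inner work of each option could look like (the variable name yearData, the file naming pattern, and the mean call are hypothetical placeholders, not part of the original question):

 % Option 1: one big preloaded array, slice out one site per iteration
 % yearData is assumed to be a 50x60x8760 array already in the workspace
 result = zeros(50, 60);
 for i = 1:50
     for j = 1:60
         ts = squeeze(yearData(i,j,:));   % 8760x1 time series of site (i,j)
         result(i,j) = mean(ts);          % placeholder for the real computation
     end
 end

 % Option 2: one file per site, opened and closed inside the loop
 % each hypothetical .mat file is assumed to hold a vector ts of size (total time range, 1)
 result = zeros(50, 60);
 for i = 1:50
     for j = 1:60
         s = load(sprintf('site_%02d_%02d.mat', i, j));
         result(i,j) = mean(s.ts);        % placeholder for the real computation
     end
 end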

Because your computations run over the entire time series, I would suggest storing the data in a 3000x8760 matrix and doing the computations that way.

Your accesses will then be more cache-friendly.

You can reformat your data using the reshape function:

newdata = reshape(olddata,50*60,8760);

Now, instead of accessing olddata(i,j,:), you need to access newdata(sub2ind([50 60],i,j),:).
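As an illustration, a minimal sketch of the reshaped access pattern inside the loop (the archive array and the mean call are placeholders for the real result storage and computation):

 % reshape once, outside the loop: row k of newdata holds the full
 % time series of the site whose linear index is k
 newdata = reshape(olddata, 50*60, 8760);

 archive = zeros(50, 60);                         % preallocated result array
 for i = 1:50
     for j = 1:60
         ts = newdata(sub2ind([50 60], i, j), :); % 1x8760 time series of site (i,j)
         archive(i,j) = mean(ts);                 % placeholder for the real computation
     end
 end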

After doing some experiments, it is clear that the second option, with 3000 distinct files, is much slower than manipulating big arrays loaded in the workspace. However, I didn't try loading all 3000 files into the workspace before computing (a tad too much).

It looks like reshaping the data helps a little bit.

Thanks to all contributors for your suggestions.
