
How can I improve efficiency in this lookup program?

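I have two large data sets: search is 340,000 x 1 and field is 348,000 x 2. My goal is to take each element of search, find its position in field(:,1), and then collect the corresponding entries of field(:,2) in a new cell array called result.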

Using cellfun directly on the full data ran out of memory, so I had to split the data set into subsets and then compile the results.

To do that I built the following program, but it takes a very long time: 2 hours 40 minutes!

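My question is, how can I perform this task more efficiently? Do I need to modify the existing code, or do I need a completely different approach to the problem?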

function result = bigdatacmp(search,field)

%BIGDATACMP(SEARCH,FIELD) takes strcmp jobs that require excessive amounts
%   of memory and splits them up into manageable subsets. The results of the
%   subsets are then compiled to represent the original set.


tic

subsets = floor(size(search,1)/1000);       %Divides search into subsets
difference = size(search,1) - 1000*subsets; %# of elements in last subset

result = cell(0);                           %Establish empty variables

%Loops through all subsets. Finds location of matches in the first column
%of field. Compiles subset locations. Compiles results from second column
%of field.
for i = 1:subsets

    searchvalues = search(1000*i-999:1000*i);

    Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),...
        searchvalues,'UniformOutput',false);

    result(1000*i-999:1000*i) = cellfun(@(x)(field(x,2)),...
        Zlogic,'UniformOutput',false);
end

%Performs same calculations as in loop, but for the final subset.
Zlogic = cellfun(@(x)(strcmp(x,field(:,1))),search(size(search,1)-...
    difference+1:size(search,1)),'UniformOutput',false);

result(end+1:end+difference) = cellfun(@(x)(field(x,2)),Zlogic,...
    'UniformOutput',false);

result = result';

toc
end

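348k is not that large. Consider building a containers.Map object that maps each entry in the first column of field to the corresponding entry in the second column. That way you don't need an exhaustive search of field for every entry in search.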

[Edited to add:] If the total number of entries is 348k, I don't think there is any need to split the data further.
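
A minimal sketch of what the containers.Map approach could look like, using made-up three-row data in place of the real search and field arrays. It assumes the keys in field(:,1) are unique character vectors; containers.Map keeps only the last value for a repeated key, so unlike the strcmp version it cannot return several matches for one search entry. The variable names lookup and found are illustrative only.

```matlab
% Toy stand-ins for the real data (assumption: keys are unique char vectors).
field  = {'apple', 'A1'; 'banana', 'B2'; 'cherry', 'C3'};
search = {'cherry'; 'apple'; 'durian'};

% Build the lookup table once: first column of field -> second column.
lookup = containers.Map(field(:,1), field(:,2));

% Look up all search entries in one pass, with no splitting into subsets.
result = cell(size(search));                    % unmatched entries stay empty
found  = isKey(lookup, search);                 % logical mask, one flag per search entry
result(found) = values(lookup, search(found));  % fetch values for the matched keys only
```

Each lookup is a hash-table access rather than a strcmp against every row of field, and only one logical mask the size of search is created, so the memory that forced the split into blocks of 1000 should no longer be needed.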
