
How to find "nearest" value in a large list in Erlang

Suppose I have a large collection of integers (say 50,000,000 of them).

I would like to write a function that returns the largest integer in the collection that doesn't exceed a value passed as a parameter to the function. E.g. if the values were:

 Values = [ 10, 20, 30, 40, 50, 60]

then find(Values, 25) should return 20.

The function will be called many times a second and the collection is large. Assuming that the performance of a brute-force search is too slow, what would be an efficient way to do it? The integers would rarely change, so they can be stored in whatever data structure gives the fastest access.

I've looked at gb_trees, but I don't think you can obtain the "insertion point" and then get the previous entry.

I realise I could do this from scratch by building my own tree structure, or by binary-chopping a sorted array, but is there some built-in way to do it that I've overlooked?

To find the nearest value in a large unsorted list, I'd suggest you use a divide-and-conquer strategy and process different parts of the list in parallel. Sufficiently small parts of the list can be processed sequentially.

Here is the code:

-module( finder ).
-export( [ nearest/2 ] ).

-define( THRESHOLD, 1000 ).

%%
%% sequential finding of nearest value
%%
%% if no nearest value exists, return null ('nearest' here means strictly less than Val)
%%
nearest( Val, List ) when length(List) =< ?THRESHOLD ->
        lists:foldl(
                fun
                ( X, null ) when X < Val ->
                        X;
                ( _X, null ) ->
                        null;
                ( X, Nearest ) when X < Val, X > Nearest ->
                        X;
                ( _X, Nearest ) ->
                        Nearest
                end,
                null,
                List );
%%
%% split large lists and process each part in parallel
%%
nearest( Val, List ) ->
        { Left, Right } = lists:split( length(List) div 2, List ),
        Ref1 = spawn_nearest( Val, Left ),
        Ref2 = spawn_nearest( Val, Right ),
        Nearest1 = receive_nearest( Ref1 ),
        Nearest2 = receive_nearest( Ref2 ),
        %%
        %% compare nearest values from each part
        %%
        case { Nearest1, Nearest2 } of
                { null, null } ->
                        null;
                { null, Nearest2 } ->
                        Nearest2;
                { Nearest1, null } ->
                        Nearest1;
                { Nearest1, Nearest2 } when Nearest2 > Nearest1 ->
                        Nearest2;
                { Nearest1, Nearest2 } when Nearest2 =< Nearest1 ->
                        Nearest1
        end.

spawn_nearest( Val, List ) ->
        Ref = make_ref(),
        SelfPid = self(),
        spawn(
                fun() ->
                        SelfPid ! { Ref, nearest( Val, List ) }
                end ),
        Ref.

receive_nearest( Ref ) ->
        receive
                { Ref, Nearest } -> Nearest
        end.


Testing in the shell:

1> c(finder).
{ok,finder}
2> 
2> List = [ random:uniform(1000) || _X <- lists:seq(1,100000) ].
[444,724,946,502,312,598,916,667,478,597,143,210,698,160,
 559,215,458,422,6,563,476,401,310,59,579,990,331,184,203|...]
3> 
3> finder:nearest( 500, List ).
499
4>
4> finder:nearest( -100, lists:seq(1,100000) ).
null
5> 
5> finder:nearest( 40000, lists:seq(1,100000) ).
39999
6> 
6> finder:nearest( 4000000, lists:seq(1,100000) ).
100000

Performance (single node):

7> 
7> timer:tc( finder, nearest, [ 40000, lists:seq(1,10000) ] ). 
{3434,10000}
8> 
8> timer:tc( finder, nearest, [ 40000, lists:seq(1,100000) ] ).
{21736,39999}
9>
9> timer:tc( finder, nearest, [ 40000, lists:seq(1,1000000) ] ).
{314399,39999}

Versus plain iteration:

1> 
1> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,10000) ] ).
{14994,null}
2> 
2> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,100000) ] ).
{141951,null}
3>
3> timer:tc( lists, foldl, [ fun(_X, Acc) -> Acc end, null, lists:seq(1,1000000) ] ).
{1374426,null}

So you can see that, on a list with 1,000,000 elements, finder:nearest is faster than a plain traversal of the list with lists:foldl.

You can experiment to find the optimal value of THRESHOLD for your case.

You may also improve performance by spawning the processes on different nodes.
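A minimal sketch of that idea, assuming the nodes are already connected and the finder module is loaded on each of them (spawn/2 takes the target node as its first argument). Note that sending the sublist to a remote node copies it, which can be expensive for very large lists:

%% hypothetical variant of spawn_nearest/2: run the worker on Node
spawn_nearest( Val, List, Node ) ->
        Ref = make_ref(),
        SelfPid = self(),
        spawn( Node,
                fun() ->
                        SelfPid ! { Ref, nearest( Val, List ) }
                end ),
        Ref.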

Here is another code sample that uses ets. I believe a lookup would be made in about constant time:

1> ets:new(tab,[named_table, ordered_set, public]).
2> lists:foreach(fun(N) -> ets:insert(tab,{N,[]}) end, lists:seq(1,50000000)).
3> timer:tc(fun() -> ets:prev(tab, 500000) end).
{21,499999}
4> timer:tc(fun() -> ets:prev(tab, 41230000) end).
{26,41229999}

The surrounding code would be a bit more than this, of course, but it is rather neat.
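As a sketch of that surrounding code (assuming the keys are the integers themselves, as in the session above): ets:prev/2 returns the largest key strictly smaller than the given key, so an exact match has to be checked first. The floor_key name is made up here:

%% largest key in Tab not exceeding Key, or null
floor_key(Tab, Key) ->
    case ets:member(Tab, Key) of
        true  -> Key;                        %% Key itself is in the table
        false ->
            case ets:prev(Tab, Key) of
                '$end_of_table' -> null;     %% every key exceeds Key
                Prev            -> Prev
            end
    end.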

So if the input isn't sorted, you can get a linear version by doing:

closest(Target, [Hd | Tl ]) ->
        closest(Target, Tl, Hd).

closest(_Target, [], Best) -> Best;
closest(Target, [ Target | _ ], _) -> Target;
closest(Target, [ N | Rest ], Best) ->
    CurEps = erlang:abs(Target - Best),
    NewEps = erlang:abs(Target -  N),
    if NewEps < CurEps ->
            closest(Target, Rest, N);
       true ->
            closest(Target, Rest, Best)
    end.

You should be able to do better if the input is sorted.
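Since the question actually wants the largest value not exceeding the target, one way to do better on sorted input is a binary chop over a sorted tuple. This is only a sketch with a hypothetical floor_search module (element/2 on a tuple gives O(1) access by index):

-module(floor_search).
-export([from_list/1, floor/2]).

%% build the sorted tuple once, up front
from_list(List) ->
    list_to_tuple(lists:usort(List)).

%% largest element =< Target, or null if every element exceeds Target
floor(Tuple, Target) ->
    floor(Tuple, Target, 1, tuple_size(Tuple), null).

floor(_Tuple, _Target, Lo, Hi, Best) when Lo > Hi ->
    Best;
floor(Tuple, Target, Lo, Hi, Best) ->
    Mid = (Lo + Hi) div 2,
    case element(Mid, Tuple) of
        V when V =< Target -> floor(Tuple, Target, Mid + 1, Hi, V);
        _                  -> floor(Tuple, Target, Lo, Mid - 1, Best)
    end.

With the values from the question, floor_search:floor(floor_search:from_list(Values), 25) would return 20.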

I invented my own metric for 'closest' here, as I allow the closest value to be higher than the target value; you could change it to 'closest but not greater than' if you liked.
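If you wanted the 'closest but not greater than' behaviour, one possible variant (a sketch, folding over the list and returning none when every value exceeds the target) would be:

closest_not_greater(Target, List) ->
    lists:foldl(
      fun(X, Best) when X =< Target, (Best =:= none orelse X > Best) -> X;
         (_X, Best) -> Best
      end,
      none,
      List).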

In my opinion, if you have a huge collection of data that does not change often, you should think about organizing it. I have written a simple structure based on an ordered list, including insertion and deletion functions. It gives good results for both inserting and searching.

-module(finder).

-export([test/1,find/2,insert/2,remove/2,new/0]).

-compile(export_all).

new() -> [].

insert(V,L) -> 
    {R,P} = locate(V,L,undefined,-1),
    insert(V,R,P,L).

find(V,L) -> 
    locate(V,L,undefined,-1).

remove(V,L) ->  
    {R,P} = locate(V,L,undefined,-1),
    remove(V,R,P,L).

test(Max) -> 
    {A,B,C} = erlang:now(),
    random:seed(A,B,C),
    L = lists:seq(0,100*Max,100),
    S = random:uniform(100000000),
    I = random:uniform(100000000),
    io:format("start insert at ~p~n",[erlang:now()]),
    L1 = insert(I,L),
    io:format("start find at ~p~n",[erlang:now()]),
    R = find(S,L1),
    io:format("end at ~p~n result is ~p~n",[erlang:now(),R]).

remove(_,_,-1,L) -> L;
remove(V,V,P,L) ->
    {L1,[V|L2]} = lists:split(P,L),
    L1 ++ L2;
remove(_,_,_,L) ->L.

insert(V,V,_,L) -> L;
insert(V,_,-1,L) -> [V|L];
insert(V,_,P,L) ->
    {L1,L2} = lists:split(P+1,L),
    L1 ++ [V] ++ L2.

locate(_,[],R,P) -> {R,P};
locate (V,L,R,P) -> 
    %% io:format("locate, value = ~p, liste = ~p, current result = ~p, current pos = ~p~n",[V,L,R,P]),
    {L1,[M|L2]} = lists:split(Le1 = (length(L) div 2), L),
    locate(V,R,P,Le1+1,L1,M,L2).

locate(V,_,P,Le,_,V,_) -> {V,P+Le};
locate(V,_,P,Le,_,M,L2) when V > M -> locate(V,L2,M,P+Le);
locate(V,R,P,_,L1,_,_) -> locate(V,L1,R,P).

which gives the following results:

(exec@WXFRB1824L)6> finder:test(10000000).
start insert at {1347,28177,618000}
start find at {1347,28178,322000}
end at {1347,28178,728000}
 result is {72983500,729836}

That is 704 ms to insert a new value into a list of 10,000,000 elements and 406 ms to find the nearest value in the same list.

I tried to get more accurate information about the performance of the algorithm I proposed above and, after reading the very interesting solution by Stemm, I decided to use the timer:tc/3 function. Big disappointment :o). On my laptop the timing accuracy was very poor, so I decided to leave my Core i5 (2 cores * 2 threads) + 2 GB DDR3 + Windows XP 32-bit and use my home PC instead: Phantom (6 cores) + 8 GB + Linux 64-bit.

Now timer:tc works as expected and I am able to manipulate lists of 100,000,000 integers. I could see that I was losing a lot of time calling the length function at each step, so I refactored the code a little to avoid it:

-module(finder).

-export([test/2,find/2,insert/2,remove/2,new/0]).

%% interface

new() -> {0,[]}.

insert(V,{S,L}) -> 
    {R,P} = locate(V,L,S,undefined,-1),
    insert(V,R,P,L,S).

find(V,{S,L}) -> 
    locate(V,L,S,undefined,-1).

remove(V,{S,L}) ->  
    {R,P} = locate(V,L,S,undefined,-1),
    remove(V,R,P,L,S).

remove(_,_,-1,L,S) -> {S,L};
remove(V,V,P,L,S) ->
    {L1,[V|L2]} = lists:split(P,L),
    {S-1,L1 ++ L2};
remove(_,_,_,L,S) ->{S,L}.

%% local

insert(V,V,_,L,S) -> {S,L};
insert(V,_,-1,L,S) -> {S+1,[V|L]};
insert(V,_,P,L,S) ->
    {L1,L2} = lists:split(P+1,L),
    {S+1,L1 ++ [V] ++ L2}.

locate(_,[],_,R,P) -> {R,P};
locate (V,L,S,R,P) -> 
    S1 = S div 2,
    S2 = S - S1 -1,
    {L1,[M|L2]} = lists:split(S1, L),
    locate(V,R,P,S1+1,L1,S1,M,L2,S2).

locate(V,_,P,Le,_,_,V,_,_) -> {V,P+Le};
locate(V,_,P,Le,_,_,M,L2,S2) when V > M -> locate(V,L2,S2,M,P+Le);
locate(V,R,P,_,L1,S1,_,_,_) -> locate(V,L1,S1,R,P).

%% test

test(Max,Iter) -> 
    {A,B,C} = erlang:now(),
    random:seed(A,B,C),
    L = {Max+1,lists:seq(0,100*Max,100)},
    Ins = test_insert(L,Iter,[]),
    io:format("insert:~n~s~n",[stat(Ins,Iter)]),
    Fin = test_find(L,Iter,[]),
    io:format("find:~n ~s~n",[stat(Fin,Iter)]).

test_insert(_L,0,Res) -> Res;
test_insert(L,I,Res) ->
    V = random:uniform(1000000000),
    {T,_} = timer:tc(finder,insert,[V,L]),
    test_insert(L,I-1,[T|Res]).

test_find(_L,0,Res) -> Res;
test_find(L,I,Res) ->
    V = random:uniform(1000000000),
    {T,_} = timer:tc(finder,find,[V,L]),
    test_find(L,I-1,[T|Res]).

stat(L,N) ->
    Aver = lists:sum(L)/N,
    {Min,Max,Var} = lists:foldl(fun (X,{Mi,Ma,Va}) -> {min(X,Mi),max(X,Ma),Va+(X-Aver)*(X-Aver)} end, {999999999999999999999999999,0,0}, L),
    Sig = math:sqrt(Var/N),
    io_lib:format("   average: ~p,~n   minimum: ~p,~n   maximum: ~p,~n   sigma   : ~p.~n",[Aver,Min,Max,Sig]).

Here are some results.

1> finder:test(1000,10).
insert:
   average: 266.7,
   minimum: 216,
   maximum: 324,
   sigma   : 36.98121144581393.
find:
   average: 136.1,
   minimum: 105,
   maximum: 162,
   sigma   : 15.378231367748375.
ok
2> finder:test(100000,10).
insert:
   average: 10096.5,
   minimum: 9541,
   maximum: 12222,
   sigma   : 762.5642595873478.
find:
   average: 5077.4,
   minimum: 4666,
   maximum: 6937,
   sigma   : 627.126494417195.
ok
3> finder:test(1000000,10).
insert:
   average: 109871.1,
   minimum: 94747,
   maximum: 139916,
   sigma   : 13852.211285206417.
find:
   average: 40428.0,
   minimum: 31297,
   maximum: 56965,
   sigma   : 7797.425562325042.
ok
4> finder:test(100000000,10).
insert:
   average: 8067547.8,
   minimum: 6265625,
   maximum: 16590349,
   sigma   : 3199868.809140206.
find:
   average: 8484876.4,
   minimum: 5158504,
   maximum: 15950944,
   sigma   : 4044848.707872872.
ok

On the 100,000,000-element list it is slow, and the multi-process approach cannot help with this dichotomy algorithm... That is a weak point of this solution, but if you have several processes requesting a nearest value in parallel, it will still be able to use all the cores.
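A rough sketch of that concurrent-requesters case (parallel_finds/2 is a hypothetical helper, not part of the module above): each lookup runs in its own process against the same immutable structure, so the scheduler can spread them over the available cores:

parallel_finds(S, Targets) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, finder:find(T, S)} end),
                Ref
            end || T <- Targets],
    %% collect results in the order the requests were issued
    [receive {Ref, Result} -> Result end || Ref <- Refs].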

Pascal.
