Fastest way to read table2 for each value of table1 in Python?

Question

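I want to compute a matching, or marriages. I have a basket of men in table2 and a basket of women in table1. For each row of table1, I want to evaluate the distance between that particular individual idx1 and all rows of table2. I then select the argmin idx2 and record that idx1 is matched with idx2. The algorithm is not finished at that point, because I want to remove idx2 from the basket of men (table2) before moving on to the next idx1. The distance is a function of variables from table1 and table2; typically, score_str = '(table2[age]-table1[age][idx1])^2 + (table2[diploma]-table1[diploma][idx1])^2' (in the code below, table1[varname][idx1] becomes temp[varname]).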

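Since I use pandas, I wrote the following code, but it takes about 15 seconds to match roughly 2,000 men with 2,000 women. I am not sure pandas is an advantage here; I may have to switch. Computation time matters in this case, because I want to match much larger databases (about 2 million rows).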

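Edit: the first comment is right: this is a quadratic algorithm, so it will take time no matter what, and size = 2,000,000 is certainly still a dream. A further step will be to split the large dataset into smaller blocks (but that will be done from an economist's point of view). The faster the algorithm, the larger those blocks can be, so improving computation time still matters to me.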
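To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. It assumes running time scales exactly as n**2 and takes the 15 seconds for 2,000 rows quoted above as the baseline; the helper name extrapolate_quadratic is made up for illustration, and real timings will deviate from this idealized model.

```python
def extrapolate_quadratic(t_small, n_small, n_large):
    """Estimate the runtime at n_large, assuming time grows as n**2."""
    return t_small * (n_large / n_small) ** 2


# Baseline from the question: about 15 s for 2,000 x 2,000 individuals.
seconds = extrapolate_quadratic(15.0, 2000, 2_000_000)
days = seconds / 86400  # 86,400 seconds per day
print(f"{seconds:.0f} s, about {days:.0f} days")
```

Under the same assumption, halving the block size divides the cost of each block by four, which is why splitting the data into blocks helps so much.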

Thanks for your help.

import numpy as np
import pandas as pd

size = 5000
# Score expression, evaluated with eval() for each individual `temp` of table1.
score_str = "(table2['age']-temp['age'])**2 +  5*(table2['diploma']-temp['diploma'])"

table2 = pd.DataFrame(np.random.randn(size, 2), columns=['age','diploma'])
table1 = pd.DataFrame(np.random.randn(size, 2), columns=['age','diploma'])

match = pd.Series(0, index=table1.index)
index2 = pd.Series(True, index=table2.index)  # True while a man is still available
k_max = min(len(table2), len(table1))
def matching():
    for k in range(k_max):
        temp = table1.iloc[k]
        score = eval(score_str)[index2]  # keep only the men not yet matched
        idx2 = score.idxmin()  # closest remaining man (argmin of the score)
        match.iloc[k] = idx2
        index2[idx2] = False  # remove the matched man from the basket

    return match

matching()

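Edit: I did not go with RussW's idea; instead, I translated my code from pandas to numpy. That is a first step toward using a lower-level language, isn't it? This way my simulation is much faster. With n = 2,000,000, the computation would take about seven hours. In my world (microeconomics), that seems a reasonable time.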

import time
import numpy as np

def run_time_np(n):
    table2 = np.random.randint(0,100, [n,2])
    table1 = np.random.randint(0,100, [n,2])
    # Append the original row index as a third column so it survives deletions.
    idx2 = np.array([np.arange(n)])
    table2 = np.concatenate((table2, idx2.T), axis=1)

    match = np.empty(n, dtype=int)
    k_max = min(len(table2), len(table1))
    score_str = "(table2[:,0]-temp[0])**2 +  5*(table2[:,1]-temp[1])"
    start = time.perf_counter()
    for k in range(k_max):
        temp = table1[k]
        score = eval(score_str)
        idx = score.argmin()  # position of the closest remaining man
        idx2 = table2[idx,2]  # his original index
        match[k] = idx2
        table2 = np.delete(table2, idx, 0)  # remove him from the basket
    print('size:', n, '; computation time:', time.perf_counter() - start)
    return match
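
One possible variation on the numpy version, sketched here under assumptions: instead of copying table2 with np.delete on every iteration, keep a boolean mask of the men still available and set the score of already-matched men to infinity. The function name greedy_match and the mask approach are illustrative choices, not part of the code above, and the score used is the squared distance given in the description. The algorithm is still quadratic; this only removes the per-iteration array copy and the eval call.

```python
import numpy as np


def greedy_match(table1, table2):
    """Match each row of table1 to the closest still-available row of table2.

    Both tables are 2-column arrays (age, diploma). Returns, for each row of
    table1, the index of the matched row in table2 (-1 if none was left).
    """
    n1, n2 = len(table1), len(table2)
    match = np.full(n1, -1, dtype=int)
    available = np.ones(n2, dtype=bool)  # True while a man is unmatched
    age2 = table2[:, 0].astype(float)
    dip2 = table2[:, 1].astype(float)
    for k in range(min(n1, n2)):
        # Squared distance between woman k and every man.
        score = (age2 - table1[k, 0]) ** 2 + (dip2 - table1[k, 1]) ** 2
        score[~available] = np.inf  # men already matched can never win
        idx2 = int(score.argmin())
        match[k] = idx2
        available[idx2] = False
    return match


women = np.array([[30, 2], [31, 2]])
men = np.array([[31, 2], [30, 2], [50, 5]])
print(greedy_match(women, men))
```

In this small example the first woman takes the man at index 1 (an exact match), so the second woman is left with the man at index 0.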

Answer 1

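You should definitely use a profiler to see where your code spends its time. That will tell you whether pandas is what slows it down. Also, as far as I understand your algorithm, it is quadratic, O(n^2). I don't think you can run it on tables of 2 million rows in a reasonable time.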
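
As a minimal sketch of what profiling could look like with the standard library's cProfile and pstats modules: the helper profile_call and the stand-in workload slow_pairwise are hypothetical names introduced for illustration; in practice you would pass your own matching() function instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the functions with the largest cumulative time first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


def slow_pairwise(n):
    """Stand-in quadratic workload (hypothetical, not the question's code)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += (i - j) ** 2
    return total


result, report = profile_call(slow_pairwise, 200)
print(report)
```

The report lists call counts and time per function, so pandas indexing or eval overhead would show up as separate entries if they dominate.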

Answer 2

For each individual in one table, you are comparing that person against every individual in the other table with the function (age difference)^2 + (diploma difference)^2.

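One idea to reduce the number of operations is to use groups/buckets to first find a small set of candidate individuals, and then compare against only that small set, with the same function, to find the match.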

Taking one table, build 2 new tables, age_groups and dip_groups. In age_groups you would have buckets of ages (this can be done with a dict), using keys of the form (minage, maxage), for example (20, 25). Do the same for dip_groups.
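
A minimal sketch of how such bucket tables could be built, assuming each individual is an (age, diploma) tuple stored under an integer id and that buckets have a fixed width; the helper build_groups, the sample data, and the widths are all made up for illustration.

```python
from collections import defaultdict


def build_groups(people, column, width):
    """Map (low, high) bucket keys to the set of ids falling in that bucket.

    people: dict of id -> (age, diploma); column: 0 for age, 1 for diploma.
    """
    groups = defaultdict(set)
    for idx, person in people.items():
        low = (person[column] // width) * width
        groups[(low, low + width - 1)].add(idx)
    return dict(groups)


# Hypothetical sample of table2: id -> (age, diploma)
table2 = {0: (19, 1), 1: (22, 3), 2: (23, 4), 3: (25, 4)}
age_groups = build_groups(table2, column=0, width=3)
dip_groups = build_groups(table2, column=1, width=2)

# Candidates aged 21-23 with diploma level 4-5:
candidates = age_groups[(21, 23)] & dip_groups[(4, 5)]
print(candidates)
```

Storing ids in sets keeps the intersection of an age bucket and a diploma bucket cheap, and removing a matched individual is a single set operation per table.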

Your algorithm would then look like this (pseudocode):

for individual in table1:
    age, diploma = individual
    # Visit bucket pairs from nearest to farthest.
    for age_bucket, dip_bucket in iterate_buckets(age, diploma):
        matches = age_bucket.intersection(dip_bucket)
        if matches:
            match = get_best_match(matches, age, diploma)
            all_matches.append((individual, match))
            remove_individual(age_groups, match)
            remove_individual(dip_groups, match)
            break  # matched: move on to the next individual

The key pieces are the iterate_buckets() and get_best_match() functions.

age_groups = [(18, 20), (21, 23), (24, 26)]  # ... and so on
dip_groups = [(1, 2), (3, 4), (5, 6)]  # ... and so on
group_combinations = [(ag, dg) for ag in age_groups for dg in dip_groups]

def iter_key(age_grp, dip_grp, age, dip):
    # Squared distance from (age, dip) to the centre of the bucket pair.
    age_av = sum(age_grp) / 2.0
    dip_av = sum(dip_grp) / 2.0
    return pow(age - age_av, 2) + pow(dip - dip_av, 2)

def iterate_buckets(age, dip):
    # Yield the bucket pairs, nearest first.
    combs = sorted(group_combinations, key=lambda grp: iter_key(*grp, age, dip))
    for c in combs:
        yield c

def match_key(indiv1, indiv2):
    age1, dip1 = indiv1
    age2, dip2 = indiv2
    return pow(age1 - age2, 2) + pow(dip1 - dip2, 2)

def get_best_match(matches, age, dip):
    # Assumes each candidate in matches is an (age, diploma) tuple.
    return min(matches, key=lambda m: match_key(m, (age, dip)))

Just an idea; I'm not 100% sure it would be faster or produce the same expected results.

Answer 1: 0, 2013-07-25 09:10:33

Answer 2: 0, accepted, 2013-07-25 10:11:58
