

Fastest way to read table2 for each value of table1 with Python?

I want to compute a matching or a marriage. I have, say, a basket of men in table2 and women in table1. For each row of table1, I want to evaluate a distance between that particular individual idx1 and all rows in table2. Then I select the argmin idx2, and record somewhere that idx1 is matched with idx2. The algorithm is not finished there, because I want to remove idx2 from the men basket (table2) before going on to the next idx1. The distance is a function of variables from table1 and table2, typically score_str = '(table2[age] - table1[age][idx1])**2 + (table2[diploma] - table1[diploma][idx1])**2' (table1[varname][idx1] becomes temp[varname] in the code below).

Because I use pandas, I wrote the following code, but it takes 15 seconds to match around 2,000 men and 2,000 women. I'm not sure the use of pandas is an advantage here; I may have to change it. Computing time matters in this case because I'll be matching much, much bigger databases (around 2 million rows).

Edit: the first comments are right, it's a quadratic algorithm, so it will take time anyway, and size = 2,000,000 will surely remain a dream. Another stage will be to split the big dataset into smaller chunks (but that will be done from an economist's point of view). The faster the algorithm, the bigger the chunks can be, so improving the computing time is still important to me.
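For example, a minimal sketch of that chunking stage, assuming a hypothetical 'region' column is available in both tables to partition the population (the column name and the grouping rule are illustrative, not part of the code below):

# Hypothetical chunking sketch: split table1 and table2 on a common
# 'region' column (an assumption), then run the quadratic matching
# inside each chunk only, so each chunk stays small.
for region, chunk1 in table1.groupby('region'):
    chunk2 = table2[table2['region'] == region]
    # run the matching algorithm below on (chunk1, chunk2)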

Thanks for any help.

import pandas as pd
import pdb
import numpy as np

size = 5000 
score_str = "(table2['age']-temp['age'])**2 +  5*(table2['diploma']-temp['diploma'])"

table2 = pd.DataFrame(np.random.randn(size, 2), columns=['age','diploma'])
table1 = pd.DataFrame(np.random.randn(size, 2), columns=['age','diploma'])

match = pd.Series(0, index=table1.index)
index2 = pd.Series(True, index=table2.index)  
k_max = min(len(table2), len(table1))
def matching():
    for k in xrange(k_max):
        temp = table1.iloc[k]                 # current row of table1
        score = eval(score_str)[index2]       # distance to every remaining row of table2
        idx2 = score.idxmin()                 # closest match (argmin of the distance)
        match.iloc[k] = idx2
        index2[idx2] = False                  # remove idx2 from the basket

    return match

matching()

Edit: rather than following RussW's idea, I've translated my code from pandas to numpy. It's the first small step towards a lower-level language, isn't it? That way my simulation is four times faster. With n = 2,000,000 the computation takes seven hours. In my world (microeconomics) that looks like a reasonable time.

import time

def run_time_np(n):
    table2 = np.random.randint(0, 100, [n, 2])
    table1 = np.random.randint(0, 100, [n, 2])
    idx2 = np.array([np.arange(n)])
    table2 = np.concatenate((table2, idx2.T), axis=1)   # keep the original row index in column 2

    match = np.empty(n, dtype=int)
    k_max = min(len(table2), len(table1))
    score_str = "(table2[:,0]-temp[0])**2 +  5*(table2[:,1]-temp[1])"
    start = time.clock()
    for k in xrange(k_max):
        temp = table1[k]
        score = eval(score_str)             # distance to every remaining row of table2
        idx = score.argmin()                # position of the closest match
        idx2 = table2[idx, 2]               # its original index
        match[k] = idx2
        table2 = np.delete(table2, idx, 0)  # remove it before the next iteration
    print 'size: ', n, ' ; computing time: ', time.clock() - start
    return match

You should certainly use a profiler to see where the code is spending its time. You'll be able to see whether pandas is slowing things down. Also, as far as I understand your algorithm, it is quadratic, O(n^2). I don't think you'll be able to run it in a reasonable time for a table with 2 million rows.
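For instance, a minimal profiling sketch using the standard-library cProfile module (here applied to the matching() function from the question):

import cProfile
import pstats

# Profile one run of the matching loop and show the 10 most expensive calls.
cProfile.run('matching()', 'matching.prof')
stats = pstats.Stats('matching.prof')
stats.sort_stats('cumulative').print_stats(10)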

For every individual in one table, you'll be comparing that individual with every individual in the other table using a function such as (age difference)^2 + (diploma difference)^2.

One idea to reduce the number of operations is to use groups/buckets: first find a minimal set of candidate individuals, then compare against that minimal set using the same function to find a match.

Take a table and make 2 new tables, age_groups and dip_groups. In age_groups you would have age buckets (which can be done with a dict) with keys such as (20, 25) -> (minage, maxage). The same for dip_groups.
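As a minimal sketch of building such buckets (the bucket boundaries and the sample individuals are illustrative assumptions):

# Hypothetical bucket construction: each key is a (min, max) range and each
# value is the set of individuals whose attribute falls in that range.
individuals = [(22, 1), (27, 3), (24, 4)]            # sample (age, diploma) tuples
age_groups = {(20, 25): set(), (26, 30): set()}
dip_groups = {(1, 2): set(), (3, 4): set()}

for individual in individuals:
    age, diploma = individual
    for (lo, hi) in age_groups:
        if lo <= age <= hi:
            age_groups[(lo, hi)].add(individual)
    for (lo, hi) in dip_groups:
        if lo <= diploma <= hi:
            dip_groups[(lo, hi)].add(individual)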

Then your algorithm would look like this (pseudocode):

for individual in table1:
    age, diploma = individual
    for age_bucket, dip_bucket in iterate_buckets(age, diploma):
        matches = age_bucket.intersection(dip_bucket)
        if matches:
            match = get_best_match(matches, age, diploma)
            all_matches.append((individual, match))
            remove_individual(age_groups, match)
            remove_individual(dip_groups, match)
            break   # stop at the first (closest) non-empty bucket combination

The main thing would be the iterate_buckets() and get_best_match() functions. 最主要的是iterate_buckets()get_best_match()函数。

age_groups = [(18, 20), (21, 23), (24, 26), ... ]
dip_groups = [(1, 2), (3, 4), (5, 6) ... ]
group_combinations = [(ag, dg) for ag in age_groups for dg in dip_groups]

def iter_key(age_grp, dip_grp, age, dip):
    age_av = sum(age_grp) / 2.0
    dip_av = sum(dip_grp) / 2.0
    return pow(age - age_av, 2) + pow(dip - dip_av, 2)

def iterate_buckets(age, dip):
    combs = sorted(group_combinations, key=lambda grp: iter_key(grp[0], grp[1], age, dip))
    for c in combs:
        yield c

def match_key(indiv1, indiv2):
    age1, dip1 = indiv1
    age2, dip2 = indiv2
    return pow(age1 - age2, 2) + pow(dip1 - dip2, 2)

def get_best_match(matches, age, dip):
    sorted_matches = sorted(matches, key=lambda indiv: match_key(indiv, (age, dip)))
    return sorted_matches[0]

Just an idea; I'm not 100% sure it would be faster or that it would produce the same desired result.
