Python/Pandas: How to do a join on matches across multiple columns

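Question:

I want to join two tables using a more complex join condition. One table has 3 possible phone numbers, and the other has 2. I don't know which is the primary phone number on each row, so I want to join on the criterion that a phone number in the number, Phone1, or Phone2 column of the first table can be found in the ANI or DNIS column of the second table.

Sample data:

One DataFrame looks like this...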

                           application_uuid      number      Phone1      Phone2
0      b7754a2e-84be-4aec-a04e-0eba93dca5d8  5196942368         NaN         NaN
1      6ca3f0c3-0c83-4ebd-afe3-23977f1c6608  6475219092         NaN         NaN
2      3b5a083e-7765-4f27-941d-d2b4cbd6f26a  6476256563         NaN         NaN
3      229fee54-437f-4812-abec-7034fcb9a655        None         NaN         NaN
4      866a2cd2-5628-4e6b-b649-d92e2f0585ce  7092164418  7096391545  7092164977
5      8259410d-8d3d-4381-a0b3-6d6ce67b0917  6476387217  6476387217  6475313526
6      c359b03b-5e5f-4d4e-a5b0-ee37ac90c292        None         NaN         NaN
7      d70414a9-8fd9-4d1d-a77d-17f06743fd00  7054987969         NaN         NaN
8      0452edf9-2d58-4ad5-b1e2-0621ac517104  6136219401         NaN         NaN
9      cb3ab85c-fd42-4aff-a9b8-1743565b31e6        None         NaN         NaN
10     563e3e4d-e59a-4afc-b804-91aa14de919d  7056582202  7056582202  7056584200
11     3dd1df61-a36f-490b-ac15-225a83a21551        None  7096899998  7096899998
12     6bc42df3-e869-4794-a595-e3238ccf5284  5873415009         NaN         NaN
13     8bf11117-038f-4d2d-b4c6-9b2c6423d626  6473435642         NaN         NaN
14     0a854fe5-af66-40b0-b202-3e9367dc5a75  6478594204         NaN         NaN
15     b5884de8-2e0c-4b38-a3fd-7911cf4840b1  7787075288  7787075288  7787075288
16     f74cf212-cff0-48cc-b210-539dcdcccf72  7802676838  7806678567        None
17     9bffe5bf-b5d8-4e74-b4c9-9f1b5b238af3        None         NaN         NaN
18     dce91c00-a1ea-4111-a6ee-5ff5fd0cfb5f  6476093140         NaN         NaN
19     29cd024e-2c51-4682-b274-809c3cfb2b2b        None         NaN         NaN
20     ec55317b-fc20-416a-b26d-e95300f89c79        None         NaN         NaN
21     b3d00cd8-9d8e-415e-99b1-d8944e7b31e1        None         NaN         NaN
22     b3328787-edb7-4e08-a76c-370a74135fba        None         NaN         NaN
23     c8baf235-e702-41db-b4f8-8c2bf38109bf        None         NaN         NaN
24     cd9179bc-0594-4d25-9d7f-ddf6671777e2  7802428155         NaN         NaN
25     370855c0-b3fa-4d87-8d54-b84d34e7f35f        None         NaN         NaN
26     82244e78-3802-4890-96f6-e5267172f0e9        None         NaN         NaN
27     c7b0054c-29ac-4c76-bc5d-8cdbc93f5157  7052093358  7055268791  7052093358
28     d90e6e87-f7ef-43e1-9c85-35572fae838c  4039696044         NaN         NaN
29     bdd2474f-f4be-402b-8672-d73da90d7066        None         NaN         NaN

The other DataFrame looks like this...

        CALL ID CALL TYPE         ANI          DNIS TALK TIME
0        615262   Inbound  6479246923  8.557236e+09  00:00:00
1        615263   Inbound  5196519186  8.557236e+09  00:00:00
2        615264   Inbound  7095679350  8.557236e+09  00:00:00
3        615265   Inbound  7095679350  8.557236e+09  00:00:00
4        615266   Inbound  7095679350  8.557236e+09  00:00:00
5        615267   Inbound  7095679350  8.557236e+09  00:00:00
6        615268   Inbound  7095679350  8.557236e+09  00:00:00
7        615269   Inbound  7095679350  8.557236e+09  00:00:00
8        615270   Inbound  7095679350  8.557236e+09  00:00:00
9        615271   Inbound  7095679350  8.557236e+09  00:00:00
10       615272   Inbound  4035634231  8.557236e+09  00:00:00
11       615273   Inbound  7095679350  8.557236e+09  00:00:00
12       615274   Inbound  7095679350  8.557236e+09  00:00:00
13       615275   Inbound  7095679350  8.557236e+09  00:00:00
14       615276   Inbound  7095679350  8.557236e+09  00:00:00
15       615277   Inbound  7095679350  8.557236e+09  00:00:00
16       615278   Inbound  7095679350  8.557236e+09  00:00:00
17       615279   Inbound  9057972416  8.557236e+09  00:00:00
18       615280   Inbound  9057972416  8.557236e+09  00:00:00
19       615281   Inbound  9057972416  8.557236e+09  00:00:00
20       615282    Manual  8557235626  8.005635e+09  00:00:11
21       615283   Inbound  9057972416  8.557236e+09  00:00:00
22       615284   Inbound  4169991603  8.557236e+09  00:00:00
23       615285    Manual  8557235626  4.162977e+09  00:01:05
24       615286    Manual  8557235626  8.002569e+09  00:00:55
25       615287   Inbound  4169967207  8.557236e+09  00:07:48
26       615288   Inbound  4169788047  8.557236e+09  00:01:29
27       615289   Inbound  9057972416  8.557236e+09  00:01:39
28       615290   Inbound  8002568964  8.557236e+09  00:04:21
29       615291    Manual  8557235626  7.059751e+09  00:00:19

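My approach:

My approach was to add the phone numbers from each row to a list stored as a separate column, and then write a search function. This is impractical, inelegant, and too slow.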

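def f(row):
    # Both lists are rebuilt on every call, which is part of why this is slow.
    phone_numbers_59 = phone_data['Number'].tolist()
    callid = phone_data['CALL ID'].tolist()

    # Collect the CALL ID of every call whose 'Number' value contains
    # one of the phone numbers listed for this row.
    get_callid = []
    for i in range(0, len(phone_numbers_59)):
        if any([x in phone_numbers_59[i] for x in row['Numbers']]):
            get_callid.append(callid[i])

    if len(get_callid) > 0:
        return get_callid
    else:
        return "NA"

# Apply the search row by row.
s = data.apply(f, axis=1)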

Answer:

It is simpler to handle one condition at a time (and avoid writing large for loops in Python; as you have seen, they are slow):

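for col in ('ANI', 'DNIS'):
    # Index the call table by one of its phone columns; drop=False keeps the column too.
    right = df2.set_index(col, drop=False)
    # Left-join against each of the three phone columns of df1 in turn.
    df1 = df1.join(right, on='number', rsuffix='_num_' + col)
    df1 = df1.join(right, on='Phone1', rsuffix='_p1_' + col)
    df1 = df1.join(right, on='Phone2', rsuffix='_p2_' + col)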

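This adds columns to df1 six times: once per combination. rsuffix is used to disambiguate the column names. You may end up with multiple matches (perhaps Phone1 matches ANI and Phone2 matches DNIS), in which case it is up to you how to resolve or combine them (possibly using groupby()).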
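
One possible way to do the combining step is sketched below. It is not the method from the answer above. It swaps the six joins for a reshape: both tables are melted to long form (one phone number per row), the numbers are normalized to digit strings so that the float DNIS values and the string/None values in number compare equal, and a single merge plus groupby() collects the matching CALL IDs per application. The helper name to_long, the call_ids column, and the small sample frames are hypothetical and made up for the demonstration.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the two tables (values invented for the demo).
df1 = pd.DataFrame({
    "application_uuid": ["app-a", "app-b", "app-c", "app-d"],
    "number": ["5550000001", None, "5550000003", "5550000004"],
    "Phone1": [None, "5550000002", "5550000009", None],
    "Phone2": [None, "5550000002", None, None],
})
df2 = pd.DataFrame({
    "CALL ID": [1, 2, 3, 4],
    "ANI": [5550000001, 5550000002, 5550000007, 5550000001],
    "DNIS": [5.550000008e9, 5.550000008e9, 5.550000003e9, 5.550000008e9],
})


def to_long(df, id_col, phone_cols):
    """Stack the phone columns into a single 'phone' column of digit strings."""
    return (
        df.melt(id_vars=[id_col], value_vars=phone_cols,
                var_name="source_col", value_name="phone")
          .dropna(subset=["phone"])  # drops both None and NaN
          # Normalize ints, floats and strings to the same digit-string form.
          .assign(phone=lambda d: d["phone"].map(lambda v: str(int(float(v)))))
          .drop_duplicates(subset=[id_col, "phone"])
    )


apps = to_long(df1, "application_uuid", ["number", "Phone1", "Phone2"])
calls = to_long(df2, "CALL ID", ["ANI", "DNIS"])

# One row per (application, call) pair that shares at least one phone number.
matches = (
    apps.merge(calls, on="phone", suffixes=("_app", "_call"))
        [["application_uuid", "CALL ID"]]
        .drop_duplicates()
)

# Collect the matching call IDs per application and attach them to df1.
call_ids = matches.groupby("application_uuid")["CALL ID"].apply(list)
result = df1.join(call_ids.rename("call_ids"), on="application_uuid")
print(result[["application_uuid", "call_ids"]])
```

Applications with no matching call end up with NaN in call_ids. Because the matches are grouped before being joined back, df1 keeps one row per application even when a phone number appears in many calls, whereas joining directly against a non-unique index repeats the left-hand row once per match.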
