
Merging two data frames having multiple same values in merging column

I have a df named SKU_df:

merchant_SKU_filtered   uniqueCol
1313030 1313030_0
1409085 1409085_0
1338516 1338516_0
1409093 1409093_0
1409085 1409085_1
1415090 1415090_0
1490663 1490663_0
1490739 1490739_0
1490739 1490739_1
1491455 1491455_0
1490739 1490739_2
1492511 1492511_0
1492529 1492529_0
1571223 1571223_0
1492529 1492529_1
1571223 1571223_1
1571223 1571223_2
1572056 1572056_0
18718   18718_0
2000842 2000842_0
19749   19749_0
2007254 2007254_0
19749   19749_1
2024743 2024743_0
2107688 2107688_0
21505   21505_0
2124634 2124634_0
2166924 2166924_0
21419   21419_0
2422327 2422327_0
2508406 2508406_0
28046   28046_0
2690493 2690493_0
28046   28046_1
2690493 2690493_1
28046   28046_2
28639   28639_0
4064531 4064531_0
3002680 3002680_0
4262531 4262531_0
34363   34363_0
4369302 4369302_0
4369302 4369302_1
4587911 4587911_0
4500658 4500658_0
4591293 4591293_0
4569125 4569125_0
46810   46810_0

And another df named input_df:

Merchant SKU,Quantity Per Box,NOB,Shipment Status,id_using_regex,prepped_by_initials
1313030 - Rit Dye Drk Grn 8oz 3pk,20,1,Complete,1313030 - Rit Dye Drk Grn 8oz 3pk,w
13296 - Minwax Wax Paste 16oz,45,1,Complete,13296 - Minwax Wax Paste 16oz,Vishal
1338516 - Qukrete Mortar Repair - 5pk,33,1,Complete,1338516 - Qukrete Mortar Repair - 5pk,w
1409085 - Howard Btchr Blck Cndtnr - 5pk,100,2,Complete,1409085 - Howard Btchr Blck Cndtnr - 5pk,w
1409093 - Howard Furniture Wax 3Pk,225,1,Complete,1409093 - Howard Furniture Wax 3Pk,w
1415090 - Werner Ladder Accessories,8,1,Complete,1415090 - Werner Ladder Accessories,w
1436872 - Whink Rust Remover 2Pk,1,1,Complete,1436872 - Whink Rust Remover 2Pk,P
1490663 - 3 pack,4,1,Complete,1490663 - 3 pack,w
1490739 - 6 pack,15,1,Complete,1490739 - 6 pack,A
1490739 - Loctite Blue 242 - 2 pack,23,1,Complete,1490739 - Loctite Blue 242 - 2 pack,B
1490739 - Loctite Blue 242 - 3 pack,99,1,Update AMZ Shipment,1490739 - Loctite Blue 242 - 3 pack,C
1491455 - Granite Gld Plsh Spry 3Pk,100,1,Update AMZ Shipment,1491455 - Granite Gld Plsh Spry 3Pk,w
1492511 - NP1 POLYSEAL WHITE,87,1,Complete,1492511 - NP1 POLYSEAL WHITE,w
1492529 - MasterSeal Sealant/Caulk 4Pk,30,2,Complete,1492529 - MasterSeal Sealant/Caulk 4Pk,w
1571223 - 2 pack,20,3,Complete,1571223 - 2 pack,w
1572056 - Method Dish Pump Refill,40,1,Complete,1572056 - Method Dish Pump Refill,w
1600667 - DAP All Prpse Adhsve 6Pk,22,1,Update AMZ Shipment,1600667 - DAP All Prpse Adhsve 6Pk,
18718 - FLOOD/PPG Additive 2Pk,22,1,Update AMZ Shipment,18718 - FLOOD/PPG Additive 2Pk,w
19749 - Titebond 5004 Prm Wd Glue - 2pk,11,1,Complete,19749 - Titebond 5004 Prm Wd Glue - 2pk,RH
19749 - Titebond II Wood Glue 2Pk,88,1,Complete,19749 - Titebond II Wood Glue 2Pk,RH
2000842 - Powerlock Tape Rule 2Pk,99,1,Complete,2000842 - Powerlock Tape Rule 2Pk,RH
2007254 - DEWALT Claw Hammer,77,1,Complete,2007254 - DEWALT Claw Hammer,RH
2024743 - Dico Nyalox Flap Brush 3Pk,22,1,Update AMZ Shipment,2024743 - Dico Nyalox Flap Brush 3Pk,w
2107688 - Stanley Ftmx Msrng Tpe,34,1,Update AMZ Shipment,2107688 - Stanley Ftmx Msrng Tpe,w
2124634 - Stanley Fat Max Knife,22,1,Update AMZ Shipment,2124634 - Stanley Fat Max Knife,w
21419 - Irwin 81107 No 7 Bit - 5pk,44,1,Update AMZ Shipment,21419 - Irwin 81107 No 7 Bit - 5pk,w
21505 - Irwin 60172 Drill Bit Stand,50,1,Update AMZ Shipment,21505 - Irwin 60172 Drill Bit Stand,RH
2166924 - Stanley Hook Knife,60,1,Update AMZ Shipment,2166924 - Stanley Hook Knife,RH
2422327 - Stanley Surform Round File,75,1,Complete,2422327 - Stanley Surform Round File,w
2508406 - Freud Pilot Bit - 5pk,76,1,Complete,2508406 - Freud Pilot Bit - 5pk,w
2690493 - STANLEY Hex Key Set,40,2,Complete,2690493 - STANLEY Hex Key Set,w
28046 - Arrow Fastener 276 - 12pk,90,1,Complete,28046 - Arrow Fastener 276 - 12pk,RH
28046 - Arrw Fstnr 276 Stpls - 10pk,55,1,Update AMZ Shipment,28046 - Arrw Fstnr 276 Stpls - 10pk,w
28046- Arrow 3/8 staples 2 pk,24,1,Complete,28046- Arrow 3/8 staples 2 pk,w
28639 - 2 pack,24,1,Complete,28639 - 2 pack,w
3002680 - Westinghouse Pull Chain Sckt,2,1,Complete,3002680 - Westinghouse Pull Chain Sckt,w
34363 - Carlon Switch & Outlet Box,24,1,Complete,34363 - Carlon Switch & Outlet Box,RH
4064531 - Korky Valve Rplcmnt,24,1,Update AMZ Shipment,4064531 - Korky Valve Rplcmnt,w
4262531 - Korky Flpper Rplaces Khler 3in,25,1,Update AMZ Shipment,4262531 - Korky Flpper Rplaces Khler 3in,w
4369302 - Korky Toilet Flapper 2Pk,34,1,Complete,4369302 - Korky Toilet Flapper 2Pk,w
4369302 - Korky Unvrsal 3in Flapper,23,1,Complete,4369302 - Korky Unvrsal 3in Flapper,w
4500658 - Enviro-Log Firestrtrs 2PK,12,1,Complete,4500658 - Enviro-Log Firestrtrs 2PK,RH
4569125,12,1,Complete,4569125,w
4587911 - Korky Fill Valve,12,1,Complete,4587911 - Korky Fill Valve,w
4591293 - Mansfield Flapper KIT,12,1,Complete,4591293 - Mansfield Flapper KIT,RH
46810 - Plyprpylne Hsng Wrnch,12,1,Update AMZ Shipment,46810 - Plyprpylne Hsng Wrnch,w

For some Merchant SKUs there are different values of prepped_by_initials. So, after joining these dataframes, the values get mixed up. I just want the prepped_by_initials column to be mapped onto merchant_SKU_filtered.

This is the code I've tried so far:

input_df['merchant_SKU_filtered'] = input_df['Merchant SKU'].str.split(' ').apply(lambda x: x[0])
input_df['merchant_SKU_filtered'] = input_df['merchant_SKU_filtered'].replace('-', '', regex=True)
input_df['merchant_SKU_filtered'] = input_df['merchant_SKU_filtered'].astype(str)
SKU_df['merchant_SKU_filtered'] = SKU_df['merchant_SKU_filtered'].astype(str)

suffix = input_df.groupby(input_df['merchant_SKU_filtered']).cumcount().astype(str)

keylist1 = list(SKU_df['merchant_SKU_filtered'])
dict_lookup1 = dict(zip(input_df['merchant_SKU_filtered'], input_df['prepped_by_initials']))
SKU_df['key1'] = [dict_lookup1[item] for item in keylist1]
SKU_df['key1'] = SKU_df['key1'].replace(np.nan, ' ', regex=True)

input_df['uniqueCol'] = input_df['merchant_SKU_filtered'] + '_' + suffix
key_list = list(SKU_df['uniqueCol'])
dict_lookup = dict(zip(SKU_df['uniqueCol'], input_df['prepped_by_initials']))
try:
    SKU_df['key2'] = SKU_df['uniqueCol'].map(dict_lookup)
except:
    print("Error")

SKU_df['prepped_by_initials'] = SKU_df['key2'].fillna(SKU_df['key1'])

This gives me a dataframe, but the values of prepped_by_initials are still not in order. For example, merchant_SKU_filtered value 1490739 should have the values A, B and C. Instead I'm getting w, A and B, so the values are not being mapped correctly.

Any suggestions? Any help will be appreciated!

I had a chance to look into your code. The problem that causes wrong values, e.g. for 1490739, is the way you create your lookup dicts. zip just puts the 2 columns together row by row, purely by position. Your dict_lookup pairs SKU_df['uniqueCol'] with input_df['prepped_by_initials'], and these two inputs have different lengths, so the mapping is wrong.

Your SKU_df is longer than the input_df and also contains different merchant numbers. What do you want to do with the unique numbers in SKU_df that aren't present in input_df (and so have no prepped_by_initials value)?

IIUC, you can achieve what you want with a pd.merge instead of building the lookup dicts and mapping them afterwards.
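To make the positional pairing concrete, here is a minimal sketch with made-up toy data (the values below are hypothetical and much shorter than the real frames). It only illustrates how zip behaves when the keys come from one frame and the values from another; it is not the asker's full pipeline.

```python
import pandas as pd

# Hypothetical toy frames: SKU_df has one more row than input_df
# and lists its rows in a different order.
SKU_df = pd.DataFrame({"uniqueCol": ["1_0", "2_0", "1_1", "3_0"]})
input_df = pd.DataFrame({
    "uniqueCol": ["1_0", "1_1", "2_0"],
    "prepped_by_initials": ["A", "B", "w"],
})

# zip pairs values by position and stops at the shorter input, so each key
# receives the initials of whatever row sits at the same position.
wrong = dict(zip(SKU_df["uniqueCol"], input_df["prepped_by_initials"]))
print(wrong)  # {'1_0': 'A', '2_0': 'B', '1_1': 'w'}

# Taking keys and values from the same frame keeps every pair aligned.
right = dict(zip(input_df["uniqueCol"], input_df["prepped_by_initials"]))
mapped = SKU_df["uniqueCol"].map(right)
print(mapped.tolist())
```

In the toy data, '2_0' wrongly receives 'B' and '3_0' is silently dropped from the misaligned dict, while the aligned dict leaves '3_0' as NaN because it has no counterpart in input_df.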

#extract Merchant Numbers in Input df as new column
input_df["merchant_SKU_filtered"] = (
    input_df["Merchant SKU"].str.split(" ").apply(lambda x: x[0])).replace(
    "-", "", regex=True).astype(str)

# add suffix to have unique Numbers (in case of duplicates)
suffix = input_df.groupby(input_df["merchant_SKU_filtered"]).cumcount().astype(str)
input_df["uniqueCol"] = input_df["merchant_SKU_filtered"] + "_" + suffix

SKU_df["merchant_SKU_filtered"] = SKU_df["merchant_SKU_filtered"].astype(str)

# left merge keeps every SKU_df row; merge is not in-place, so assign the result
SKU_df = SKU_df.merge(input_df[["uniqueCol", "prepped_by_initials"]],
    on="uniqueCol",
    how="left")

print(SKU_df.head(21))

    merchant_SKU_filtered   uniqueCol   prepped_by_initials
0                 1313030   1313030_0   w
1                 1409085   1409085_0   w
2                 1338516   1338516_0   w
3                 1409093   1409093_0   w
4                 1409085   1409085_1   NaN
5                 1415090   1415090_0   w
6                 1490663   1490663_0   w
7                 1490739   1490739_0   A
8                 1490739   1490739_1   B
9                 1491455   1491455_0   w
10                1490739   1490739_2   C
11                1492511   1492511_0   w
12                1492529   1492529_0   w
13                1571223   1571223_0   w
14                1492529   1492529_1   NaN
15                1571223   1571223_1   NaN
16                1571223   1571223_2   NaN
17                1572056   1572056_0   w
18                18718     18718_0     w
19                2000842   2000842_0   RH
20                19749     19749_0     RH

As you can see, for your example number 1490739 the mapping is right. If you check the NaN rows, you won't find these uniqueCol values in input_df.
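One way to verify which rows have no counterpart is the indicator option of merge, which adds a _merge column. The sketch below uses a small hypothetical subset of the data. The last step, filling the gaps with the first initials listed for the same merchant_SKU_filtered, is only an assumption about what a sensible fallback might be (loosely mirroring the fillna on key1 in the question), not something the original answer prescribes.

```python
import pandas as pd

# Hypothetical subset of the frames from the question.
SKU_df = pd.DataFrame({
    "merchant_SKU_filtered": ["1409085", "1409085", "1490739", "1490739"],
    "uniqueCol": ["1409085_0", "1409085_1", "1490739_0", "1490739_1"],
})
input_df = pd.DataFrame({
    "merchant_SKU_filtered": ["1409085", "1490739", "1490739"],
    "uniqueCol": ["1409085_0", "1490739_0", "1490739_1"],
    "prepped_by_initials": ["w", "A", "B"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
checked = SKU_df.merge(
    input_df[["uniqueCol", "prepped_by_initials"]],
    on="uniqueCol",
    how="left",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "uniqueCol"].tolist()
print(unmatched)  # ['1409085_1']

# Assumed fallback: use the first initials listed for the same SKU.
first_by_sku = (
    input_df.drop_duplicates("merchant_SKU_filtered")
    .set_index("merchant_SKU_filtered")["prepped_by_initials"]
)
checked["prepped_by_initials"] = checked["prepped_by_initials"].fillna(
    checked["merchant_SKU_filtered"].map(first_by_sku)
)
print(checked["prepped_by_initials"].tolist())  # ['w', 'w', 'A', 'B']
```

If a SKU does not appear in input_df at all, the fallback still leaves NaN, which keeps genuinely missing data visible instead of hiding it.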
