Merge pandas doesn't work, it looks like concat

I've been working with two dataframes (info_clients and metadata_clients) that have user_id and id_wp columns, respectively, as their associated keys. I loaded info_clients into a SQL table and retrieved the associated PK, then merged the two dataframes on user_id (left side) and id_wp (right side).

info_clients: (72232, 1)

 user_id
0       0
1       1
2       4
3       5
4   39784

metadata_clients: (72232, 2)

        id  id_wp
0  1158426      0
1  1158427      1
2  1158428      4
3  1158429      5
4  1158430  39784

I used this:

merge = pd.merge( info_clients, metadata_clients, left_on=['user_id'], 
                            right_on=['id_wp'], how='left')

But it doesn't work as I expected. I got this result:

  user_id  cliente_fk  id_wp
0       0     1158426      0
1       1     1158427      1
2       4     1158428      4
3       5     1158429      5
4   39784     1158430  39784
Datamerge shape: (126680, 3)

When I save the info_clients data into the SQL table, I verify the data and I have 72232 clients saved. I don't have nulls or NaN values; I cleaned the data and checked its dtypes, and both keys are int64.

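You have a situation where you have duplicates. The merged frame has 126680 rows against 72232 in info_clients, which a left merge can only produce when key values repeat in metadata_clients.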
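As a minimal sketch of why duplicate keys matter: a left merge emits one row per matching pair, so a key repeated on the right side inflates the result. The values here are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy frames; only the column names match the question.
info_clients = pd.DataFrame({"user_id": [0, 1, 4]})
metadata_clients = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "id_wp": [0, 1, 1, 4],  # key 1 appears twice on the right side
})

merged = pd.merge(info_clients, metadata_clients,
                  left_on="user_id", right_on="id_wp", how="left")

# The left frame has 3 rows, but user_id 1 matches two right rows.
print(len(info_clients), len(merged))  # 3 4
```

If the real data behaves the same way, the extra rows in the merged frame would correspond to repeated id_wp values.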

No, I don't have duplicates. I removed them in a previous step, using:
data.drop_duplicates(keep='first')

I don't know if data is your first dataframe (info_clients) or your second (metadata_clients), but if you drop duplicates without specifying a subset of columns, only rows that are identical in every column are removed. It's likely that no entire row is duplicated, so nothing is dropped and the repeated keys remain. You should try:

data = data.drop_duplicates('user_id', keep='first')

# OR

data = data.drop_duplicates('id_wp', keep='first')
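The difference between the two calls can be sketched on a small, made-up frame (the values are hypothetical; the column names follow the question). Here the key repeats but the full rows differ, so the whole-row check finds nothing to drop:

```python
import pandas as pd

# Hypothetical data: id_wp 1 repeats, but the rows differ in the id column.
data = pd.DataFrame({"id": [10, 11, 12], "id_wp": [0, 1, 1]})

# Without a subset, only rows identical in every column count as duplicates.
whole_row = data.drop_duplicates(keep="first")

# With a subset, rows sharing the same key count as duplicates.
by_key = data.drop_duplicates(subset="id_wp", keep="first")

print(len(whole_row))  # 3
print(len(by_key))     # 2
```

Note also that drop_duplicates returns a new frame rather than modifying data in place (unless inplace=True is passed), so the result has to be assigned back, as in the lines above.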

You should try to debug with value_counts:

data.value_counts('user_id')

# OR

data.value_counts('id_wp')
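One possible way to turn that check into something concrete, sketched on made-up data: filter the counts down to keys that occur more than once, and optionally let the merge itself verify uniqueness through its validate parameter, which raises a MergeError when the keys are not unique. The frames below are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy frames; only the column names match the question.
info_clients = pd.DataFrame({"user_id": [0, 1, 4]})
metadata_clients = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "id_wp": [0, 1, 1, 4],
})

# Keep only the keys that occur more than once on the right side.
counts = metadata_clients["id_wp"].value_counts()
repeated = counts[counts > 1]
print(repeated)

# Ask merge to check that the keys are unique on both sides.
try:
    pd.merge(info_clients, metadata_clients,
             left_on="user_id", right_on="id_wp",
             how="left", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(keys_unique)  # False
```

An empty repeated series for both key columns would suggest the extra rows come from somewhere else.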

