
Output of a pandas merge of two data frames does not produce the expected shape

I am merging two data frames using a left merge, but the number of rows in the output does not equal the number of rows in the left data frame. I am expecting the shape of df_bd to be (58233, 10).


You probably have duplicate keys in the right data frame when performing the join, for instance:

import pandas as pd
left_data = {'name':['John','Mark'],'value':[1,5]}
right_data = {'name':['John','Mark','John','Mark'],'children':['Celius','Stingher','Celius','Stingher'],'process_date':['2019-02-05','2019-02-05','2019-03-05','2019-03-05']}
left_df = pd.DataFrame(left_data)
right_df = pd.DataFrame(right_data)
right_df['process_date'] = pd.to_datetime(right_df['process_date'])

This is what they look like:

print(left_df)
   name  value
0  John      1
1  Mark      5
print(right_df)
   name  children process_date
0  John    Celius   2019-02-05
1  Mark  Stingher   2019-02-05
2  John    Celius   2019-03-05
3  Mark  Stingher   2019-03-05

Even though this is a left merge, each name appears more than once in right_df (once per process_date value), so the matching rows of the left dataframe are repeated to accommodate every matching row from the right dataframe.

df = left_df.merge(right_df,how='left',left_on='name',right_on='name')
print(df)
   name  value  children process_date
0  John      1    Celius   2019-02-05
1  John      1    Celius   2019-03-05
2  Mark      5  Stingher   2019-02-05
3  Mark      5  Stingher   2019-03-05
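
To confirm that duplicate keys are what inflates the row count, you can count how often each key occurs in the right dataframe and predict the size of the merge from that. The following is a minimal sketch that rebuilds the example frames above so it runs on its own; the names key_counts, matches and expected_rows are only illustrative.

```python
import pandas as pd

left_df = pd.DataFrame({'name': ['John', 'Mark'], 'value': [1, 5]})
right_df = pd.DataFrame({
    'name': ['John', 'Mark', 'John', 'Mark'],
    'children': ['Celius', 'Stingher', 'Celius', 'Stingher'],
    'process_date': pd.to_datetime(
        ['2019-02-05', '2019-02-05', '2019-03-05', '2019-03-05']
    ),
})

# How many times does each key occur in the right dataframe?
key_counts = right_df['name'].value_counts()

# In a left merge, each left row appears once per matching right row,
# and exactly once when it has no match at all.
matches = left_df['name'].map(key_counts).fillna(0).astype(int)
expected_rows = int(matches.clip(lower=1).sum())

merged = left_df.merge(right_df, how='left', on='name')
print(key_counts.to_dict())
print(expected_rows, len(merged))
```

If expected_rows is larger than len(left_df), the extra rows come from repeated keys on the right side rather than from the merge type.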

One way to filter this is to call .sort_values() on a column that defines the order you want, and then .drop_duplicates(subset=list(left_df), keep='last') (or keep='first', depending on the sort direction). This removes the duplicated rows and keeps the most recent information available:

df = df.sort_values('process_date',ascending=True).drop_duplicates(list(left_df),keep='last')
print(df)
   name  value  children process_date
1  John      1    Celius   2019-03-05
3  Mark      5  Stingher   2019-03-05

The length of the merged dataframe now matches the length of left_df.
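
An alternative, sketched here with the same example data, is to reduce the right dataframe to one row per key before merging, so the row count never grows in the first place. Passing validate='many_to_one' asks pandas to raise pandas.errors.MergeError if the right keys are still not unique. The name latest_right is made up for this illustration, and "most recent process_date wins" is an assumption about which row you want to keep.

```python
import pandas as pd

left_df = pd.DataFrame({'name': ['John', 'Mark'], 'value': [1, 5]})
right_df = pd.DataFrame({
    'name': ['John', 'Mark', 'John', 'Mark'],
    'children': ['Celius', 'Stingher', 'Celius', 'Stingher'],
    'process_date': pd.to_datetime(
        ['2019-02-05', '2019-02-05', '2019-03-05', '2019-03-05']
    ),
})

# Keep only the most recent row per name in the right dataframe.
latest_right = (
    right_df.sort_values('process_date')
            .drop_duplicates(subset='name', keep='last')
)

# validate='many_to_one' raises pandas.errors.MergeError
# if any key is repeated on the right side.
df = left_df.merge(latest_right, how='left', on='name',
                   validate='many_to_one')

assert len(df) == len(left_df)
print(df.shape)
```

Deduplicating first also avoids building the larger intermediate result, which may matter for a left dataframe with tens of thousands of rows.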
