
Need help converting a merge function from R to Python, shape of resulting df is the same but losing more rows in Python after dropping duplicates

I believe the merge type in R is a left outer join. The merge I implemented in Python returned a dataframe with the same shape as the merged df in R. However, when I dropped the duplicates (df2.drop_duplicates), 4000 rows were dropped in Python, as opposed to the 50 rows dropped when applying the drop-duplicates function to the post-merge R data frame.

The dataframes I need to merge are df1 and df2.

R:
df2 <- merge(df2[, -which(names(df2) %in% c(column9, column10))], df1[, c(column1, column2, column4, column5)], by.x = c(column1, column2), by.y = c(column2, column4), all.x = T)

Python:
df2 = df2[[column1,column2,column3...column8]].merge(df1[[column1,column2,column4,column5]], how='left', left_on=[column1,column2], right_on=[column2,column4])

df2[column1] and df2[column2] are the columns I want to merge on. They correspond to df1[column2] and df1[column4]: the names differ in df1, but the row values are the same.

My gut tells me that the issue stems from this portion of the code, which I might be misinterpreting: -which(names(df2) %in% c(column9,column10))

Please feel free to send some tips my way if I'm messing up somewhere.

First, subsetting columns with a list of labels is no longer recommended in pandas when some of the labels may be missing: since pandas 1.0 it raises a KeyError. Instead, use reindex to subset columns, since it handles missing labels.
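
To illustrate the difference between the two ways of subsetting, here is a minimal sketch. The frame and the labels "a", "b", "c" are made up for the example and are not from the question; the behavior shown assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical toy frame; "c" is deliberately absent.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
wanted = ["a", "b", "c"]

# Selecting with a list raises KeyError when any label is missing.
try:
    df1[wanted]
    list_subset_failed = False
except KeyError:
    list_subset_failed = True

# reindex returns the requested columns and fills the missing one with NaN.
subset = df1.reindex(wanted, axis="columns")
```

If every requested label exists, both forms return the same columns, so the choice only matters when a label might be absent.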

The pandas equivalent of R's -which(names(df2) %in% c(column9, column10)) can be ~df2.columns.isin([column9, column10]). Because isin returns a boolean array, consider DataFrame.loc to subset:

df2 = (df2.loc[:, ~df2.columns.isin([column9, column10])]
          .merge(df1.reindex([column1, column2, column4, column5], axis='columns'),
                 how='left',
                 left_on=[column1, column2],
                 right_on=[column2, column4])
      )
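
As a rough check of the approach above, the following sketch runs the same pattern on tiny made-up frames. The string values assigned to column1, column2 and so on are hypothetical stand-ins for the question's column-name variables, and the data is invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the column-name variables in the question.
column1, column2, column4, column5 = "id", "code", "alt_code", "value"
column9, column10 = "drop_a", "drop_b"

df2 = pd.DataFrame({
    column1: ["A", "B", "C"],
    column2: [1, 2, 3],
    column9: [0, 0, 0],
    column10: [0, 0, 0],
})
df1 = pd.DataFrame({
    column1: ["x", "y"],
    column2: ["A", "B"],   # holds the same values as df2[column1]
    column4: [1, 2],       # holds the same values as df2[column2]
    column5: [10.0, 20.0],
})

# Boolean mask that is False for the two columns to exclude.
keep = ~df2.columns.isin([column9, column10])
left = df2.loc[:, keep]

merged = left.merge(
    df1.reindex([column1, column2, column4, column5], axis="columns"),
    how="left",
    left_on=[column1, column2],
    right_on=[column2, column4],
)

# Count fully duplicated rows before deciding to drop anything.
n_dupes = int(merged.duplicated().sum())
```

Checking len(merged) and n_dupes on both the R and Python sides before calling drop_duplicates may help narrow down where the row counts start to diverge.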
