
Get a new dataframe from the rows that do not match

I have two dataframes organised like this:

df1
col_1 col_2 col_3
  a banana red 
  b apple blue
  c orange green
  d



df2
col_1 col_2 col_3
  a                                         
  b apple blue
  c orange green
  d

Both come from an original dataframe that is complete in all its rows. The two dataframes above are the result of a filter that I applied to the columns "col_2" and "col_3", where every value that does not pass the filter is set to NaN. I would like to compare the two filtered dataframes and isolate the rows that become NaN once I have applied a wider filter.

Example of an expected result:

[IN]df1.merge(df2, on= ["col_1", "col_2"])


[OUT] 
col_1 col_2 col_3
a banana red 

How can I do this? Thank you in advance for your reply.

Let me explain better:

When I re-apply the filter on df1 with higher thresholds, the number of non-null values in those two columns tends to decrease. The original dataframe has about 50,000 rows without any null values. As I apply the filter to the original dataframe and keep raising the thresholds, the null values in those two columns increase, and the non-null values drop from 50,000 to 45,000. I am particularly interested in getting the 5,000 values that were present in the preceding dataframe and were lost when I applied the filter. That is my goal.

I think you are trying to say: keep the rows that are unique (i.e., that do not match exactly in the filtered columns) between the two dataframes. In that case, one way to do that is to concatenate the two dataframes and then remove all duplicated rows.

import pandas as pd

df1 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': ['banana', 'apple', 'orange', None],
    'c3': ['red', 'blue', 'green', None]})

df2 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': [None, 'apple', 'orange', None],
    'c3': [None, 'blue', 'green', None]})

print(df1)
print()
print(df2)

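# Stack both dataframes, then drop every row whose (c2, c3) pair occurs more than once.
# keep=False removes all copies of a duplicate, so only unmatched rows remain.
df_all_together = pd.concat([df1, df2])
df_unique_rows = df_all_together.drop_duplicates(subset=['c2', 'c3'], keep=False)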

print(40*'-')
print(df_unique_rows)

  c1      c2     c3
0  a  banana    red
1  b   apple   blue
2  c  orange  green
3  d    None   None

  c1      c2     c3
0  a    None   None
1  b   apple   blue
2  c  orange  green
3  d    None   None
----------------------------------------
  c1      c2   c3
0  a  banana  red
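
The duplicate-based approach above compares only the values in c2 and c3, so it would also drop two distinct rows that happen to share the same values. As a possible alternative, here is a minimal sketch that compares the two dataframes row by row through the key column. It assumes that c1 is a unique key present in both dataframes; the names value_cols, lost_mask and df_lost are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': ['banana', 'apple', 'orange', None],
    'c3': ['red', 'blue', 'green', None]})

df2 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': [None, 'apple', 'orange', None],
    'c3': [None, 'blue', 'green', None]})

# Assumption: c1 uniquely identifies a row in both dataframes.
left = df1.set_index('c1')
right = df2.set_index('c1').reindex(left.index)

# A row counts as "lost" when it had a value under the first filter
# and is entirely null under the second one.
value_cols = ['c2', 'c3']
lost_mask = (left[value_cols].notna().any(axis=1)
             & right[value_cols].isna().all(axis=1))

df_lost = left[lost_mask].reset_index()
print(df_lost)
```

With the sample data this selects only row a (banana, red). Rows that are null in both dataframes, such as d, are ignored because they were already missing before the second filter.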

