
Find rows of a dataframe that have the same non-unique column values as a column in another dataframe

I have two dataframes, OK_df and Not_OK_df:

import pandas as pd

OK_df = pd.DataFrame({'type_id' : [1,2,3,3], 'count' : [2,7,2,5], 'unique_id' : ['1|2','2|7','3|2','3|5'], 'status' : ['OK','OK','OK','OK']})
Not_OK_df = pd.DataFrame({'type_id' : [1,3,5,6,3,3,3,1], 'count' : [1,1,1,1,3,4,6,3], 'col3' : [1,5,7,3,4,7,2,2], 'unique_id' : ['1|1','3|1','5|1','6|1','3|3','3|4','3|6','1|3'], 'status' : ['Not_OK','Not_OK','Not_OK','Not_OK','Not_OK','Not_OK','Not_OK','Not_OK']})

OK_df:

       type_id  count unique_id status
0        1      2       1|2     OK
1        2      7       2|7     OK
2        3      2       3|2     OK
3        3      5       3|5     OK

Not_OK_df:

  type_id  count  col3 unique_id  status
0        1      1     1       1|1  Not_OK
1        3      1     5       3|1  Not_OK
2        5      1     7       5|1  Not_OK
3        6      1     3       6|1  Not_OK
4        3      3     4       3|3  Not_OK
5        3      4     7       3|4  Not_OK
6        3      6     2       3|6  Not_OK
7        1      3     2       1|3  Not_OK

where,

type_id : Non-unique id for the corresponding type.

count : Number of counts since the first time a type_id was seen.

unique_id : Combination of type_id and count, in the form 'type_id|count'.

col3 : Another column.

status : Either OK or Not_OK.

For a row in OK_df, there is at least one row in Not_OK_df with the same type_id and a count value less than the count value of the OK_df row.

I want to find the Not_OK_df rows that satisfy the above condition, i.e.,

Not_OK_df['type_id'] == OK_df['type_id'] & Not_OK_df['count'] < OK_df['count']
  • I tried using the above condition directly but got the following error:

Reindexing only valid with uniquely valued Index objects

  • I can't set the matching type_id as the index to retrieve rows, since type_id isn't unique. I can't use unique_id as the index either, since its values are unique across both dataframes.
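
As a rough illustration of why the direct comparison cannot work: pandas compares two Series element by element and needs their indexes to line up, while here the two frames have different lengths. The sketch below uses cut-down copies of the two frames (only type_id and count) and catches the exception. The exact message depends on the pandas version and on the index of the real data, so it may not match the one quoted above.

```python
import pandas as pd

# Cut-down copies of the frames from the question (only the columns being compared)
OK_df = pd.DataFrame({'type_id': [1, 2, 3, 3], 'count': [2, 7, 2, 5]})
Not_OK_df = pd.DataFrame({'type_id': [1, 3, 5, 6, 3, 3, 3, 1],
                          'count': [1, 1, 1, 1, 3, 4, 6, 3]})

try:
    # Element-wise comparison of Series whose indexes differ (8 rows vs 4 rows)
    mask = Not_OK_df['type_id'] == OK_df['type_id']
    comparison_error = None
except ValueError as exc:
    comparison_error = exc

print(type(comparison_error).__name__, comparison_error)
```

Note also that & binds more tightly than == and <, so a combined condition of this kind needs each comparison wrapped in parentheses.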

The expected output is:

   type_id  count  col3 unique_id  status
0        1      1     1       1|1  Not_OK
1        3      1     5       3|1  Not_OK
2        3      3     4       3|3  Not_OK
3        3      4     7       3|4  Not_OK

Note: It doesn't contain the rows with unique_id ['3|6', '1|3'], since there's no row in OK_df with the same type_id that has OK_df['count'] > Not_OK_df['count'].

How can I retrieve the required rows? Thanks in advance.

If I understand you correctly, your selection criteria are as follows:

  • The row from Not_OK_df must have the same type_id as a row in OK_df
  • The same row must have a count smaller than the maximum count among the OK_df rows with that type_id

First, create a dictionary holding the maximum count for each unique type_id.

max_counts = OK_df.groupby('type_id')['count'].max().to_dict()
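
To make the intermediate result concrete, here is a minimal self-contained sketch that builds OK_df from the question and computes the dictionary. It selects count before aggregating, which should give the same mapping while skipping the other columns.

```python
import pandas as pd

# OK_df exactly as defined in the question
OK_df = pd.DataFrame({'type_id': [1, 2, 3, 3],
                      'count': [2, 7, 2, 5],
                      'unique_id': ['1|2', '2|7', '3|2', '3|5'],
                      'status': ['OK', 'OK', 'OK', 'OK']})

# Largest count seen for each type_id, as a plain dict keyed by type_id
max_counts = OK_df.groupby('type_id')['count'].max().to_dict()

print(max_counts)  # compares equal to {1: 2, 2: 7, 3: 5}
```

type_id 3 appears twice in OK_df (counts 2 and 5), so only the larger value is kept.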

Then check whether each row in Not_OK_df satisfies your criteria:

Not_OK_df[
    Not_OK_df.apply(
        # True only if the row's type_id exists in OK_df
        # and OK_df holds a larger count for that type_id
        lambda not_ok_row: (
            not_ok_row['type_id'] in max_counts
            and max_counts[not_ok_row['type_id']] > not_ok_row['count']
        ),
        axis=1
    )
]
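
If the row-wise apply turns out to be slow on larger frames, one possible alternative (a sketch, not part of the original answer) is to look up the per-type maximum with Series.map and compare whole columns at once. The names max_ok_count and result are chosen here purely for illustration.

```python
import pandas as pd

# Frames exactly as defined in the question
OK_df = pd.DataFrame({'type_id': [1, 2, 3, 3],
                      'count': [2, 7, 2, 5],
                      'unique_id': ['1|2', '2|7', '3|2', '3|5'],
                      'status': ['OK'] * 4})
Not_OK_df = pd.DataFrame({'type_id': [1, 3, 5, 6, 3, 3, 3, 1],
                          'count': [1, 1, 1, 1, 3, 4, 6, 3],
                          'col3': [1, 5, 7, 3, 4, 7, 2, 2],
                          'unique_id': ['1|1', '3|1', '5|1', '6|1',
                                        '3|3', '3|4', '3|6', '1|3'],
                          'status': ['Not_OK'] * 8})

# Largest OK count per type_id, kept as a Series indexed by type_id
max_counts = OK_df.groupby('type_id')['count'].max()

# Per-row lookup; type_ids that are absent from OK_df (5 and 6) become NaN
max_ok_count = Not_OK_df['type_id'].map(max_counts)

# Any comparison against NaN is False, so unmatched type_ids drop out
result = Not_OK_df[Not_OK_df['count'] < max_ok_count].reset_index(drop=True)

print(result)
```

On the sample data this keeps the rows with unique_id 1|1, 3|1, 3|3 and 3|4, which matches the expected output in the question.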


