
How to delete rows which have duplicates and meet another condition in R?

I don't know if this is too specific a question, but I want to remove rows that have duplicates in one column and also meet a condition.

To be specific, among observations that share the same value in the column "host_id" (numeric), I want to delete the one for which the value in the column "reviews_per_month" (numeric) is the lowest.

In other words, as described in my report: "Since one host can have multiple listings, host ids that appear more than once will be filtered. For each such host id, the listing with the most reviews per month is used for analysis."

I've tried many things using duplicated(), filter(), ifelse(), case_when(), etc., but nothing seems to work. Does anyone know how to get started? Thanks in advance!

We can use slice_max(). Group by 'host_id', then slice the row where reviews_per_month is the maximum within each group.

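library(dplyr)

# Keep, for each host_id, the row with the highest reviews_per_month.
# with_ties = FALSE returns exactly one row per host even when values tie.
df %>%
   group_by(host_id) %>%
   slice_max(reviews_per_month, n = 1, with_ties = FALSE) %>%
   ungroup()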
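To see how slice_max() treats ties, here is a minimal sketch on made-up data. The `listings` data frame, the `listing_id` column, and all values are hypothetical; only the column names host_id and reviews_per_month come from the question.

```r
library(dplyr)

# Hypothetical toy data: host 3 has two listings with the same review rate
listings <- data.frame(
  listing_id        = 1:5,
  host_id           = c(1, 1, 2, 3, 3),
  reviews_per_month = c(0.5, 2.0, 1.2, 4.1, 4.1)
)

# Default behaviour: tied rows are all returned
with_ties_kept <- listings %>%
  group_by(host_id) %>%
  slice_max(reviews_per_month, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per host
one_per_host <- listings %>%
  group_by(host_id) %>%
  slice_max(reviews_per_month, n = 1, with_ties = FALSE) %>%
  ungroup()

print(with_ties_kept)
print(one_per_host)
```

If the analysis needs strictly one listing per host, the second form is probably the safer choice; which of the tied rows survives should not be relied on.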

Or, if the goal is only to remove the row with the minimum value:

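# n() == 1 keeps hosts that have only one listing; without it they would be dropped,
# because their single value is also the group minimum
df %>%
   group_by(host_id) %>%
   filter(n() == 1 | reviews_per_month != min(reviews_per_month, na.rm = TRUE)) %>%
   ungroup()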
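Removing only the minimum is not the same as keeping only the maximum once a host has more than two listings. The sketch below illustrates this on hypothetical data (the `listings` data frame and its values are made up). The `n() == 1` guard is an addition here, assumed to be wanted so that hosts with a single listing are not removed.

```r
library(dplyr)

# Hypothetical toy data: host 1 has three listings, host 2 has one
listings <- data.frame(
  host_id           = c(1, 1, 1, 2),
  reviews_per_month = c(0.3, 4.1, 1.7, 1.2)
)

without_min <- listings %>%
  group_by(host_id) %>%
  # drop the lowest value per host, but leave single-listing hosts alone
  filter(n() == 1 | reviews_per_month != min(reviews_per_month, na.rm = TRUE)) %>%
  ungroup()

print(without_min)
```

Host 1 still has two rows afterwards, so this variant only fits the question if each host has at most two listings.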

Try this:

df <- data.frame(x = c("a", "a", "b", "b"), y = c(1, 2, 2, 1)) # Test data

library(dplyr)

# Wrong: keeps the first row for each x, regardless of y
df %>% 
  distinct(x, .keep_all = TRUE)

# Right: sort by y in decreasing order first, so the first row per x has the largest y
df %>% 
  arrange(-y) %>% 
  distinct(x, .keep_all = TRUE)

In a bit more detail: you want exactly one entry per value of your host_id variable (x in the example above), so distinct() is the tool to use. But distinct() keeps only the first row it encounters for each value of the variable passed to it (in your case, host_id), so you have to sort the data in decreasing order first. I use arrange(-y) in my example (arrange(desc(y)) is equivalent); replace y with reviews_per_month.
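Applied to the column names from the question, the sort-then-distinct idea could look like the sketch below. The `listings` data frame and its values are hypothetical stand-ins for the real data.

```r
library(dplyr)

# Hypothetical toy data using the question's column names
listings <- data.frame(
  host_id           = c(10, 10, 20, 30, 30),
  reviews_per_month = c(0.5, 2.0, 1.2, 3.3, 0.1)
)

one_per_host <- listings %>%
  arrange(desc(reviews_per_month)) %>%   # highest review rate first
  distinct(host_id, .keep_all = TRUE)    # keep the first row seen for each host

print(one_per_host)
```

Note that the result is ordered by reviews_per_month rather than by host_id; add a final arrange(host_id) if the original ordering matters.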
