
Match and Remove Rows Based on Condition R

I've got an interesting one for you all.

I'm looking to first look through the ID column and identify duplicate values. Once those are identified, the code should go through the incomes of the duplicated IDs and keep the row with the larger income.

So if there are three rows with an ID of 2, it will look for the one with the highest income and keep that row.

ID  Income
1   98765
2   3456
2   67
2   5498
5   23
6   98
7   5645
7   67871
9   983754
10  982
10  2374
10  875
10  4744
11  6853

I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one row is greater than the income in another (only when the IDs match).

I was thinking of using an ifelse statement to create a new column that identifies duplicates (through subsetting or not), then using ifelse again on the new column's values to identify the larger income. From there I can just subset based on the new columns I have created.

Is there a faster, more efficient way of doing this?

The outcome should look like this:

ID  Income
1   98765
2   5498
5   23
6   98
7   67871
9   983754
10  4744
11  6853

Thank you

We can slice the rows by picking the highest value of 'Income' within each 'ID' group:

library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.max(Income))

Or using data.table:

library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]

Or with base R:

df1[with(df1, ave(Income, ID, FUN = max) == Income),]
#     ID Income
#1   1  98765
#4   2   5498
#5   5     23
#6   6     98
#8   7  67871
#9   9 983754
#13 10   4744
#14 11   6853
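
One difference between these approaches is worth a quick sketch: the `ave()` comparison keeps every row whose income equals the group maximum, so tied incomes yield more than one row per ID, while `which.max()` returns only the first maximum. The data frame `ties` below is a made-up example, not part of the question's data.

```r
# Hypothetical data with a tie in ID 2 (not from the question)
ties <- data.frame(ID = c(1L, 2L, 2L, 2L),
                   Income = c(10L, 50L, 50L, 7L))

# The ave() comparison keeps both tied rows for ID 2
keep_all <- ties[with(ties, ave(Income, ID, FUN = max) == Income), ]

# which.max() within each ID keeps only the first maximum
first_idx <- unlist(lapply(split(seq_len(nrow(ties)), ties$ID),
                           function(i) i[which.max(ties$Income[i])]))
keep_first <- ties[first_idx, ]

nrow(keep_all)    # rows kept by the ave() comparison
nrow(keep_first)  # rows kept by which.max()
```

If exactly one row per ID is required, the `which.max()` variants are probably the safer choice.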

Data

df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L, 
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)), 
class = "data.frame", row.names = c(NA, 
-14L))

Use order with duplicated (base R):

df <- df[order(df$ID, -df$Income), ]
df[!duplicated(df$ID), ]
#    ID Income
# 1   1  98765
# 4   2   5498
# 5   5     23
# 6   6     98
# 8   7  67871
# 9   9 983754
# 13 10   4744
# 14 11   6853
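
A variation on the same idea, as a minimal sketch: the sort can be done without overwriting `df`, and without negating the income column, by passing a vector to `decreasing` (which requires `method = "radix"`). The names `idx`, `sorted`, and `result` are chosen here for illustration only.

```r
# Data from the question
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Sort by ID ascending and Income descending, leaving df untouched;
# a vector-valued `decreasing` is only supported by the radix method
idx <- order(df$ID, df$Income, decreasing = c(FALSE, TRUE), method = "radix")
sorted <- df[idx, ]

# The first row seen for each ID is now the one with the largest Income
result <- sorted[!duplicated(sorted$ID), ]
result
```

Because `duplicated()` flags every occurrence after the first, this keeps exactly one row per ID even when incomes are tied.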

Here is another dplyr method. We can arrange the rows by ID and descending Income, then slice the first row of each group.

library(dplyr)

df2 <- df %>%
  arrange(ID, desc(Income)) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
df2
# # A tibble: 8 x 2
#      ID Income
#   <int>  <int>
# 1     1  98765
# 2     2   5498
# 3     5     23
# 4     6     98
# 5     7  67871
# 6     9 983754
# 7    10   4744
# 8    11   6853

Data

df <- read.table(text = "ID Income
1   98765
2   3456
2   67
2   5498
5   23
6   98
7   5645
7   67871
9   983754
10  982
10  2374
10  875
10  4744
11  6853",
                 header = TRUE)

group_by and summarise from dplyr would work too:

library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(Income = max(Income))

     ID  Income
  <int>   <dbl>
1     1  98765.
2     2   5498.
3     5     23.
4     6     98.
5     7  67871.
6     9 983754.
7    10   4744.
8    11   6853.
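
For comparison, a rough base R counterpart of the grouped summary is `aggregate()`. This is only a sketch of the same idea, not part of the answer above. Like `summarise()`, it returns just the grouping column and the summarised column, so any other columns in the data frame would be dropped.

```r
# Data from the question
df1 <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# One row per ID holding the largest Income in that group
agg <- aggregate(Income ~ ID, data = df1, FUN = max)
agg
```

When the remaining columns of the winning row matter, the slicing or `duplicated()` approaches are likely a better fit.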

Using sqldf: group by ID and select the maximum Income for each group:

library(sqldf)
sqldf("select ID,max(Income) from df group by ID")

Output:

  ID max(Income)
1  1       98765
2  2        5498
3  5          23
4  6          98
5  7       67871
6  9      983754
7 10        4744
8 11        6853
