Match and Remove Rows Based on Condition R

Question

I've got an interesting one for you all.

First, I want to look through the ID column and identify duplicate values. Once those are identified, the code should compare the incomes of the rows sharing an ID and keep only the row with the largest income.

So if three rows have an ID of 2, it should find the one with the highest income and keep that row.

 ID Income
  1  98765
  2   3456
  2     67
  2   5498
  5     23
  6     98
  7   5645
  7  67871
  9 983754
 10    982
 10   2374
 10    875
 10   4744
 11   6853

I know it's as easy as subsetting based on a condition, but I don't know how to remove rows based on whether the income in one row is greater than in another (only when the IDs match).

I was thinking of using an ifelse statement to create a new column to identify duplicates (through subsetting or not) then use the new column's values to ifelse again to identify the larger income. From there I can just subset based on the new columns I have created.

Is there a faster, more efficient way of doing this?

The outcome should look like this.

 ID Income
  1  98765
  2   5498
  5     23
  6     98
  7  67871
  9 983754
 10   4744
 11   6853

Thank you

Answer 1

We can slice the rows by checking the highest value in 'Income' grouped by 'ID'

library(dplyr)
df1 %>%
  group_by(ID) %>%
  slice(which.max(Income))

Or using data.table

library(data.table)
setDT(df1)[, .SD[which.max(Income)], by = ID]

Or with base R

df1[with(df1, ave(Income, ID, FUN = max) == Income),]
#     ID Income
#1   1  98765
#4   2   5498
#5   5     23
#6   6     98
#8   7  67871
#9   9 983754
#13 10   4744
#14 11   6853

data

df1 <- structure(list(ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 
10L, 10L, 10L, 11L), Income = c(98765L, 3456L, 67L, 5498L, 23L, 
98L, 5645L, 67871L, 983754L, 982L, 2374L, 875L, 4744L, 6853L)), 
class = "data.frame", row.names = c(NA, 
-14L))
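
One difference between the approaches above only shows up when an ID has two rows tied for the highest income: which.max returns the first maximum, so the dplyr and data.table versions keep one row per ID, while the ave comparison keeps every tied row. A minimal base R sketch of that behaviour, using a small made-up data frame (tie_df and its Source column are hypothetical, not the question's data):

```r
# Hypothetical data: ID 2 has two rows tied at the maximum income.
tie_df <- data.frame(ID = c(1L, 2L, 2L, 2L),
                     Income = c(10L, 50L, 50L, 7L),
                     Source = c("a", "b", "c", "d"))

# ave() comparison: keeps every row whose Income equals its group's maximum.
all_max <- tie_df[with(tie_df, ave(Income, ID, FUN = max) == Income), ]

# which.max() per group: keeps only the first maximum within each ID.
first_idx <- unlist(lapply(split(seq_len(nrow(tie_df)), tie_df$ID),
                           function(i) i[which.max(tie_df$Income[i])]))
first_max <- tie_df[first_idx, ]

nrow(all_max)    # tied rows for ID 2 are both kept
nrow(first_max)  # one row per ID
```

The question's data has no ties, so all three approaches give the same result there.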

Answer 2

Use order with duplicated (base R):

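# Sort by ID, then by Income in descending order
df <- df[order(df$ID, -df$Income), ]
# Keep the first (highest-income) row for each ID
df[!duplicated(df$ID), ]
#    ID Income
# 1   1  98765
# 4   2   5498
# 5   5     23
# 6   6     98
# 8   7  67871
# 9   9 983754
# 13 10   4744
# 14 11   6853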
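
Negating a column inside order() only works for numeric columns. As a possible alternative, order() also accepts a vector for decreasing when method = "radix", which avoids the negation. The sketch below rebuilds the question's data as df so that it runs on its own, and stores the result in a new object, top (a name chosen for this example), rather than overwriting df:

```r
# Question's data, rebuilt so the example is self-contained.
df <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# Sort by ID ascending and Income descending; a per-key `decreasing`
# vector requires method = "radix".
sorted <- df[order(df$ID, df$Income, decreasing = c(FALSE, TRUE),
                   method = "radix"), ]

# The first row of each ID is now its highest-income row.
top <- sorted[!duplicated(sorted$ID), ]
top
```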

Answer 3

Here is another dplyr method. We can arrange the column and then slice the data frame for the first row.

library(dplyr)

df2 <- df %>%
  arrange(ID, desc(Income)) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
df2
# # A tibble: 8 x 2
#      ID Income
#   <int>  <int>
# 1     1  98765
# 2     2   5498
# 3     5     23
# 4     6     98
# 5     7  67871
# 6     9 983754
# 7    10   4744
# 8    11   6853

DATA

df <- read.table(text = "ID Income
1   98765
2   3456
2   67
2   5498
5   23
6   98
7   5645
7   67871
9   983754
10  982
10  2374
10  875
10  4744
11  6853",
                 header = TRUE)

Answer 4

group_by and summarise from dplyr would work too:

library(dplyr)

df1 %>%
  group_by(ID) %>%
  summarise(Income = max(Income))

     ID  Income
  <int>   <dbl>
1     1  98765.
2     2   5498.
3     5     23.
4     6     98.
5     7  67871.
6     9 983754.
7    10   4744.
8    11   6853.
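
Note that summarising returns only the grouping column and the computed maximum, so any other columns in the data frame would be dropped; for the two-column data in the question that makes no difference. For comparison, here is a base R sketch of the same idea with aggregate(), using the df1 from Answer 1 rebuilt inline (max_income is just a name picked for this example):

```r
# Question's data, rebuilt so the example is self-contained.
df1 <- data.frame(
  ID = c(1L, 2L, 2L, 2L, 5L, 6L, 7L, 7L, 9L, 10L, 10L, 10L, 10L, 11L),
  Income = c(98765L, 3456L, 67L, 5498L, 23L, 98L, 5645L, 67871L,
             983754L, 982L, 2374L, 875L, 4744L, 6853L)
)

# One row per ID, holding the maximum Income within that ID.
max_income <- aggregate(Income ~ ID, data = df1, FUN = max)
max_income
```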

Answer 5

Using sqldf: group by ID and select the maximum Income for each group

library(sqldf)
sqldf("select ID,max(Income) from df group by ID")

Output:

  ID max(Income)
1  1       98765
2  2        5498
3  5          23
4  6          98
5  7       67871
6  9      983754
7 10        4744
8 11        6853

5 answers

solution1 3 ACCEPTED 2018-09-07 16:26:11
solution2 3 2018-09-07 16:31:00
solution3 3 2018-09-09 13:38:56
solution4 2 2018-09-07 16:44:01
solution5 2 2018-09-07 16:45:37
