简体   繁体   中英

Remove exact rows and frequency of rows of a data.frame that are in another data.frame in r

Consider the following two data.frames:

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])

I would like to remove the exact rows of a1 that are in a2 so that the result should be:

A  B
4  d
5  e
4  d
2  b

Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?

The idea is, add a counter for duplicates to each file, so you can get a unique match for each occurrence of a row. Data table is nice because it is easy to count the duplicates (with .N ), and it also gives the necessary function ( fsetdiff ) for set operations.

library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])

# add counter for duplicates
a1[, i := 1:.N, .(A,B)]
a2[, i := 1:.N, .(A,B)]

# setdiff gets the exception
# "all = T" allows duplicate rows to be returned
fsetdiff(a1, a2, all = T)

#    A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3

You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.

library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)

## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <- 
    a1 %>%
    group_by(A, B) %>%
    mutate(tmp_id = row_number()) %>%
    ungroup()
# Create a count
a2_tmp <-
    a2 %>%
     group_by(A, B) %>%
     summarise(count = n()) %>%
     ungroup()

## Keep all that have no entry int a2 or the id > the count (i.e. used up a2 entries).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
    ungroup() %>% filter(is.na(count) | tmp_id > count) %>%
    select(-tmp_id, -count)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

EDIT

Here is a similar solution that is a little shorter. This does the following: (1) add a column for row number to join both data.frame items (2) a temporary column in a2 (2nd data.frame ) that will show up as null in the join to a1 (ie indicates it's unique to a1 ).

library(dplyr)

left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number())             %>% ungroup(),
          a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

I think this solution is a little simpler (perhaps very little) than the first.

I guess this is similar to DWal's solution but in base R

a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))

a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))

a1[!a1_temp %in% a2_temp,]
#  A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b

Here's another solution with dplyr :

library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))

Result:

# A tibble: 4 x 2
# Groups:   A [3]
      A      B
  <dbl> <fctr>
1     2      b
2     4      d
3     4      d
4     5      e

This way of filtering avoids creating extra unwanted columns that you have to later remove in the final output. This method also sorts the output. Not sure if it's what you want.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM