Remove exact rows and frequency of rows of a data.frame that are in another data.frame in R

Consider the following two data.frames:

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])

I would like to remove the rows of a1 that are in a2, one occurrence of a1 for each occurrence in a2, so that the result is:

A  B
4  d
5  e
4  d
2  b

Note that one row with 2 b in a1 is retained in the final result, because a1 contains it three times and a2 only twice. Currently I use a loop, which becomes extremely slow because my data.frames have many variables and thousands of rows. Is there a built-in function that produces this result?
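
The requirement can be restated as count arithmetic: a row that occurs n1 times in a1 and n2 times in a2 should occur max(n1 - n2, 0) times in the result. The base R sketch below only illustrates that arithmetic on the example data; the names key1, key2, lv, n1, n2 and remaining are made up for the illustration and are not part of the question.

```r
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)])

# One string per row, so identical rows get identical keys
key1 <- paste(a1$A, a1$B)
key2 <- paste(a2$A, a2$B)

# Count each distinct row of a1 in both data.frames
lv <- unique(key1)
n1 <- as.vector(table(factor(key1, levels = lv)))
n2 <- as.vector(table(factor(key2, levels = lv)))

# Number of copies of each distinct row that should remain
remaining <- setNames(pmax(n1 - n2, 0), lv)
remaining
# "2 b" -> 1, "4 d" -> 2, "5 e" -> 1, all others -> 0
```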

The idea is to add a counter for duplicates to each table, so that every occurrence of a row gets a unique match. data.table is convenient here because counting duplicates is easy (with .N) and because it provides the set operation that is needed (fsetdiff).

library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])

# add an occurrence counter within each group of identical rows
a1[, i := seq_len(.N), by = .(A, B)]
a2[, i := seq_len(.N), by = .(A, B)]

# fsetdiff returns the rows of a1 that have no match in a2
# all = TRUE allows duplicate rows to be returned
fsetdiff(a1, a2, all = TRUE)

#    A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
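
The result above still carries the helper column i, and the `:=` calls modify a1 and a2 in place. A minimal sketch of one way to wrap the same steps so that the inputs stay untouched and the counter is dropped again; the function name row_multiset_diff and the column name occ_id are hypothetical, and the sketch assumes both tables have the same columns in the same order and no existing column called occ_id.

```r
library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3, 2), B = letters[c(1:3, 2)])

# Hypothetical helper, not part of data.table
row_multiset_diff <- function(x, y) {
  # Work on copies so := does not modify the caller's tables
  x <- copy(x)
  y <- copy(y)
  cols <- names(x)
  # Occurrence counter within each group of identical rows
  x[, occ_id := seq_len(.N), by = cols]
  y[, occ_id := seq_len(.N), by = cols]
  res <- fsetdiff(x, y, all = TRUE)
  # Drop the temporary counter before returning
  res[, occ_id := NULL]
  res[]
}

res <- row_multiset_diff(a1, a2)
res
```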

You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches (since R 4.0.0 this is the default).

library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)

## Make temp variables to join on then delete later.
# Create a row number
a1_tmp <- 
    a1 %>%
    group_by(A, B) %>%
    mutate(tmp_id = row_number()) %>%
    ungroup()
# Create a count
a2_tmp <-
    a2 %>%
    group_by(A, B) %>%
    summarise(count = n()) %>%
    ungroup()

## Keep all rows that have no entry in a2 or whose id > the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
    filter(is.na(count) | tmp_id > count) %>%
    select(-tmp_id, -count)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

EDIT

Here is a similar solution that is a little shorter. It does the following: (1) adds a row-number column to both data.frames and joins on it together with A and B; (2) adds a temporary column to a2 (the second data.frame) that shows up as NA after the join for every row that has no match in a2 (i.e. the row is unique to a1).

library(dplyr)

left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number())             %>% ungroup(),
          a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

I think this solution is a little simpler (perhaps only slightly) than the first.
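
A left join followed by keeping only the unmatched rows is what dplyr calls an anti-join, so the same idea can probably be written with anti_join() and without the temporary column. This is a sketch under that assumption, not the answerer's code; add_rn is a hypothetical helper name, and a1 and a2 are rebuilt here so the block runs on its own.

```r
library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)], stringsAsFactors = FALSE)

# Hypothetical helper: number the occurrences of each (A, B) row
add_rn <- function(df) {
  df %>%
    group_by(A, B) %>%
    mutate(rn = row_number()) %>%
    ungroup()
}

# Keep the rows of a1 whose (A, B, rn) combination does not occur in a2
res <- anti_join(add_rn(a1), add_rn(a2), by = c("A", "B", "rn")) %>%
  select(-rn)
res
```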

I guess this is similar to DWal's solution, but in base R:

a1_temp = Reduce(paste, a1)
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))

a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))

a1[!a1_temp %in% a2_temp,]
#  A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
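
For reuse, the steps above could be wrapped in a function. The following is a sketch rather than a tested library routine: occurrence_key and remove_matched_rows are hypothetical names, both data.frames are assumed to have the same columns in the same order, and the separator "\r" is assumed not to occur in the data (a plain space could make rows such as "1 a" + "b" and "1" + "a b" collide).

```r
a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)])

# Hypothetical helper: one key per row, made unique per occurrence
occurrence_key <- function(df) {
  # Paste all columns of a row into one string
  row_key <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
  # Number repeated rows 1, 2, 3, ... within each distinct row value
  occurrence <- ave(seq_along(row_key), row_key, FUN = seq_along)
  paste(row_key, occurrence, sep = "\r")
}

# Hypothetical helper: drop one row of x for every matching row of y
remove_matched_rows <- function(x, y) {
  x[!occurrence_key(x) %in% occurrence_key(y), , drop = FALSE]
}

res <- remove_matched_rows(a1, a2)
res
```

Because the original row names are kept, the printed result should show rows 4, 5, 7 and 8 of a1, as in the output above.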

Here's another solution with dplyr:

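library(dplyr)
# Number each occurrence within its (A, B) group, in a2 as well as in a1
a2_keys <- a2 %>% group_by(A, B) %>% mutate(key = paste(row_number(), A, B)) %>% pull(key)
a1 %>%
  arrange(A) %>%
  group_by(A, B) %>%
  filter(!(paste(row_number(), A, B) %in% a2_keys)) %>%
  ungroup()

Result:

# A tibble: 4 x 2
      A      B
  <dbl> <fctr>
1     2      b
2     4      d
3     4      d
4     5      e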

This way of filtering avoids creating extra columns that you would later have to remove from the final output. Note that this method also sorts the output by A, which may not be what you want.
