
Return duplicates in a list based on 2 criteria

I have a list that contains 2 data sets.

a = data.frame(c(1,1,1,1,1,2,2,2,2,2), c("a","b", "c", "d","e","e","f", "g", "h","i"))
colnames(a) = c("Numbers","Letters")
c = data.frame(c(3,3,3,3,3,4,4,4,4,4), c("q","r", "s", "t","u","u","v", "w", "x","y"))
colnames(c) = c("Numbers","Letters")
my.list = list(a,c)
my.list

I am interested in returning only the letters that are shared by all the unique numbers within each data set. The desired result is the following:

new_a = data.frame(c(1,2),c("e","e"))
new_c = data.frame(c(3,4),c("u","u"))
colnames(new_a) = c("Numbers","Letters")
colnames(new_c) = c("Numbers","Letters")
my.new.list = list(new_a,new_c)
my.new.list

As you can see, letter "e" is the only letter that numbers 1 and 2 share in data set 1, while letter "u" is the only letter shared by numbers 3 and 4 in data set 2.

I am trying to do this for a very large list. To give you an idea of my real problem, I have a list where each element is a state. Within each state, I have multiple asset managers or "accounts", and each account holds multiple tickers. I am trying to find the tickers that the accounts have in common in each geographical location. In the example above, the numbers would be the accounts, the letters would be the tickers, and the two data sets contained in the list would be two different states. I hope that helps frame my problem.

Thanks!

library(data.table)
a <- as.data.table(a)
a[, if(.N > 1) .SD, by = list(Letters)]
#    Letters Numbers
# 1:       e       1
# 2:       e       2

Explanation: take table a, group by the column Letters ( by = list(Letters) ), and return the subset of data for each group ( .SD ) only when the number of rows ( .N ) in that group is greater than 1. Note that .N counts rows, so this relies on each letter appearing at most once per number.
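The same grouping idea can be applied to every element of the list. The following is only a sketch, not part of the original answer: it assumes the data.table package is installed, rebuilds the question's data under the hypothetical names df1 and df2, and swaps the .N > 1 test for a comparison of distinct Numbers (uniqueN) so that repeated rows or more than two numbers do not change the result.

```r
library(data.table)

# Rebuild the question's data (df1/df2 are illustrative names)
df1 <- data.frame(Numbers = c(1,1,1,1,1,2,2,2,2,2),
                  Letters = c("a","b","c","d","e","e","f","g","h","i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Numbers = c(3,3,3,3,3,4,4,4,4,4),
                  Letters = c("q","r","s","t","u","u","v","w","x","y"),
                  stringsAsFactors = FALSE)
my.list <- list(df1, df2)

common.dt <- lapply(my.list, function(df) {
  dt <- as.data.table(df)
  # Number of distinct Numbers in this data set
  total <- uniqueN(dt$Numbers)
  # Keep a Letters group only if it occurs under every distinct Number
  dt[, if (uniqueN(Numbers) == total) .SD, by = Letters]
})
common.dt
```

With two numbers per data set this keeps the same rows as the .N > 1 version; the difference only matters when a letter can repeat within a number or when there are more than two numbers.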

We can use Reduce with intersect in base R:

 lapply(my.list, function(x) x[with(x, Letters %in%
                 Reduce(intersect, split(Letters, Numbers))),])
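To see what the one-liner does, it may help to run the pieces separately on the first data set. This is a step-by-step sketch in base R only; the intermediate names by.number and common are illustrative and do not appear in the answer.

```r
# First data set from the question, with Letters kept as character
a <- data.frame(Numbers = c(1,1,1,1,1,2,2,2,2,2),
                Letters = c("a","b","c","d","e","e","f","g","h","i"),
                stringsAsFactors = FALSE)

# Step 1: one character vector of Letters per distinct Number
by.number <- split(a$Letters, a$Numbers)

# Step 2: fold intersect() over those vectors, leaving only the
# letters present under every Number
common <- Reduce(intersect, by.number)

# Step 3: keep the rows whose letter is in the common set
a[a$Letters %in% common, ]
```

Because Reduce folds over however many vectors split returns, this works for any number of distinct Numbers, not just two.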

Or using dplyr:

 library(dplyr)
 lapply(my.list, function(x)
                    x %>% 
                        group_by(Letters) %>% 
                        filter(n_distinct(Numbers)==2))

Instead of keeping a list, the data can be combined into a single data set with an additional grouping column, and the same filter applied:

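 library(dplyr)
 # bind_rows() stacks the list elements and stores each element's index in `group`
 # (older tidyr versions accepted unnest(my.list, group) for this step)
 bind_rows(my.list, .id = "group") %>%
            group_by(group, Letters) %>%
            filter(n_distinct(Numbers)==2)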

If we don't know the number of unique Numbers in each list element:

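  # stack the list elements, keeping each element's index in `group`
  bind_rows(my.list, .id = "group") %>% 
              group_by(group) %>% 
              # n = number of distinct Numbers in this group
              mutate(n= n_distinct(Numbers)) %>%
              # .add = TRUE replaces the deprecated add = TRUE (dplyr >= 1.0.0)
              group_by(Letters, .add=TRUE) %>% 
              filter(n_distinct(Numbers)==n) %>%
              select(-n)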
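For comparison, the single-table approach can also be sketched without any packages. This is an assumption-laden alternative rather than the answer's method: it uses base R's ave() to count distinct Numbers per group and per group/letter pair, and the names df1, df2, stacked, n.total and n.letter are illustrative.

```r
# Rebuild the question's data (df1/df2 are illustrative names)
df1 <- data.frame(Numbers = c(1,1,1,1,1,2,2,2,2,2),
                  Letters = c("a","b","c","d","e","e","f","g","h","i"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Numbers = c(3,3,3,3,3,4,4,4,4,4),
                  Letters = c("q","r","s","t","u","u","v","w","x","y"),
                  stringsAsFactors = FALSE)
my.list <- list(df1, df2)

# Stack the list into one data frame, tagging rows with the list index
stacked <- do.call(rbind, Map(function(df, g) cbind(group = g, df),
                              my.list, seq_along(my.list)))

count.distinct <- function(v) length(unique(v))

# Distinct Numbers per group, repeated on every row of that group
n.total <- ave(stacked$Numbers, stacked$group, FUN = count.distinct)
# Distinct Numbers per (group, Letters) pair
n.letter <- ave(stacked$Numbers, stacked$group, stacked$Letters,
                FUN = count.distinct)

# Keep letters that occur under every Number of their group
common.rows <- stacked[n.letter == n.total, ]
common.rows
```

In the question's real setting, group would correspond to the state, Numbers to the account, and Letters to the ticker.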
