Is there an R function for identifying 'n' matches in each row?

I am trying to aggregate my data to find correlations and patterns, and want to discover how and where the data may correlate. Specifically, I want to identify how many times ids (here called 'item') appear together. Is there a way to find how many times each pair of ids appears together in a row?

This is for a larger data.frame that has already been cleaned and aggregated for this particular inquiry. I have tried several aggregation, summation and filter functions from packages like 'data.table', 'dplyr', and 'tidyverse', but cannot quite get what I am looking for.

Here is a minimal reproducible example:

set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:12350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)

# cbind() builds a character matrix, so 'number' is stored as character
sample_data <- data.frame(cbind(number, item), stringsAsFactors = FALSE)

Using the examples here, I expected the output to identify all combinations of names that share a number and show the count n (value), with results resembling something like:

Pair       value
Bob, Tim     2
Bob, Jackie  4
Bob, Angie   0

This output (what I am hoping to get) would tell me that, in the entire df, Bob and Tim have the same number 2 times and Bob and Jackie have the same number 4 times.

But the actual output is:

Error: Each row of output must be identified by a unique combination of keys.

Keys are shared for 2000 rows:
* 9, 23, 37, 164, 170, 180, 211...

Update: I thought of a creative(?) solution, but hope someone can help with speeding it up. I can locate all the numbers (column 1) that are shared between two names using the following:

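library(dplyr)

# Numbers (column 1) recorded for each of the two names
x1 <- sample_data %>% dplyr::filter(item == "Bob")
x2 <- sample_data %>% dplyr::filter(item == "Tim")
Bob <- x1[, 1]
Tim <- x2[, 1]
# Distinct numbers that appear for both Bob and Tim
Reduce(intersect, list(Bob, Tim))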

Output:

[1] "12345" "12348" "12350" "12346" "12349" "12347"

As I said, this is very time consuming: it would require creating a vector for each name and intersecting every combination of them.
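The pairwise intersection above can be generalized without writing one vector per name by hand. The following is a minimal base R sketch, not a definitive answer: the names `numbers_by_item`, `pairs` and `pair_counts` are hypothetical, and it assumes that "appearing together" means two names share at least one row with the same `number`.

```r
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:12350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
sample_data <- data.frame(number, item, stringsAsFactors = FALSE)

# One vector of distinct numbers per name (hypothetical helper object)
numbers_by_item <- lapply(split(sample_data$number, sample_data$item), unique)

# All unordered pairs of names: a 2-row character matrix, one pair per column
pairs <- combn(names(numbers_by_item), 2)

# For each pair, count the distinct numbers the two names have in common
pair_counts <- data.frame(
  Pair = apply(pairs, 2, paste, collapse = ", "),
  value = apply(pairs, 2, function(p) {
    length(intersect(numbers_by_item[[p[1]]], numbers_by_item[[p[2]]]))
  })
)
pair_counts
```

With only six possible numbers and 2000 rows, nearly every pair is likely to share every number, so a wider number range gives more informative counts.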

set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
# Number range widened from the question's 12345:12350
number <- sample(12345:22350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)

sample_data <- data.frame(cbind(number, item), stringsAsFactors = FALSE)

library(tidyverse)
sample_data %>%
  # find out unique rows
  distinct() %>%
  # nest the data frame into nested tibble, so now you have
  # a "data" column, which is a list of small data frames.
  group_nest(number) %>%
  # Here we use purrr::map to modify the list column. We want each 
  # combination counts only once despite the order, so we use sort. 
  mutate(data = map_chr(data, ~paste(sort(.x$item), collapse = ", "))) %>%
  # the last two steps just count the numbers
  group_by(data) %>%
  count()

# A tibble: 21 x 2
# Groups:   data [21]
   data                         n
   <chr>                    <int>
 1 Angie                      336
 2 Angie, Bob                   8
 3 Angie, Bob, Christopher      2
 4 Angie, Bob, Jackie           1
 5 Angie, Christopher          16
 6 Angie, Jackie                9
 7 Angie, Tim                  10
 8 Bob                        331
 9 Bob, Christopher            12
10 Bob, Christopher, Jackie     1
# … with 11 more rows
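The table above counts each complete set of names that share a number. If only pairwise counts are wanted, as in the question's expected output, one option is a co-occurrence matrix built with `crossprod()`. This is a hedged sketch rather than part of the answer above: `incidence`, `co_occurrence` and `pair_counts` are hypothetical names, and a number shared by three names is counted once for each of the three pairs it contains.

```r
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:22350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
sample_data <- data.frame(number, item, stringsAsFactors = FALSE)

# number x item incidence matrix: 1 if the name occurs with that number
incidence <- (table(sample_data$number, sample_data$item) > 0) * 1L

# t(incidence) %*% incidence: entry [i, j] is the count of distinct numbers
# shared by names i and j; the diagonal holds each name's own total
co_occurrence <- crossprod(incidence)

# Reshape the upper triangle into the "Pair / value" layout
idx <- which(upper.tri(co_occurrence), arr.ind = TRUE)
pair_counts <- data.frame(
  Pair = paste(rownames(co_occurrence)[idx[, "row"]],
               colnames(co_occurrence)[idx[, "col"]], sep = ", "),
  value = co_occurrence[idx]
)
pair_counts
```

The matrix form avoids looping over pairs, which tends to matter once there are many distinct names.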

One possible solution

Here's a base R solution that relies on table -> aggregate, plus a potentially inefficient use of apply to paste the names together.

tab_data <-  data.frame(unclass(table(unique(sample_data))))
#table results in columns c(Angie.1, Bob.1, ...) - this makes it look better
names(tab_data) = sort(random.people) 

library(network)
plot.network.default(as.network(tab_data))

tab_data$n <- 1

agg_data <- aggregate(n~., data = tab_data, FUN = length)
agg_data$Pair <- apply(agg_data[, -length(agg_data)], 1, function(x) paste(names(x[x!=0]), collapse = ', '))


agg_data[order(agg_data$Pair), c('Pair', 'n') ]

                            Pair   n
1                          Angie 336
3                     Angie, Bob   8
7        Angie, Bob, Christopher   2
11            Angie, Bob, Jackie   1
5             Angie, Christopher  16
9                  Angie, Jackie   9
15                    Angie, Tim  10
2                            Bob 331
6               Bob, Christopher  12
... truncated ...

In terms of performance, on this relatively small data set it is roughly 8 to 9 times faster than the dplyr solution (comparing medians):

Unit: milliseconds
           expr     min       lq     mean   median       uq      max neval
  base_solution  9.4795  9.65215 10.80984  9.87625 10.32125  46.8230   100
 dplyr_solution 78.6070 81.72155 86.47891 83.96435 86.40495 200.7784   100

Data

set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
# Number range widened from the question's 12345:12350
number <- sample(12345:22350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)

sample_data <- data.frame(number, item, n = 1L, stringsAsFactors = FALSE)
