
R data.table group by and iterate by two columns

I am new to R and trying to solve the following problem:

There is a table with two columns, books and readers, where books and readers are book and reader IDs, respectively:

> library(data.table)
> books = c(1, 2, 3, 1, 1, 2)
> readers = c(30, 10, 20, 20, 10, 30)
> bt = data.table(books, readers)
> bt
   books readers
1:     1      30
2:     2      10
3:     3      20
4:     1      20
5:     1      10
6:     2      30

For each pair of books I need to count the number of readers who read both of them, using this algorithm:

for each book
  for each reader of the book
    for each other_book in books of the reader
      increment common_reader_count ((book, other_book), cnt)

To implement the above algorithm I need to group this data into two lists: 1) a book list containing the readers of each book and 2) a reader list containing the books read by each reader, such as:

> bookList = list( 
+ list(1, list(30, 20, 10)),
+         list(2, list(10, 30)),
+         list(3, list(20))
+       )
> 
> readerList = list (
+ list(30, list(1,2)),
+ list(20, list(3,1)),
+ list(10, list(2,1))
+ )
>  

Questions:

1) What functions should I use to build these lists from the book table?

2) From bookList and readerList, how do I generate book pairs with the number of readers who read both books? For the bt book table described above, the result should be:

((1, 2), 2)
((1,3), 1)
((2,3), 0)  

The order of books in a pair does not matter, so, for example, (1,2) and (2,1) should be reduced to either one.

Please advise on functions and data structures to solve this. Thanks!

Update:

Ideally, as a result I need to get a matrix with book IDs as both rows and columns. Each cell is the count of readers who read both of the books in the pair. So for the above example the matrix should be:

books | 1 | 2 | 3 |
   1  | 1 | 2 | 1 |
   2  | 2 | 1 | 0 |
   3  | 1 | 0 | 1 |

   Which means:

   book 1 and 2 are read together by 2 readers 
   book 1 and 3 are read together by 1 reader
   book 2 and 3 are read together by 0 readers

How do I build such a matrix?

Here is another option:

combs <- combn(unique(bt$books), 2)  # Generate combos of books, one pair per column
setkey(bt, books)
both.read <- bt[                # Cartesian join all combos to our data
  data.table(books=c(combs), combo.id=c(col(combs))), allow.cartesian=TRUE
][,
  .(                            # For each combo, count readers that show up twice, meaning they've read both books
    read.both=sum(duplicated(readers)),
    book1=min(books), book2=max(books)
  ),
  by=combo.id
]
dcast(                          # dcast to desired format
  both.read, book1 ~ book2, value.var="read.both", fun.aggregate=sum
)

Produces:

   book1 2 3
1:     1 2 1
2:     2 0 0

Note that by design this only covers non-equivalent combinations (i.e. we don't show books 1-2 and 2-1, only 1-2, since they are the same).
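A related way to get the same counts, sketched here on the assumption that each (book, reader) row appears only once, is to join the table to itself on readers and count the rows per book pair. The names pairs and pair_counts are illustrative and not from the answer above.

```r
library(data.table)

bt <- data.table(books   = c(1, 2, 3, 1, 1, 2),
                 readers = c(30, 10, 20, 20, 10, 30))

# Self-join on readers: one row for every two books that share a reader.
# The duplicated books column gets the default suffixes .x and .y.
pairs <- merge(bt, bt, by = "readers", allow.cartesian = TRUE)

# Keep each unordered pair once (book1 < book2), then count readers per pair
pair_counts <- pairs[books.x < books.y, .N,
                     by = .(book1 = books.x, book2 = books.y)]
pair_counts
```

Pairs with no common reader, such as books 2 and 3 here, simply do not appear in the result, so they would have to be filled in with 0 separately if needed.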

Try this:

## gives you a separate vector of readers for each book
list_books <- split(bt$readers, bt$books)

## gives you a separate vector of books for each reader
list_readers <- split(bt$books, bt$readers)
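Once the readers are split by book, the pair counts asked for in the question can be derived by intersecting the reader vectors of every book pair. The following is only a minimal sketch using base R; readers_by_book, pairs, common and pair_counts are illustrative names, not part of the answer above.

```r
# Sample data from the question
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Readers of each book, as a list named by book ID
readers_by_book <- split(readers, books)

# All unordered book pairs, one pair per column
pairs <- combn(names(readers_by_book), 2)

# Number of readers shared by the two books in each pair
common <- apply(pairs, 2, function(p) {
  length(intersect(readers_by_book[[p[1]]], readers_by_book[[p[2]]]))
})

pair_counts <- data.frame(book1 = pairs[1, ], book2 = pairs[2, ], common = common)
pair_counts
```

Because combn() only generates each unordered pair once, (1,2) and (2,1) are not both listed, and pairs with no common reader are kept with a count of 0.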

Another form of output, as a data.table, giving the number of books read by each reader and the number of readers of each book:

bt[ , .("N Books" = length(unique(books))), by = readers]
bt[ , .("N Readers" = length(unique(readers))), by = books]

For the second part of your question I would use the following:

bt2 <- bt[ , .N, by = .(readers, books)]
library(tidyr)
spread(bt2, key = books, value = "N", fill = 0)

The output is a table that gives 1 if the book was read by reader X and 0 otherwise:

   readers 1 2 3
1:      10 1 1 0
2:      20 1 0 1
3:      30 1 1 0
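A reader-by-book 0/1 table like this one can be turned into the book-by-book matrix from the question's update with a cross product. This is a sketch in base R, assuming each (reader, book) combination occurs at most once; incidence and co_read are illustrative names.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Reader-by-book incidence table: 1 if the reader read the book, else 0
incidence <- table(readers, books)

# t(incidence) %*% incidence: cell [i, j] counts readers who read both i and j
co_read <- crossprod(incidence)
co_read
```

The off-diagonal cells match the counts requested in the question (2 for books 1 and 2, 1 for books 1 and 3, 0 for books 2 and 3). The diagonal holds the number of readers of each book, which differs from the 1s shown in the question's example matrix.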

Here's a base R solution to test whether each pair of books was read by at least one common reader. Someone else can add one for data.table if you absolutely need to use it:

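books <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)
bks <- data.frame(books, readers)

cmb <- t(combn(unique(books), 2))  # one row per book pair
combos <- as.data.frame(cmb)
bktbl <- t(table(bks))             # readers x books incidence table

# For each pair, check whether at least one reader read both books.
# Columns are looked up by name so book IDs need not be 1..n.
both <- sapply(seq_len(nrow(cmb)), function(i) {
  any(bktbl[, as.character(cmb[i, 1])] > 0 & bktbl[, as.character(cmb[i, 2])] > 0)
})
combos$PairRead <- ifelse(both, "yes", "no")
combos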
  V1 V2 PairRead
1  1  2      yes
2  1  3      yes
3  2  3       no
