R data.table group by and iterate by two columns
I am new to R and trying to solve the following problem:
There is a table with two columns, books and readers, where books and readers are book and reader IDs, respectively:
> books = c (1,2,3,1,1,2)
> readers = c(30, 10, 20, 20, 10, 30)
> bt = data.table(books, readers)
> bt
books readers
1: 1 30
2: 2 10
3: 3 20
4: 1 20
5: 1 10
6: 2 30
For each pair of books I need to count the number of readers who read both books, using this algorithm:
for each book
    for each reader of the book
        for each other_book in books of the reader
            increment common_reader_count ((book, other_book), cnt)
To implement the above algorithm I need to group this data into two lists: 1) a book list containing the readers of each book and 2) a reader list containing the books read by each reader, such as:
> bookList = list(
+ list(1, list(30, 20, 10)),
+ list(2, list(10, 30)),
+ list(3, list(20))
+ )
>
> readerList = list (
+ list(30, list(1,2)),
+ list(20, list(3,1)),
+ list(10, list(2,1))
+ )
>
Questions:
1) What functions should I use to build these lists from the book table?
2) From bookList and readerList, how do I generate book pairs with the number of readers who read both books? For the bt book table described above, the result should be:
((1, 2), 2)
((1,3), 1)
((2,3), 0)
The order of books in a pair does not matter, so, for example, (1,2) and (2,1) should be reduced to either one.
Please advise on functions and data structures to solve this. Thanks!
Update:
Ideally, as a result I need to get a matrix with book IDs as both rows and columns. Each cell is the count of readers who read both books in the pair. So for the above example the matrix should be:
books | 1 | 2 | 3 |
1 | 1 | 2 | 1 |
2 | 2 | 1 | 0 |
3 | 1 | 0 | 1 |
Which means:
book 1 and 2 are read together by 2 readers
book 1 and 3 are read together by 1 reader
book 2 and 3 are read together by 0 readers
How to build such a matrix?
Here is another option:
library(data.table)
combs <- combn(unique(books), 2)  # Generate all pairs of books, one pair per column
setkey(bt, books)
both.read <- bt[  # Cartesian join all combos to our data
  data.table(books = c(combs), combo.id = c(col(combs))), allow.cartesian = TRUE
][,
  .(  # For each combo, count the readers who show up twice, meaning they've read both books
    read.both = sum(duplicated(readers)),
    book1 = min(books), book2 = max(books)
  ),
  by = combo.id
]
dcast(  # dcast to desired format
  both.read, book1 ~ book2, value.var = "read.both", fun.aggregate = sum
)
Produces:
book1 2 3
1: 1 2 1
2: 2 0 0
Note that by design this only covers non-equivalent combinations (i.e. we don't show both books 1-2 and 2-1, only 1-2, since they are the same pair).
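The pseudocode in the question (for each reader, pair up every two books that reader has read) can also be sketched as a self-join on readers. This is only an illustration, assuming the data.table package is installed and that each (book, reader) row appears at most once; the names pairs, pair_counts, and common_readers are made up for this example.

```r
library(data.table)

# Same sample data as in the question
bt <- data.table(books   = c(1, 2, 3, 1, 1, 2),
                 readers = c(30, 10, 20, 20, 10, 30))

# Self-join on readers: every (book, other book) combination per reader.
# The two book columns come out as books1 and books2 because of `suffixes`.
pairs <- merge(bt, bt, by = "readers", allow.cartesian = TRUE,
               suffixes = c("1", "2"))

# Keep each unordered pair once (books1 < books2), then count readers per pair
pair_counts <- pairs[books1 < books2, .(common_readers = .N),
                     by = .(books1, books2)]
setorder(pair_counts, books1, books2)
print(pair_counts)
```

Pairs with no common reader, such as (2, 3) here, never appear in the join, so they are missing from the result rather than listed with a count of 0.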
Try this:
## gives you a separate vector of readers for each book
list_books <- split(bt$readers, bt$books)
## gives you a separate vector of books for each reader
list_readers <- split(bt$books, bt$readers)
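Once the readers are split by book, the pair counts asked for in the question can be derived by intersecting the reader vectors of each pair. The following is a minimal base R sketch under that assumption; readers_by_book, book_pairs, and pair_counts are names introduced here for illustration.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Readers of each book, keyed by book ID (list names are "1", "2", "3")
readers_by_book <- split(readers, books)

# All unordered book pairs, one pair per column
book_pairs <- combn(names(readers_by_book), 2)

# For each pair, count the readers present in both reader vectors
common <- apply(book_pairs, 2, function(p) {
  length(intersect(readers_by_book[[p[1]]], readers_by_book[[p[2]]]))
})

pair_counts <- data.frame(book1 = book_pairs[1, ],
                          book2 = book_pairs[2, ],
                          common_readers = common)
print(pair_counts)
```

Unlike a join-based approach, this lists every pair, including those with zero common readers. Note that the book IDs come back as character strings because they are taken from the list names.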
Another form of output, as a data.table, giving the number of books read by each reader and the number of readers of each book:
bt[ , .("N Books" = length(unique(books))), by = readers]
bt[ , .("N Readers" = length(unique(readers))), by = books]
For the second part of your question I would use the following:
bt2 <- bt[ , .N, by = .(readers, books)]
library(tidyr)
spread(bt2, key = books, value = "N", fill = 0)
The output is a table that gives 1 if the book was read by reader X and 0 otherwise:
readers 1 2 3
1: 10 1 1 0
2: 20 1 0 1
3: 30 1 1 0
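A reader-by-book 0/1 table like this one can be turned into the book-by-book matrix from the question with a matrix cross product. The sketch below uses base R only and rebuilds the incidence table with table() instead of tidyr, which is a substitution made here for self-containment; incidence and co_read are illustrative names.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# Reader-by-book incidence matrix: rows are readers, columns are books
incidence <- unclass(table(readers, books))
incidence[incidence > 0] <- 1L   # guard against duplicate (reader, book) rows

# t(incidence) %*% incidence: cell [i, j] is the number of readers of both i and j
co_read <- crossprod(incidence)
print(co_read)
```

For the sample data the off-diagonal cells are 2 for books 1 and 2, 1 for books 1 and 3, and 0 for books 2 and 3. The diagonal holds the number of readers of each book (3, 2, 1), which differs from the 1s shown on the diagonal of the matrix in the question.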
Here's a base R solution to test whether each pair was read by the same reader. Someone else can add one for data.table if you absolutely need to use it:
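books = c(1,2,3,1,1,2)
readers = c(30, 10, 20, 20, 10, 30)
bks = data.frame(books, readers)
cmb <- t(combn(unique(books), 2))  # one book pair per row
combos <- as.data.frame(cmb)
bktbl <- t(table(bks))             # rows: readers, columns: books
# For each pair, count the readers who read both books
n_both <- integer(nrow(cmb))
for (i in seq_len(nrow(cmb))) {
  b1 <- as.character(cmb[i, 1])
  b2 <- as.character(cmb[i, 2])
  n_both[i] <- sum(bktbl[, b1] > 0 & bktbl[, b2] > 0)
}
combos$PairRead <- ifelse(n_both > 0, "yes", "no")
combos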
V1 V2 PairRead
1 1 2 yes
2 1 3 yes
3 2 3 no