How to convert the items of a list column into their own columns to find cosine similarity in R?
I have a data set that looks like this:
library(tidyverse)
# list(rnorm(25)) has length 1, so tibble() recycles the same vector across all 10 rows
data <- tibble(id = 1:10,
               vectors = list(rnorm(25)))
# A tibble: 10 x 2
      id vectors
   <int> <list>
 1     1 <dbl [25]>
 2     2 <dbl [25]>
 3     3 <dbl [25]>
 4     4 <dbl [25]>
 5     5 <dbl [25]>
 6     6 <dbl [25]>
 7     7 <dbl [25]>
 8     8 <dbl [25]>
 9     9 <dbl [25]>
10    10 <dbl [25]>
I'd like to use this data set to find cosine similarity, where each row represents a document. The cosine function from the lsa package seems like a good and easy way to do this, but it needs each document represented as a column. I'd like to simply do data %>% t() to get my desired result, but that doesn't work. I've also tried "spreading" the list column first using unnest and spread, and I've tried flatten, to no avail. The first line of my desired output would look something like:
1 2 3 4 5 6 7 8 9 10
0.1 0.3 0.7 0.3 0.1 0.1 0.3 0.7 0.3 0.1
If there's a function from another package that handles data in this format, I would by all means just use that instead, though at this point I would also like to figure this out out of curiosity. I've looked at "R - list to data frame", but I'm not sure how to apply that to this situation.
The background is that I've performed doc2vec in Python with gensim, but due to our environment at work, anything interactive I build for a client needs to be in R.
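library(dplyr)
library(tidyr)
# Collapse each list element into one comma-separated string per row
mutate(data, vectors = sapply(vectors, function(x) paste(x, collapse = ","))) %>%
  # Split the strings into one row per value (long format)
  separate_rows(vectors, sep = ",") %>%
  group_by(id) %>%
  # Index each value within its id and convert the strings back to numeric
  mutate(numb = row_number(), vectors = as.numeric(vectors)) %>%
  # One column per index (wide format)
  spread(key = numb, value = vectors)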
# A tibble: 10 x 26
# Groups: id [10]
id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
2 2 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
3 3 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
4 4 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
5 5 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
6 6 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
7 7 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
8 8 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
9 9 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
10 10 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
# ... with 9 more variables: `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
# `24` <dbl>, `25` <dbl>
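The question asks for one column per document so that cosine similarity can be computed. As a rough base R sketch (not part of the original answer), the list of vectors can be bound column-wise into a matrix and the pairwise cosine similarity computed from the definition. The example data and the names vectors, m, norms and cos_sim are assumptions made for illustration; unlike the data above, each document gets its own random vector here.

```r
set.seed(42)
# Hypothetical example data: a list of 10 distinct 25-element document vectors
vectors <- lapply(1:10, function(i) rnorm(25))

# One column per document, one row per vector element (25 x 10)
m <- do.call(cbind, vectors)
colnames(m) <- seq_along(vectors)

# Pairwise cosine similarity between columns:
# cos(a, b) = sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
norms <- sqrt(colSums(m^2))
cos_sim <- crossprod(m) / outer(norms, norms)

dim(cos_sim)
```

With a tibble like the one in the question, do.call(cbind, data$vectors) would build the same kind of matrix. The cosine function from lsa is documented to compare the columns of a matrix, so lsa::cosine(m) would be expected to give comparable values, though that is not checked here.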
I find it's easiest to spread data by first gathering it into a long format, which we do here with separate_rows. The catch is that we first need to turn the list elements in vectors into something separate_rows can work with. We do that using paste with collapse="," inside sapply, so that each row's vector is collapsed separately (otherwise all the lists would be pasted together).
Once we have that, it's just a matter of grouping by id, adding a row-index column (and converting the values back to numeric), and spreading to get the desired format.
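The same long-then-wide idea can also be sketched without going through strings, which avoids converting the numbers to text and back. The version below is an illustrative alternative, not the original answer: it assumes a tidyr version that provides pivot_wider (1.0.0 or later), and the names docs and wide are made up for the example.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Hypothetical example data: one distinct 25-element vector per document
docs <- tibble(id = 1:10, vectors = lapply(1:10, function(i) rnorm(25)))

wide <- docs %>%
  unnest(vectors) %>%                 # long format: one row per (id, element)
  group_by(id) %>%
  mutate(numb = row_number()) %>%     # position of each element within its document
  ungroup() %>%
  pivot_wider(names_from = numb, values_from = vectors)

dim(wide)
```

The result should have 10 rows and 26 columns (id plus one column per vector element), matching the layout produced by the spread approach.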