How to convert the items of a list column into their own columns to find cosine similarity in R?

I have a data set that looks like this:

library(tidyverse)

data <- tibble(id = 1:10,
               vectors = list(rnorm(25)))

# A tibble: 10 x 2
      id vectors   
   <int> <list>    
 1     1 <dbl [25]>
 2     2 <dbl [25]>
 3     3 <dbl [25]>
 4     4 <dbl [25]>
 5     5 <dbl [25]>
 6     6 <dbl [25]>
 7     7 <dbl [25]>
 8     8 <dbl [25]>
 9     9 <dbl [25]>
10    10 <dbl [25]>

I'd like to use this data set to find cosine similarity, where each row represents a document. The cosine function from the lsa package seems like a good, easy way to do this, but it needs each document represented as a column. I'd like to simply do data %>% t() to get my desired result, but that's not working. I've also tried "spreading" the list column first using unnest and spread, and I've tried flatten, to no avail. The first line of my desired output would look something like:

  1    2    3    4    5    6    7    8    9    10
0.1  0.3  0.7  0.3  0.1  0.1  0.3  0.7  0.3  0.1

If a function from another package handles data in this format, I would by all means just use that instead, though at this point I'd like to figure this out from a curiosity standpoint. I've looked at "R - list to data frame", but I'm not sure how to apply that to this situation.

The background is that I've performed doc2vec in Python with gensim, but due to our environment at work, anything interactive I build for a client would need to be in R.

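library(dplyr)
library(tidyr)

data %>%
    # collapse each numeric vector into one comma-separated string
    mutate(vectors = sapply(vectors, function(x) paste(x, collapse = ","))) %>%
    # split the strings so that each value gets its own row (long format)
    separate_rows(vectors, sep = ",") %>%
    group_by(id) %>%
    # index the values within each id and convert them back to numeric
    mutate(numb = row_number(), vectors = as.numeric(vectors)) %>%
    # spread the indexed values into one column per position
    spread(key = numb, value = vectors)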

# A tibble: 10 x 26
# Groups:   id [10]
      id   `1`   `2`   `3`   `4`    `5`   `6`    `7`   `8`     `9`  `10`  `11`  `12`   `13`   `14`  `15`   `16`
   <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>
 1     1  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 2     2  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 3     3  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 4     4  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 5     5  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 6     6  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 7     7  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 8     8  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
 9     9  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
10    10  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
# ... with 9 more variables: `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
#   `24` <dbl>, `25` <dbl>

I find it's easiest to spread data by first gathering it into a long format. We achieve that using separate_rows. The problem is that we first need to transform the lists in vectors into something separate_rows can work with. We do that using paste with collapse="," inside sapply (otherwise all the lists would be pasted together).
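To illustrate why the paste call sits inside sapply, here is a minimal base R sketch. The toy list vecs is a hypothetical stand-in for the vectors list column and is not part of the original data.

```r
# Hypothetical toy list standing in for the `vectors` list column.
vecs <- list(c(1, 2), c(3, 4))

# Inside sapply, paste() runs once per element, so every vector is
# collapsed into its own comma-separated string.
per_row <- sapply(vecs, function(x) paste(x, collapse = ","))
print(per_row)
# [1] "1,2" "3,4"

# Called on the whole list, paste() collapses everything into ONE string,
# which can no longer be matched back to individual rows.
whole <- paste(vecs, collapse = ",")
print(length(whole))
# [1] 1
```

The first form returns one string per row, which is the shape separate_rows needs to split the values back out per id.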

Once we have that, it's just a matter of grouping by id, adding a row-index column (and converting the values back to numeric), and spreading to reach the desired format.
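As a possible alternative that skips the string round trip, the list of vectors can be bound straight into a matrix with one column per document, which is the layout the question asks for. The sketch below uses only base R and assumes every vector has the same length. The toy vectors list, the seed, and the hand-rolled cosine computation are illustrative assumptions, not the code from the question or the lsa implementation. Note also that list(rnorm(25)) in the question is recycled across all ten rows, so every document there holds the same vector, whereas the sketch draws a separate vector per document.

```r
set.seed(1)  # only to make the sketch reproducible

# Toy stand-in for the `vectors` list column: 10 documents, 25 values each.
vectors <- replicate(10, rnorm(25), simplify = FALSE)

# Bind the list elements side by side: one column per document.
m <- do.call(cbind, vectors)
colnames(m) <- seq_along(vectors)

# Cosine similarity between all pairs of columns:
# dot products divided by the product of the column norms.
norms <- sqrt(colSums(m^2))
cos_sim <- crossprod(m) / outer(norms, norms)

print(dim(m))
# [1] 25 10
print(dim(cos_sim))
# [1] 10 10
```

With the tibble from the question, do.call(cbind, data$vectors) should give the same kind of matrix, which could then be passed to a column-wise cosine function such as the one in lsa mentioned above.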

