GloVe Word Mover Similarity

I want to calculate text similarity using the relaxed word mover's distance. I have two different datasets (corpora). See below.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = F)

I am using the text2vec package in R.

library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
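
To see what this preprocessing does to the sample names, here is a minimal base-R sketch of the same three steps. `prep_base` is a hypothetical stand-in for `prep_fun` that avoids the stringr dependency; it is assumed, not verified, to behave the same way on inputs other than the ones shown.

```r
# Hypothetical base-R stand-in for prep_fun (same three steps, no stringr)
prep_base <- function(x) {
  x <- tolower(x)                     # make text lower case
  x <- gsub("[^[:alnum:]]", " ", x)   # replace non-alphanumeric symbols with spaces
  gsub("\\s+", " ", x)                # collapse multiple spaces
}

cleaned <- prep_base(c("X-ray right leg arteries",
                       "consultation (inspection) of gynecalogist"))
print(cleaned)

# Split the first cleaned name on spaces to see the resulting tokens
tokens <- strsplit(cleaned[1], " ", fixed = TRUE)[[1]]
print(tokens)
```

Because the hyphen is replaced by a space, "x-ray" enters the vocabulary as the two separate tokens "x" and "ray".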
# Combine both datasets
C = rbind(A, B)

C$name = prep_fun(C$name)

it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)

# sum the main and context vectors, as proposed in the GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# parentheses are needed: `:` binds tighter than `+`, so nrow(A)+1:nrow(C) would index past the last row
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A)+1):nrow(C), ], method = rwmd_model, norm = 'none')

head(rwmd_dist)

          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514
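
For context, each cell above is the relaxed word mover's distance between one name in A (rows) and one name in B (columns); smaller means more similar. The sketch below illustrates the general idea of the relaxation: every word sends all of its weight to the nearest word in the other text, and the larger of the two directions is kept. It uses made-up 2-D vectors, uniform word weights, and Euclidean distance. The names `wv_toy`, `word_dist`, and `rwmd_toy` are hypothetical, and text2vec's RWMD class may use a different word-level distance and normalisation, so these numbers are not comparable with the output above.

```r
# Toy 2-D word vectors (made-up numbers, for illustration only)
wv_toy <- rbind(
  left  = c(0.0, 1.0),
  right = c(0.1, 0.9),
  leg   = c(1.0, 0.0),
  hand  = c(0.9, 0.2)
)

# Euclidean distance between every word of text a (rows) and text b (columns)
word_dist <- function(a, b, wv) {
  d <- matrix(0, length(a), length(b), dimnames = list(a, b))
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      d[i, j] <- sqrt(sum((wv[a[i], ] - wv[b[j], ])^2))
    }
  }
  d
}

# Relaxed word mover's distance with uniform word weights (a sketch)
rwmd_toy <- function(a, b, wv) {
  d <- word_dist(a, b, wv)
  a_to_b <- mean(apply(d, 1, min))  # each word of a moves to its nearest word in b
  b_to_a <- mean(apply(d, 2, min))  # each word of b moves to its nearest word in a
  max(a_to_b, b_to_a)               # keep the tighter of the two lower bounds
}

same <- rwmd_toy(c("left", "leg"),  c("left", "leg"),  wv_toy)
near <- rwmd_toy(c("left", "leg"),  c("right", "leg"), wv_toy)
far  <- rwmd_toy(c("left", "hand"), c("right", "leg"), wv_toy)
print(c(same = same, near = near, far = far))
```

Identical texts get a distance of zero, and the distance grows as fewer words have a close counterpart in the other text.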

Does skip_grams_window = 3 in tcm = create_tcm(it, vectorizer, skip_grams_window = 3) mean that the 3 words to the right of each word are checked when the co-occurrence matrix is built? For example, for the text 'X-ray right leg arteries' and the target word 'X-ray', the vector would be:

right   leg   arteries
1       1     1

What is the purpose of word_vectors_size? I have read about the GloVe algorithm but could not understand what this parameter does.

glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)

I suggest specifying skip_grams_window_context (valid values: "symmetric", "right", or "left") along with the skip_grams_window argument. [Documentation]
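
To illustrate how the window direction changes what is counted, here is a minimal base-R sketch. `neighbours_in_window` is a hypothetical helper, not a text2vec function, and it uses flat counts. As far as I recall, create_tcm also has a weights argument that down-weights more distant words by default, so real matrix entries may differ from these raw counts; please check the documentation for your version.

```r
# Hypothetical helper: count the words that fall inside the window of a target word
neighbours_in_window <- function(tokens, target, window = 3,
                                 context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  offsets <- switch(context,
    right     = seq_len(window),
    left      = -seq_len(window),
    symmetric = c(-seq_len(window), seq_len(window))
  )
  found <- character(0)
  for (i in which(tokens == target)) {
    idx <- i + offsets
    idx <- idx[idx >= 1 & idx <= length(tokens)]  # drop positions outside the text
    found <- c(found, tokens[idx])
  }
  table(found)
}

# Tokens as they look after the preprocessing in the question
tokens <- c("x", "ray", "right", "leg", "arteries")

right_counts <- neighbours_in_window(tokens, "ray", window = 3, context = "right")
sym_counts   <- neighbours_in_window(tokens, "leg", window = 3, context = "symmetric")
print(right_counts)
print(sym_counts)
```

With context = "right" only the following words are counted, while "symmetric" also picks up the words before the target.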

The word_vectors_size argument defines the dimension of the underlying word vectors. This means that each word is mapped to a vector in an N-dimensional vector space. There are a few articles with a good explanation of word vectors (article 1 and article 2).

In your example, glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3) produces 10-dimensional word vectors.
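
As a rough picture of what that means, the word vectors can be thought of as a matrix with one row per vocabulary word and word_vectors_size columns. The sketch below fills such a matrix with random placeholder values (not trained GloVe vectors); `vocab`, `wv_toy`, and `cosine` are hypothetical names used only for illustration.

```r
set.seed(42)  # make the placeholder values reproducible

vocab <- c("x", "ray", "leg", "arteries")
word_vectors_size <- 10

# One row per word, one column per dimension (random values, not trained vectors)
wv_toy <- matrix(rnorm(length(vocab) * word_vectors_size),
                 nrow = length(vocab),
                 dimnames = list(vocab, NULL))
print(dim(wv_toy))

# Cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
print(cosine(wv_toy["leg", ], wv_toy["arteries", ]))
```

A larger word_vectors_size adds columns to this matrix, which gives the model more room to separate words but needs more data and training time.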

Picking a suitable dimension for the word vectors is important. According to this reply from October 2014:

"Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, I would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow."
