GloVe word mover similarity

Question

I want to calculate text similarity using relaxed word mover's distance. I have two different datasets (corpora), shown below.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = FALSE)

I am using the text2vec package in R.

library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
# combine both datasets
C = rbind(A, B)

C$name = prep_fun(C$name)

it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)

# sum of main and context vectors, as proposed in the GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# parentheses are required: nrow(A)+1:nrow(C) parses as nrow(A) + (1:nrow(C))
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A) + 1):nrow(C), ], method = rwmd_model, norm = 'none')

head(rwmd_dist)

          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514

Does skip_grams_window = 3 in tcm = create_tcm(it, vectorizer, skip_grams_window = 3) mean that 3 words to the right of the target are checked when the co-occurrence matrix is created? For example, the text 'X-ray right leg arteries' with the target 'X-ray' would become the vector:

right   leg arteries
1   1   1

What is the purpose of word_vectors_size? I have read about the GloVe algorithm but could not understand what this argument does.

glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)

Answer 1

I suggest specifying skip_grams_window_context (valid values: "symmetric", "right", or "left") along with the skip_grams_window argument. The default, "symmetric", looks at words on both sides of the target, so use "right" if you only want the words that follow it. [Documentation]
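To make the effect of the window direction concrete, here is a minimal base R sketch. It is not text2vec's implementation: window_cooccurrence is a hypothetical helper written only for this illustration, and it uses plain counts, whereas create_tcm (as far as I know) also down-weights more distant words by default. Note too that prep_fun replaces the hyphen with a space, so 'X-ray' is tokenized as two words, "x" and "ray".

```r
# Toy co-occurrence counter in base R (illustration only, not text2vec code).
# `window_cooccurrence` is a hypothetical helper, not part of any package.
window_cooccurrence <- function(tokens, window = 3,
                                context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  vocab <- unique(tokens)
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      # words up to `window` positions to the right of the target
      if (context != "left" && i + d <= n) {
        m[tokens[i], tokens[i + d]] <- m[tokens[i], tokens[i + d]] + 1
      }
      # words up to `window` positions to the left of the target
      if (context != "right" && i - d >= 1) {
        m[tokens[i], tokens[i - d]] <- m[tokens[i], tokens[i - d]] + 1
      }
    }
  }
  m
}

# "X-ray right leg arteries" after prep_fun and whitespace tokenization
tokens <- c("x", "ray", "right", "leg", "arteries")

right_only <- window_cooccurrence(tokens, window = 3, context = "right")
symmetric  <- window_cooccurrence(tokens, window = 3, context = "symmetric")

print(right_only["x", ])   # row for target "x": only the 3 following words count
print(symmetric["leg", ])  # row for target "leg": neighbours on both sides count
```

With context = "right" the row for the last word is all zeros, because nothing follows it; with "symmetric" the matrix is symmetric and each pair within the window is counted from both ends.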

The word_vectors_size argument defines the dimension of the underlying word vectors. This means that each word is mapped to a vector in an N-dimensional vector space. There are a few articles with a good explanation of word vectors (article 1 and article 2).

In your example, glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3), this means the word vectors have 10 dimensions.
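As a rough picture of what that shape looks like, the sketch below builds a placeholder matrix with one row per word and word_vectors_size columns. The values are random numbers, not trained embeddings (in your code the real matrix is what glove_model$fit_transform returns), and wv_toy and cosine are names made up for this illustration.

```r
# Placeholder "embedding" matrix: one row per vocabulary word,
# word_vectors_size columns. Values are random, NOT trained GloVe vectors.
set.seed(42)
vocab <- c("x", "ray", "leg", "arteries")
word_vectors_size <- 10

wv_toy <- matrix(rnorm(length(vocab) * word_vectors_size),
                 nrow = length(vocab),
                 dimnames = list(vocab, NULL))

print(dim(wv_toy))  # number of words by number of dimensions

# Cosine similarity between two word vectors (hypothetical helper)
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
sim <- cosine(wv_toy["leg", ], wv_toy["arteries", ])
print(sim)
```

Distances such as RWMD are computed between these rows, so a larger word_vectors_size gives each word more coordinates in which to differ from the others, at the cost of more training time and more data needed to fit them.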

Picking a suitable dimension for the word vectors is important. According to this reply from October 2014:

Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, i would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow.
