GloVe Word Mover Similarity
I want to calculate text similarity using Relaxed Word Mover's Distance (RWMD). I have two different datasets (corpora). See below.
A <- data.frame(name = c(
"X-ray right leg arteries",
"consultation of gynecologist",
"x-ray leg arteries",
"x-ray leg with 20km distance",
"x-ray left hand"
), stringsAsFactors = F)
B <- data.frame(name = c(
"X-ray left leg arteries",
"consultation (inspection) of gynecalogist",
"MRI right leg arteries",
"X-ray right leg arteries with special care"
), stringsAsFactors = F)
I am using the text2vec package in R.
library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
Combine both datasets
C = rbind(A, B)
C$name = prep_fun(C$name)
it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)
# combine (sum) main and context vectors, as proposed in the GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
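# note the parentheses: `:` binds tighter than `+`, so nrow(A)+1:nrow(C) would select the wrong rows
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A)+1):nrow(C), ], method = rwmd_model, norm = 'none')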
head(rwmd_dist)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514
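To read the matrix above as "which row of B is closest to each row of A", a minimal sketch in base R follows. It copies the printed distances into a matrix (the name rwmd_out is only illustrative) and picks the column with the smallest distance in each row. Smaller distance means more similar.

```r
# Distances copied from the printed output above (rows: A, columns: B).
rwmd_out <- matrix(c(
  0.1220713, 0.7905035, 0.3085216, 0.4182328,
  0.7043127, 0.1883473, 0.8031200, 0.7038919,
  0.1220713, 0.7905035, 0.3856520, 0.4836772,
  0.5340587, 0.6259011, 0.7146630, 0.2513135,
  0.3403019, 0.5575993, 0.7568583, 0.5124514
), nrow = 5, byrow = TRUE)

# For every row of A, the index of the closest row of B.
best_match <- apply(rwmd_out, 1, which.min)

# Pair each A row with its closest B row and the corresponding distance.
matches <- data.frame(
  a_row    = seq_len(nrow(rwmd_out)),
  b_row    = best_match,
  distance = rwmd_out[cbind(seq_len(nrow(rwmd_out)), best_match)]
)
print(matches)
```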
Does skip_grams_window = 3 in tcm = create_tcm(it, vectorizer, skip_grams_window = 3) mean that the 3 words to the right are checked when creating the co-occurrence matrix? For example, the text 'X-ray right leg arteries' would become the following vector for the target 'X-ray':
right leg arteries
1 1 1
What is the use of word_vectors_size? I have read the GloVe algorithm but failed to understand what this argument does.
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
I suggest specifying skip_grams_window_context (valid values: "symmetric", "right", or "left") along with the skip_grams_window argument. [Documentation]
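To make the window idea concrete, here is a minimal base R sketch of right-context co-occurrence counting. The helper cooccur_right is hypothetical and is not part of text2vec. It assumes each neighbour is weighted by 1 / distance, which to my knowledge mirrors the default weights argument of create_tcm, so check the documentation before relying on that. Note also that prep_fun above replaces the hyphen with a space, so "X-ray" is tokenized as "x" and "ray".

```r
# Hypothetical helper (not a text2vec function): for each target word, add a
# weight for every word that follows it within `window` positions.
cooccur_right <- function(tokens, window = 3, weights = 1 / seq_len(window)) {
  vocab <- unique(tokens)
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      # row = target word, column = context word at distance d to the right
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + weights[d]
    }
  }
  m
}

# "X-ray right leg arteries" after prep_fun and whitespace tokenization
tokens <- c("x", "ray", "right", "leg", "arteries")
tcm_sketch <- cooccur_right(tokens, window = 3)

# Context weights seen by the target "x": only the next 3 words count
print(round(tcm_sketch["x", ], 3))
```

Under these assumptions the row for "x" holds 1 for "ray", 0.5 for "right", about 0.333 for "leg", and 0 for "arteries", which falls outside the window. With "symmetric" context, words to the left of the target would be counted as well.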
The word_vectors_size argument defines the dimension of the underlying word vectors. This means that each word is mapped to a vector in an N-dimensional vector space. There are a few articles with a good explanation of word vectors (article 1 and article 2).
In your example, glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3) implies 10-dimensional word vectors.
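As a rough illustration of what "10-dimensional word vectors" means, the base R sketch below builds a matrix with one row per word and word_vectors_size columns. The values are random placeholders rather than trained GloVe vectors, and the names wv_sketch and cos_sim are hypothetical. The only point is the shape of the result and how two word vectors can be compared.

```r
set.seed(42)  # reproducible placeholder values

vocab <- c("x", "ray", "leg", "hand", "arteries")
word_vectors_size <- 10

# One row per word, one column per dimension. A trained model would learn
# these numbers; here they are random stand-ins.
wv_sketch <- matrix(
  rnorm(length(vocab) * word_vectors_size),
  nrow = length(vocab),
  dimnames = list(vocab, NULL)
)

print(dim(wv_sketch))  # number of words x word_vectors_size

# Cosine similarity between two word vectors (always within [-1, 1]).
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
print(cos_sim(wv_sketch["leg", ], wv_sketch["hand", ]))
```

A larger word_vectors_size gives each word more coordinates to encode its relationships with other words, at the cost of more training time and a need for more data.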
Picking a suitable dimension for the word vectors is important. According to this reply from October 2014:
Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, I would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow.