
GloVe Word Mover's Similarity

I want to calculate text similarity using Relaxed Word Mover's Distance (RWMD). I have two different datasets (corpora); see below.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance",
  "x-ray left hand"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = F)

I am using the text2vec package in R.

library(text2vec)
library(stringr)
prep_fun = function(x) {
  x %>% 
    # make text lower case
    str_to_lower %>% 
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>% 
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
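As a quick check, applying prep_fun to one of the strings above lowercases it, replaces punctuation with spaces, and collapses repeated whitespace (the `%>%` pipe comes from magrittr, which text2vec re-exports):

```r
library(stringr)
library(magrittr)  # provides %>%

prep_fun = function(x) {
  x %>%
    str_to_lower %>%
    str_replace_all("[^[:alnum:]]", " ") %>%
    str_replace_all("\\s+", " ")
}

prep_fun("X-ray right leg arteries")
# "x ray right leg arteries"
```

Note that the hyphen in "X-ray" becomes a space, so "x" and "ray" end up as two separate tokens in the vocabulary.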
Combine both datasets:
C = rbind(A, B)

C$name = prep_fun(C$name)

it = itoken(C$name, progressbar = FALSE)
v = create_vocabulary(it) %>% prune_vocabulary()
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 3)
glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)
wv = glove_model$fit_transform(tcm, n_iter = 10)

# get average of main and context vectors as proposed in GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RWMD$new(wv)
# note the parentheses: nrow(A)+1:nrow(C) would evaluate as nrow(A) + (1:nrow(C))
rwmd_dist = dist2(dtm[1:nrow(A), ], dtm[(nrow(A)+1):nrow(C), ], method = rwmd_model, norm = 'none')

head(rwmd_dist)

          [,1]      [,2]      [,3]      [,4]
[1,] 0.1220713 0.7905035 0.3085216 0.4182328
[2,] 0.7043127 0.1883473 0.8031200 0.7038919
[3,] 0.1220713 0.7905035 0.3856520 0.4836772
[4,] 0.5340587 0.6259011 0.7146630 0.2513135
[5,] 0.3403019 0.5575993 0.7568583 0.5124514
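Since RWMD is a distance, the closest B entry for each A entry is the column with the smallest value in each row. Using the matrix printed above (a base-R sketch):

```r
# The rwmd_dist matrix printed above (rows = entries of A, cols = entries of B)
rwmd_dist <- matrix(c(
  0.1220713, 0.7905035, 0.3085216, 0.4182328,
  0.7043127, 0.1883473, 0.8031200, 0.7038919,
  0.1220713, 0.7905035, 0.3856520, 0.4836772,
  0.5340587, 0.6259011, 0.7146630, 0.2513135,
  0.3403019, 0.5575993, 0.7568583, 0.5124514
), nrow = 5, byrow = TRUE)

# Index of the closest B entry for each A entry
best <- apply(rwmd_dist, 1, which.min)
best
# 1 2 1 4 1
```

So, for example, A's "X-ray right leg arteries" is matched to B's "X-ray left leg arteries", and "consultation of gynecologist" to "consultation (inspection) of gynecalogist".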

Does skip_grams_window = 3 in tcm = create_tcm(it, vectorizer, skip_grams_window = 3) mean checking 3 words to the right while creating the co-occurrence matrix? For example, for the text 'X-ray right leg arteries' with target 'X-ray', would the counts be:

right   leg   arteries
    1     1          1

What is the purpose of word_vectors_size? I have read the GloVe algorithm but failed to understand the role of this argument.

glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3)

Note that create_tcm defaults to skip_grams_window_context = "symmetric", so a window of 3 counts co-occurrences up to 3 words on *both* sides of the target, not only to the right. I suggest specifying the skip_grams_window_context argument (valid values: "symmetric", "right", or "left") explicitly along with skip_grams_window. [Documentation]

The word_vectors_size argument defines the dimension of the underlying word vectors: each word is mapped to a vector in an N-dimensional vector space. There are a few articles with good explanations of word vectors (article 1 and article 2).

In your example, glove_model = GloVe$new(word_vectors_size = 10, vocabulary = v, x_max = 3), each word is represented by a 10-dimensional vector.

Picking a suitable dimension for the word vectors is important. According to this reply from October 2014:

The typical interval is between 100 and 300. I would say you need at least 50D to achieve reasonable accuracy. If you pick a smaller number of dimensions, you start to lose the properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200D, as it gives nice features. Extreme accuracy can be obtained with 300D. Beyond 300D, word features won't improve dramatically, and training will be extremely slow.
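In other words, after fitting, wv is a matrix with one word_vectors_size-dimensional row per vocabulary word, and similarity between words is computed on those rows (e.g. cosine similarity). A base-R sketch with made-up vectors (the numbers are random, purely illustrative, not real GloVe output):

```r
set.seed(42)
word_vectors_size <- 10
vocab <- c("x", "ray", "leg", "arteries")  # a toy vocabulary, for illustration

# After fit_transform(), wv has one row per vocabulary word and
# word_vectors_size columns; here we fill it with random numbers.
wv <- matrix(rnorm(length(vocab) * word_vectors_size),
             nrow = length(vocab), dimnames = list(vocab, NULL))
dim(wv)
# 4 10

# Word similarity is then, e.g., cosine similarity of two rows:
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(wv["leg", ], wv["arteries", ])
```

A larger word_vectors_size gives each word more coordinates to encode its context, at the cost of slower training.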
