
How to use Elmo word embedding with the original pre-trained model (5.5B) in interactive mode

I am trying to learn how to use Elmo embeddings via this tutorial:

https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

I am specifically trying to use the interactive mode described there:

$ ipython
> from allennlp.commands.elmo import ElmoEmbedder
> elmo = ElmoEmbedder()
> tokens = ["I", "ate", "an", "apple", "for", "breakfast"]
> vectors = elmo.embed_sentence(tokens)

> assert(len(vectors) == 3)  # one for each layer in the ELMo output
> assert(len(vectors[0]) == len(tokens))  # the vector elements correspond with the input tokens

> import scipy
> vectors2 = elmo.embed_sentence(["I", "ate", "a", "carrot", "for", "breakfast"])
> scipy.spatial.distance.cosine(vectors[2][3], vectors2[2][3])  # cosine distance between "apple" and "carrot" in the last layer
0.18020617961883545

My overall question is: how do I make sure I am using the pre-trained ELMo model trained on the original 5.5B token set (described here: https://allennlp.org/elmo )?

I don't quite understand why we have to call "assert" or why we use the [2][3] indexing on the vector output.

My ultimate purpose is to average all the word embeddings in order to get a sentence embedding, so I want to make sure I do it right!

Thanks for your patience as I am pretty new in all this.

By default, ElmoEmbedder uses the "Original" weights and options, pre-trained on the 1 Billion Word Benchmark (about 800 million tokens). To use the largest model instead, look at the arguments of the ElmoEmbedder class: you can pass the options and weights files for the 5.5B model explicitly:

elmo = ElmoEmbedder(
    options_file='https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json', 
    weight_file='https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_5.5B/elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5'
)

I got these links from the pretrained models table provided by AllenNLP.


assert is a convenient way to test and enforce assumptions about the values of variables; the Python documentation is a good resource if you want to read more. For example, the first assert statement ensures the embedding has three output matrices.
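As a minimal sketch of what those assertions check (using a NumPy array as a stand-in for the real ELMo output, so no model download is needed):

```python
import numpy as np

tokens = ["I", "ate", "an", "apple", "for", "breakfast"]

# Stand-in for elmo.embed_sentence(tokens): ELMo returns an array of
# shape (3 layers, n tokens, 1024 dimensions)
vectors = np.zeros((3, len(tokens), 1024))

assert len(vectors) == 3               # one matrix per ELMo layer
assert len(vectors[0]) == len(tokens)  # one vector per input token
```

If either shape were wrong, the assert would raise an AssertionError immediately instead of letting a bad shape propagate silently.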


Going off of that, we index with [i][j] because the model outputs 3 layer matrices (where we choose the i-th) and each matrix has n token vectors (where we choose the j-th), each of length 1024. Notice how the code compares the similarity of "apple" and "carrot", both of which are the 4th token, at index j=3. From the example documentation, i represents one of:

The first layer corresponds to the context insensitive token representation, followed by the two LSTM layers. See the ELMo paper or follow up work at EMNLP 2018 for a description of what types of information is captured in each layer.

The paper provides the details on those two LSTM layers.
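To make the [i][j] indexing concrete, here is a sketch with a mock array of the same shape the real ElmoEmbedder returns (the values are random placeholders, not actual ELMo activations):

```python
import numpy as np

tokens = ["I", "ate", "an", "apple", "for", "breakfast"]

# Mock output with shape (3 layers, n tokens, 1024 dims)
vectors = np.random.rand(3, len(tokens), 1024)

top_layer = vectors[2]     # i = 2: the last (top) LSTM layer
apple_vec = vectors[2][3]  # j = 3: the 4th token, "apple"

assert top_layer.shape == (len(tokens), 1024)
assert apple_vec.shape == (1024,)
```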


Lastly, if you have a set of sentences, you don't need to feed them to ELMo one at a time. The model is character-based, so it works perfectly fine on whole tokenized sentences, and there are methods designed for working with sets of sentences: embed_sentences(), embed_batch(), etc. More in the code!
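If you do still want an averaged sentence embedding, a common recipe is to take one layer (e.g. the top one) and average over the token axis. A sketch with mock arrays standing in for real elmo.embed_sentences() output (shapes match ELMo's, values are placeholders):

```python
import numpy as np

# Stand-ins for elmo.embed_sentences([...]) results: one (3, n_tokens, 1024)
# array per sentence; token counts differ per sentence (here 6 and 4)
outputs = [np.random.rand(3, n, 1024) for n in (6, 4)]

# Average the top-layer token vectors -> one fixed-size vector per sentence
sentence_embeddings = [out[2].mean(axis=0) for out in outputs]

assert all(emb.shape == (1024,) for emb in sentence_embeddings)
```

Averaging over axis 0 of the chosen layer collapses the variable token dimension, so every sentence ends up with the same 1024-dimensional vector regardless of its length.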
