
Transform flair language model tensors for viewing in TensorBoard Projector

I want to convert "vectors,"

vectors = [token.embedding for token in sentence]
print(type(vectors))
<class 'list'>

print(vectors)
[tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
...
tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]

to

0.0077 -0.0227 -0.0004 ... 0.1377 -0.0003 0.0028
...
0.0003 -0.0461 0.0043 ... -0.0126 -0.0004 0.0142

and write that to a TSV.

Aside: those embeddings are from Flair ( https://github.com/zalandoresearch/flair ): how can I get the full output, rather than the abbreviated ( -0.0004 ... 0.1377 ) output?
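(On that aside: PyTorch elides long tensors with `...` when printing; raising the print threshold via `torch.set_printoptions` shows every element. A minimal sketch; the self-answer below uses `tolist()` instead:)

```python
import torch

# By default, PyTorch abbreviates tensors with more than ~1000 elements
# using "..."; raising the threshold prints every element:
torch.set_printoptions(threshold=100_000)

t = torch.zeros(2048)
print(t)  # full output, no "..." elision
```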

OK, I dug around ...

  1. It turns out those are PyTorch tensors (Flair uses PyTorch). To convert a tensor to a plain Python list, use tolist() , a PyTorch method (per the PyTorch docs at https://pytorch.org/docs/stable/tensors.html#torch.Tensor.tolist and this StackOverflow answer).

     >>> import torch
     >>> a = torch.randn(2, 2)
     >>> print(a)
     tensor([[-2.1693,  0.7698],
             [ 0.0497,  0.8462]])
     >>> a.tolist()
     [[-2.1692984104156494, 0.7698001265525818], [0.049718063324689865, 0.8462421298027039]]

  2. Per my original question, here's how to convert those data to plain text and write them to TSV files.

     from flair.data import Sentence
     from flair.embeddings import FlairEmbeddings, StackedEmbeddings
     from flair.models import SequenceTagger

     embeddings_f = FlairEmbeddings('pubmed-forward')
     embeddings_b = FlairEmbeddings('pubmed-backward')

     sentence = Sentence('The RAS-MAPK signalling cascade serves as a central node in transducing signals from membrane receptors to the nucleus.')

     tagger = SequenceTagger.load('ner')
     tagger.predict(sentence)

     stacked_embeddings = StackedEmbeddings([
         embeddings_f,
         embeddings_b,
     ])
     stacked_embeddings.embed(sentence)

     # for token in sentence:
     #     print(token)
     #     print(token.embedding)
     #     print(token.embedding.shape)

     tokens = [token for token in sentence]
     print(tokens)
     '''
     [Token: 1 The, Token: 2 RAS-MAPK, Token: 3 signalling, Token: 4 cascade, Token: 5 serves,
      Token: 6 as, Token: 7 a, Token: 8 central, Token: 9 node, Token: 10 in, Token: 11 transducing,
      Token: 12 signals, Token: 13 from, Token: 14 membrane, Token: 15 receptors, Token: 16 to,
      Token: 17 the, Token: 18 nucleus.]
     '''

     ## https://www.geeksforgeeks.org/python-string-split/
     tokens = [str(token).split()[2] for token in sentence]
     print(tokens)
     '''
     ['The', 'RAS-MAPK', 'signalling', 'cascade', 'serves', 'as', 'a', 'central', 'node', 'in',
      'transducing', 'signals', 'from', 'membrane', 'receptors', 'to', 'the', 'nucleus.']
     '''

     tensors = [token.embedding for token in sentence]
     print(tensors)
     '''
     [tensor([ 0.0077, -0.0227, -0.0004,  ...,  0.1377, -0.0003,  0.0028]),
      tensor([-0.0007, -0.1601, -0.0274,  ...,  0.1982,  0.0013,  0.0042]),
      tensor([ 4.2534e-03, -3.1018e-01, -3.9660e-01,  ...,  5.9336e-02, -9.4445e-05,  1.0025e-02]),
      tensor([ 0.0026, -0.0087, -0.1398,  ..., -0.0037,  0.0012,  0.0274]),
      tensor([-0.0005, -0.0164, -0.0233,  ..., -0.0013,  0.0039,  0.0004]),
      tensor([ 3.8261e-03, -7.6409e-02, -1.8632e-02,  ..., -2.8906e-03, -4.4556e-04,  5.6909e-05]),
      tensor([ 0.0035, -0.0207,  0.1700,  ..., -0.0193,  0.0017,  0.0006]),
      tensor([ 0.0159, -0.4097, -0.0489,  ...,  0.0743,  0.0005,  0.0012]),
      tensor([ 9.7725e-03, -3.3817e-01, -2.2848e-02,  ..., -6.6284e-02,  2.3646e-04,  1.0505e-02]),
      tensor([ 0.0219, -0.0677, -0.0154,  ...,  0.0102,  0.0066,  0.0016]),
      tensor([ 0.0092, -0.0431, -0.0450,  ...,  0.0060,  0.0002,  0.0005]),
      tensor([ 0.0047, -0.2732, -0.0408,  ...,  0.0136,  0.0005,  0.0072]),
      tensor([ 0.0072, -0.0173, -0.0149,  ..., -0.0013, -0.0004,  0.0056]),
      tensor([ 0.0086, -0.1151, -0.0629,  ...,  0.0043,  0.0050,  0.0016]),
      tensor([ 7.6452e-03, -2.3825e-01, -1.5683e-02,  ..., -5.4974e-04, -1.4646e-04,  6.6120e-03]),
      tensor([ 0.0038, -0.0354, -0.1337,  ...,  0.0060, -0.0004,  0.0102]),
      tensor([ 0.0186, -0.0151, -0.0641,  ...,  0.0188,  0.0391,  0.0069]),
      tensor([ 0.0003, -0.0461,  0.0043,  ..., -0.0126, -0.0004,  0.0142])]
     '''

     # ----------------------------------------
     ## Write those data to TSV files.
     ## https://stackoverflow.com/a/29896136/1904943

     import csv

     metadata_f = 'metadata.tsv'
     tensors_f = 'tensors.tsv'

     with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
         tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
         for token in tokens:
             ## Assign to a dummy variable ( _ ) to suppress the character counts
             ## that writerow() returns; pass [token] rather than token -- a bare
             ## string is iterated character by character, writing each character
             ## as a separate field:
             _ = tsv_writer.writerow([token])

     ## metadata.tsv (one token per line):
     '''
     The
     RAS-MAPK
     signalling
     cascade
     serves
     as
     a
     central
     node
     in
     transducing
     signals
     from
     membrane
     receptors
     to
     the
     nucleus.
     '''

     ## Alternatively, writerow(tokens) writes all tokens on a single,
     ## tab-separated line:
     with open(metadata_f, 'w', encoding='utf8', newline='') as tsv_file:
         tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
         _ = tsv_writer.writerow(tokens)

     with open(tensors_f, 'w', encoding='utf8', newline='') as tsv_file:
         tsv_writer = csv.writer(tsv_file, delimiter='\t', lineterminator='\n')
         for token in sentence:
             embedding = token.embedding
             _ = tsv_writer.writerow(embedding.tolist())

     ## tensors.tsv (18 lines: one embedding per token in metadata.tsv);
     ## note: enormous output, even for this simple sentence.
     '''
     0.007691788021475077	-0.02268664352595806	-0.0004340760060586035	...
     '''
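The TSV-writing step itself doesn't depend on Flair. Here is a minimal, self-contained sketch of the same pattern, with small hypothetical lists standing in for token.embedding.tolist() and an in-memory buffer standing in for the file:

```python
import csv
import io

# Hypothetical stand-ins for token.embedding.tolist() output:
rows = [
    [0.0077, -0.0227, 0.0028],
    [0.0003, -0.0461, 0.0142],
]

# io.StringIO() stands in for open('tensors.tsv', 'w', newline=''):
buf = io.StringIO()
tsv_writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
for row in rows:
    tsv_writer.writerow(row)  # one tab-separated embedding per line

print(buf.getvalue())
```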

  3. Last, my intention for all of that was to load contextual language embeddings (Flair, etc.) into TensorFlow's Embedding Projector. It turns out all I needed to do was convert those (here, Flair) data to NumPy arrays and load them into a TensorFlow TensorBoard instance (no need for TSV files!).

    I describe that in detail in my blog post: Visualizing Language Model Tensors (Embeddings) in TensorFlow's TensorBoard [TensorBoard Projector: PCA; t-SNE; ...].
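A sketch of that NumPy-stacking step, with hypothetical small vectors standing in for Flair token embeddings. The TensorBoard logging call ( add_embedding , from the torch.utils.tensorboard API) is shown commented out, assuming torch and tensorboard are installed:

```python
import numpy as np

# Hypothetical stand-ins for Flair token embeddings:
vectors = [np.array([0.0077, -0.0227]), np.array([0.0003, -0.0461])]

# Stack the per-token vectors into one matrix of shape (n_tokens, embedding_dim):
mat = np.stack(vectors)
print(mat.shape)

# With torch and tensorboard installed, the matrix can then be logged
# directly -- no TSV files needed:
# from torch.utils.tensorboard import SummaryWriter
# writer = SummaryWriter()
# writer.add_embedding(mat, metadata=['The', 'RAS-MAPK'])
# writer.close()
```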

To get the tokens you can use token.text , and token.embedding.tolist() to get the embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

def flair_embeddings(sentences, output_file=None):
    f = open(output_file, 'w') if output_file else None

    # init embedding
    flair_embedding_forward = FlairEmbeddings('news-forward')

    for i, sent in enumerate(sentences):
        print("Encoding the {}th input sentence!".format(i))
        # create a sentence
        sentence = Sentence(sent)

        # embed words in sentence
        flair_embedding_forward.embed(sentence)

        # one line per token: the token text, then its embedding values,
        # all tab-separated
        for token in sentence:
            line = token.text + "\t" + "\t".join(str(num) for num in token.embedding.tolist())
            if f:
                f.write(line + '\n')
            else:
                print(line)

    if f:
        f.close()
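The row-building line can be checked in isolation, with a hypothetical token text and embedding values in place of a real Flair token:

```python
# Hypothetical token text and embedding values:
text = 'The'
embedding = [0.0077, -0.0227, 0.0028]

# Token text first, then each embedding value, all tab-separated:
line = text + "\t" + "\t".join(str(num) for num in embedding)
print(line)
```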
