Getting BERT embeddings for every token in a sentence

Question

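I have a dataframe in Python with a column of text data. I need to run a loop that takes each row of that text column and gets the BERT embedding for every token in that row. I then need to append those vector embeddings and try them out for a downstream purpose.

For example, for "My name is Obama":
get a 768-dimensional embedding for "My"
get a 768-dimensional embedding for "name"
get a 768-dimensional embedding for "is"
get a 768-dimensional embedding for "Obama"

Final output: a single vector embedding of size 768*4 = 3072, assuming every row has exactly the same number of words.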

Answer 1

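I believe you are trying to bring context-based embeddings of the individual words in a sentence into the picture, rather than fixed vectors such as GloVe. Your approach should be:

1. Split your paragraph into individual sentences (if applicable; look at sentence tokenizers or SBD (sentence boundary detection) methods).
2. For each sentence that makes up the paragraph, get the embeddings of its words.
3. Average them, so that you get a vector of consistent shape across paragraphs (in your case a dataframe cell, which is essentially a paragraph).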
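The averaging step can be sketched with plain NumPy. This is only an illustration under stated assumptions: the token embeddings are random toy data standing in for real model output, and `mean_pool` is a hypothetical helper name, not part of any library. It averages the token vectors of each sentence while ignoring padded positions, so every sentence ends up with a vector of the same size regardless of its length.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy stand-in for model output: 2 sentences, up to 3 tokens, hidden size 4
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 3, 4))
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sentence has 2 real tokens

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled.shape)  # (2, 4)
```

With a real model the hidden size would be 768 rather than 4, and the mask would come from the tokenizer.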

pip install sentence-transformers

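Once it is installed:

from sentence_transformers import SentenceTransformer

# Load a pretrained sentence-embedding model
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Sentences we would like to encode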
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Also take a look at embedding vectors in general and at the aggregation techniques used around embeddings.
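If you do want the per-token vectors concatenated as described in the question, a minimal sketch using the Hugging Face `transformers` library might look like the following. Several things here are assumptions rather than part of the answer: the `bert-base-uncased` checkpoint, the helper names `flatten_tokens` and `embed_rows`, and the choice to drop the `[CLS]` and `[SEP]` positions. Note that BERT tokenizes into word pieces, so a row can yield more tokens than it has words, and rows of different lengths give vectors of different sizes.

```python
import numpy as np


def flatten_tokens(token_vectors):
    """Concatenate a (num_tokens, hidden) array into one 1-D vector."""
    return np.asarray(token_vectors).reshape(-1)


def embed_rows(texts, model_name="bert-base-uncased"):
    """Return one flattened per-token embedding vector for each text."""
    # Heavy imports are kept local so flatten_tokens can be used on its own
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Load the tokenizer and model once, not once per row
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    rows = []
    for text in texts:
        encoded = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**encoded).last_hidden_state[0]  # (seq_len, hidden)
        # Drop the [CLS] and [SEP] positions added by BERT-style tokenizers
        rows.append(flatten_tokens(hidden[1:-1].numpy()))
    return rows
```

For a dataframe `df` with a text column named `text` (a hypothetical name), something like `df["embedding"] = embed_rows(df["text"].tolist())` would store one vector per row. If the rows differ in length, padding or the averaging approach described in this answer is needed to get a consistent shape.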
