
BERT Word Embedding for a column of a pandas data frame

I'm working on an NLP project using the Tamil Universal Dependency dataset. I have preprocessed the data into a data frame whose columns are the tokens and their dependency tags. I would like to perform word embedding using the mBERT model. The dataset is already tokenized, as seen in the attached data frame. I'm not sure how to proceed, because when the tokens are converted to token IDs, they are wrongly marked by the tokenizer.

b  # list of tokens

Data frame:

(screenshot of the data frame)

Token ID error:

(screenshot of the tokenizer output)

You can find some example code and explanations here: https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958
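Since your text is already split into tokens, the usual fix for the wrongly marked token IDs is to pass the token list to a HuggingFace `transformers` tokenizer with `is_split_into_words=True`, then use `word_ids()` to map each subword back to its source token and average the subword vectors per word. The alignment step itself is model-independent; here is a minimal sketch with synthetic vectors (the function name and toy numbers are my own, not from the thread above):

```python
import numpy as np

# With HuggingFace transformers one would typically run something like:
#   enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
#   word_ids = enc.word_ids()   # one entry per subword; None for [CLS]/[SEP]
#   hidden = model(**enc).last_hidden_state[0]
# The alignment/averaging below works on the resulting arrays.

def average_subword_embeddings(word_ids, subword_embeddings):
    """Average the subword vectors belonging to each original word.

    word_ids: list with one entry per subword; None marks special tokens,
              integers index into the original token list.
    subword_embeddings: array of shape (num_subwords, hidden_dim).
    """
    n_words = max(i for i in word_ids if i is not None) + 1
    dim = subword_embeddings.shape[1]
    sums = np.zeros((n_words, dim))
    counts = np.zeros(n_words)
    for wid, vec in zip(word_ids, subword_embeddings):
        if wid is None:          # skip [CLS], [SEP], padding
            continue
        sums[wid] += vec
        counts[wid] += 1
    return sums / counts[:, None]

# Synthetic example: 2 words, the second split into 2 subwords.
word_ids = [None, 0, 1, 1, None]   # [CLS] w0 w1a w1b [SEP]
sub = np.array([[9., 9.], [1., 2.], [3., 4.], [5., 6.], [9., 9.]])
word_vecs = average_subword_embeddings(word_ids, sub)
print(word_vecs)   # word 0 -> [1, 2]; word 1 -> [4, 5]
```

This gives one vector per original token, so the result lines up row-for-row with your data frame column.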

Important point: the added value of BERT is that it generates contextualised embeddings, i.e. embeddings for longer sequences of text (the context), where the embedding of each individual word changes depending on the surrounding words. If you only want static embeddings for individual words (independent of context), then BERT is not the right tool, and it's better to use static embeddings like GloVe, Word2Vec, or FastText. It is well known that BERT does not produce good embeddings for isolated words.

What makes sense for you depends on your use case, but the way you have preprocessed your text suggests that you actually want static embeddings.
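To see the distinction concretely, here is a toy sketch of static embeddings (the vocabulary and vectors are made up; a real table would be loaded from GloVe/Word2Vec/FastText files):

```python
# Toy illustration of *static* embeddings: one fixed vector per word,
# regardless of surrounding context. Vectors here are invented.
static_table = {
    "bank": [0.1, 0.9],
    "river": [0.8, 0.2],
    "money": [0.3, 0.7],
}

def embed_static(sentence):
    """Look up each word independently; context is ignored."""
    return [static_table[w] for w in sentence if w in static_table]

a = embed_static(["river", "bank"])
b = embed_static(["money", "bank"])
# "bank" gets the identical vector in both sentences -- exactly what a
# contextual model like BERT would *not* do.
print(a[1] == b[1])   # True
```

A per-token lookup like this maps directly onto a data frame column, which is why static embeddings fit your preprocessing so naturally.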

