Finding out-of-vocabulary words

Question

I have some texts in a pandas DataFrame column, df['mytext'], and a vocabulary, vocab (a list of words).

I am trying to list and count the out-of-vocabulary words in each document.

I have tried the code below, but it is quite slow for 10k documents.

How can I quickly and efficiently quantify the out-of-vocabulary tokens in a collection of texts in pandas?

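# Out-of-vocabulary words of each document, joined into one string
OOV_text = df['mytext'].apply(lambda s: ' '.join([word for word in s.split() if word not in vocab]))
# Note: as written, this is the fraction of words that ARE in vocab
OOV = df['mytext'].apply(lambda s: sum([(word in vocab) for word in s.split()]) / len(s.split()))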

Answer 1

You can use:

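from collections import Counter

vocab = ['word1', 'word2', 'word3', '2021']
df['mytext_list'] = df['mytext'].str.split(' ')
# Number of tokens in each document that appear in vocab (in-vocabulary count)
df['count'] = df['mytext_list'].apply(lambda c: sum([Counter(c)[w] for w in vocab]))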
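
Since vocab is a list in the question, each `word not in vocab` test scans the whole list, which is a likely cause of the slowness. A minimal sketch of one alternative, assuming whitespace tokenization as in the question: convert vocab to a set once, then derive the out-of-vocabulary words, their count, and their ratio per document. The sample data and the column names oov_text, oov_count and oov_ratio are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; 'mytext' and vocab follow the names in the question.
df = pd.DataFrame({'mytext': ['word1 foo word2', 'bar baz', 'word3 2021 word1 qux']})
vocab = ['word1', 'word2', 'word3', '2021']

# A set gives average constant-time membership tests; a list is scanned linearly.
vocab_set = set(vocab)

# Tokenize each document once on whitespace.
tokens = df['mytext'].str.split()

# Keep only the tokens that are not in the vocabulary.
oov_lists = tokens.apply(lambda ts: [t for t in ts if t not in vocab_set])

df['oov_text'] = oov_lists.str.join(' ')               # OOV words as one string
df['oov_count'] = oov_lists.str.len()                  # number of OOV tokens
df['oov_ratio'] = df['oov_count'] / tokens.str.len()   # NaN for empty documents

print(df[['oov_text', 'oov_count', 'oov_ratio']])
```

With a set, the cost per document depends on the number of tokens and no longer on the size of the vocabulary, so the gain should grow with larger vocabularies.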

Solution 1
0 2022-01-02 10:57:20