
Find words out of vocabulary

I have some texts in a pandas DataFrame column, df['mytext'], and a vocabulary, vocab (a list of words).

I am trying to list and count the out-of-vocabulary words in each document.

I have tried the following, but it is quite slow for 10k documents.

How can I quickly and efficiently quantify the out-of-vocabulary tokens in a collection of texts in pandas?

OOV_text = df['mytext'].apply(lambda s: ' '.join([word for word in s.split() if word not in vocab]))
OOV = df['mytext'].apply(lambda s: sum([(word in vocab) for word in s.split()]) / len(s.split()))

You can use:

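from collections import Counter

vocab = ['word1', 'word2', 'word3', '2021']
df['mytext_list'] = df['mytext'].str.split(' ')
# Per document, add up the occurrences of every vocabulary word
df['count'] = df['mytext_list'].apply(lambda c: sum([Counter(c)[w] for w in vocab]))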
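Since the question asks for the out-of-vocabulary tokens themselves and for speed, here is a minimal sketch of another option, not taken from the answer above: convert vocab to a set once, so each membership test is a hash lookup instead of a scan through the list. The sample df, the helper oov_tokens, and the column names oov_tokens, oov_count and oov_ratio are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df and vocab.
df = pd.DataFrame({'mytext': ['word1 foo word2', 'bar baz 2021 bar', '']})
vocab = ['word1', 'word2', 'word3', '2021']

# Build the set once: average O(1) lookups instead of scanning the list per token.
vocab_set = set(vocab)

def oov_tokens(text):
    """Return the tokens of `text` that are not in the vocabulary, in order."""
    return [tok for tok in text.split() if tok not in vocab_set]

# List of OOV tokens per document (duplicates kept).
df['oov_tokens'] = df['mytext'].apply(oov_tokens)
# Number of OOV tokens per document.
df['oov_count'] = df['oov_tokens'].map(len)

# Total tokens per document; zero is masked to NaN so empty texts
# produce NaN instead of raising ZeroDivisionError.
n_tokens = df['mytext'].str.split().map(len)
df['oov_ratio'] = df['oov_count'] / n_tokens.where(n_tokens > 0)

print(df[['oov_tokens', 'oov_count', 'oov_ratio']])
```

Note that the second expression in the question sums `word in vocab`, which gives the in-vocabulary share of each document; the sketch computes the complement. How much faster this runs depends on the vocabulary size, so it is worth timing on the real data.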


 