
Finding the common words between two text corpora in NLTK

I am very new to NLTK and am trying to do something.

What would be the best way to find the common words between two bodies of text? Basically, I have one long text file, say text1, and another, say text2. I want to find the common words that appear in both files using NLTK.

Is there a direct way to do so? What would be the best approach?

Thanks!

It seems to me that unless you need to do something special with regard to language processing, you don't need NLTK:

words1 = "This is a simple test of set intersection".lower().split()
words2 = "Intersection of sets is easy using Python".lower().split()

intersection = set(words1) & set(words2)
print(intersection)
# {'of', 'is', 'intersection'}  (set order may vary)
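For two files on disk, the same idea applies. Here is a minimal stdlib-only sketch (the helper name `common_words` and a simple word-character tokenizer are assumptions, not anything from NLTK); if you need smarter tokenization, you could swap the regex for `nltk.word_tokenize`:

```python
import re

def common_words(path1, path2):
    """Return the set of words that appear in both files, case-insensitively."""
    def words(path):
        with open(path, encoding="utf-8") as f:
            # Naive tokenization: runs of letters/apostrophes, lowercased.
            return set(re.findall(r"[a-z']+", f.read().lower()))
    return words(path1) & words(path2)
```

Building a `set` per file keeps the intersection cheap even for long texts, since `&` on sets runs in roughly linear time in the smaller set.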
