
Finding the common words between two text corpora in NLTK

I am very new to NLTK and am trying to do something.

What would be the best way to find the common words between two bodies of text? Basically, I have one long text file, say text1, and another, say text2. I want to find the common words that appear in both files using NLTK.

Is there a direct way to do so? What would be the best approach?

Thanks!

It seems to me that unless you need to do something special with regard to language processing, you don't need NLTK:

words1 = "This is a simple test of set intersection".lower().split()
words2 = "Intersection of sets is easy using Python".lower().split()

intersection = set(words1) & set(words2)
print(intersection)
# {'of', 'is', 'intersection'}  (set order may vary)
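For two files on disk, the same idea applies. Here is a minimal stdlib-only sketch (the helper name `common_words` and a simple word-character tokenizer are assumptions, not anything from NLTK); if you need smarter tokenization, you could swap the regex for `nltk.word_tokenize`:

```python
import re

def common_words(path1, path2):
    """Return the set of words that appear in both files, case-insensitively."""
    def words(path):
        with open(path, encoding="utf-8") as f:
            # Naive tokenization: runs of letters/apostrophes, lowercased.
            return set(re.findall(r"[a-z']+", f.read().lower()))
    return words(path1) & words(path2)
```

Building a `set` per file keeps the intersection cheap even for long texts, since `&` on sets runs in roughly linear time in the smaller set.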
