Bigrams without repeated words

Question

I would like to analyze a text by counting bigrams. Unfortunately, my text has plenty of repeated words (like: hello hello) that I don't want to be counted as bigrams.

My code is the following:

import nltk

b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()

that returns:

>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])

but I would like the output to be:

>> [('a', 'test'), ('is', 'a'), ('this', 'is')]

Do you have any suggestions? Solutions using a different library are welcome too. Thank you in advance for any help. Francesca

Answer 1

Try filtering out the bigrams whose two words are identical:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
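Since the question also allows other libraries, the same filter can be sketched with only the standard library. This is a minimal sketch, not NLTK's implementation: the helper name count_bigrams is hypothetical, and it assumes the text is already split into a list of words.

```python
from collections import Counter


def count_bigrams(words):
    """Count adjacent word pairs, skipping pairs whose two words are equal.

    Hypothetical helper for illustration; not part of NLTK.
    """
    pairs = zip(words, words[1:])  # (w0, w1), (w1, w2), ...
    return Counter(pair for pair in pairs if pair[0] != pair[1])


words = "this this is is a a test test".split()
counts = count_bigrams(words)

# Sorting the keys gives the alphabetical order shown in the question.
print(sorted(counts))
```

The Counter keeps the frequency of each remaining bigram, so counts.most_common() can be used if the counts matter and not just the keys.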

Edit: If your texts are stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense', 'this song says na na na', 'this is very very very very annoying']})

def create_bigrams(text):
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you only want the output without modifying df, replace the last two lines with:

df["Text"].apply(create_bigrams).apply(print)

Answer 2

You could remove consecutive duplicate words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:

words = 'this this is is a a test test'.split()
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]

output:

['this', 'is', 'a', 'test']

and then do:

b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
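As an alternative to the zip comparison above, consecutive duplicates can also be collapsed with itertools.groupby, which groups adjacent equal items. This is a sketch under the same assumption that the input is a list of words; the helper name collapse_repeats is hypothetical.

```python
from itertools import groupby


def collapse_repeats(words):
    """Keep one word from each run of identical adjacent words.

    Hypothetical helper for illustration; only adjacent repeats are
    collapsed, so a word may still appear again later in the list.
    """
    return [word for word, _ in groupby(words)]


words = "this this is is a a test test".split()
print(collapse_repeats(words))
```

The resulting list can then be passed to BigramCollocationFinder.from_words in the same way as removed_duplicates.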
