Bigram without repeated words

I would like to analyze a text by counting bigrams. Unfortunately, my text contains many repeated words (for example: hello hello) that I don't want counted as bigrams.

My code is the following:

import nltk

b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()

which returns:

>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])

but I would like the output to be:

>> [('a', 'test'), ('is', 'a'), ('this', 'is')]

Do you have any suggestions, including ones that use a different library? Thank you in advance for any help. Francesca

Try:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
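
Since the question is open to other libraries, here is a minimal sketch of the same filter using only the standard library. It assumes whitespace tokenization as in the question, and the function name bigram_counts is made up for illustration. It keeps the counts as well as the bigram keys:

```python
from collections import Counter

def bigram_counts(words):
    """Count adjacent word pairs, skipping pairs whose two words are equal."""
    pairs = zip(words, words[1:])  # (w0, w1), (w1, w2), ...
    return Counter(pair for pair in pairs if pair[0] != pair[1])

counts = bigram_counts('this this is is a a test test'.split())
print(sorted(counts))
# [('a', 'test'), ('is', 'a'), ('this', 'is')]
```

Iterating over a Counter yields its keys, so sorted(counts) gives the sorted list of bigrams asked for in the question, and counts[('this', 'is')] gives the frequency of a single bigram.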

Edit: If your texts are stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense', 'this song says na na na', 'this is very very very very annoying']})

def create_bigrams(text):
    # build the bigrams for one text and drop those made of the same word twice
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

This first adds a column containing the bigrams to the DataFrame and then prints the column values. If you only want the output without modifying df, replace the last two lines with:

df["Text"].apply(create_bigrams).apply(print)

You could remove consecutive duplicated words before passing the list to nltk.collocations.BigramCollocationFinder.from_words. Each word is compared with the one before it and kept only if it differs:

words = 'this this is is a a test test'.split()
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]

Output:

['this', 'is', 'a', 'test']

and then do:

b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
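
As an alternative to the zip comparison, consecutive repeats can also be collapsed with itertools.groupby from the standard library. This is only a sketch under the same whitespace-splitting assumption; the variable names collapsed and bigrams are illustrative:

```python
from itertools import groupby

words = 'this this is is a a test test'.split()

# groupby yields one (key, group) pair per run of equal adjacent items,
# so taking only the keys keeps a single word per run
collapsed = [word for word, _ in groupby(words)]
print(collapsed)
# ['this', 'is', 'a', 'test']

# pair each remaining word with its successor
bigrams = list(zip(collapsed, collapsed[1:]))
print(bigrams)
# [('this', 'is'), ('is', 'a'), ('a', 'test')]
```

Only adjacent repeats are collapsed, so a word that reappears later in the text is kept. The collapsed list can be passed to from_words exactly like removed_duplicates.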
