
How can I count specific bigram words?

I want to find and count specific bigram words, such as "red apple", in a text file. I have already turned the text file into a list of words, so I can't use a regex to count the whole phrase (i.e. the bigram). (Or can I?)

How can I count a specific bigram in the text file without using NLTK or other modules? Could a regex be a solution?
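If the text really is already a list of words, one module-free approach is to scan adjacent pairs with `zip`. This is a minimal sketch, not from the original answer; the token list below is illustrative.

```python
def count_bigram(tokens, first, second):
    """Count adjacent (first, second) pairs in a list of tokens."""
    # zip(tokens, tokens[1:]) yields every consecutive word pair
    return sum(1 for a, b in zip(tokens, tokens[1:])
               if a == first and b == second)

tokens = ['i', 'like', 'red', 'apples', 'and', 'green',
          'apples', 'but', 'i', 'like', 'red', 'apples', 'more']
print(count_bigram(tokens, 'red', 'apples'))
```

This counts only exact adjacent matches, so normalising case and punctuation beforehand matters.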

Why have you turned the text file into a list? It is also not memory-efficient. Instead, you can read the text directly with the file.read() method and search it as a string:

import re

text = 'I like red apples and green apples but I like red apples more.'
bigram = ['red apples', 'green apples']

for i in bigram:
    print('Found', i, len(re.findall(i, text)))

Output:

Found red apples 2
Found green apples 1
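One caveat with the plain `re.findall` approach above: it matches substrings, so a text containing "red applesauce" would also be counted. A hedged refinement (my addition, not part of the original answer) is to escape the phrase and anchor it with word boundaries:

```python
import re

text = 'I like red apples and my red applesauce.'
bigram = 'red apples'

# re.escape guards against regex metacharacters in the phrase;
# \b word boundaries prevent matching inside longer words,
# so 'red applesauce' is not counted here.
pattern = r'\b' + re.escape(bigram) + r'\b'
print(len(re.findall(pattern, text)))
```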

Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without NLTK or other modules, but in practice that is a very bad idea. You'll miss what you are looking for because the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating all kinds of statistics, and so on.

And think about this: why and how have you turned the lines into a list of words? Not only is this inefficient, but depending on exactly how you did it, you may have lost information about word order, mishandled punctuation, messed up upper/lower case, or made any of a million other mistakes. Which, again, is why NLTK is what you need.
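To illustrate the normalisation pitfalls mentioned above, here is a minimal standard-library sketch (my own example, not the asker's code) of tokenising a line while handling case and punctuation, so 'Apples,' and 'apples!' count as the same word:

```python
import string

line = 'I like Red Apples, and red apples!'

# Strip surrounding punctuation and lowercase each word,
# so 'Apples,' and 'apples!' both normalise to 'apples'.
tokens = [w.strip(string.punctuation).lower() for w in line.split()]
print(tokens)
```

This keeps word order intact, which is essential if you want to count bigrams from the token list afterwards.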
