Python - Most efficient way to find how often each possible pair of words occurs in the same line in a text file?
This particular problem is easy to solve, but I'm not so sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!
What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?
For instance, if the text contained only the following two lines:
"This is the white baseball."
"These guys have white baseball bats."
You would end up collecting the following stats: (this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.
For the entry (baseball, white), the value would be 2, since this pair of words occurs in the same line a total of 2 times.
Ideally, the stats should be placed in a dictionary whose keys are tuples with the two words in alphabetical order (i.e., you wouldn't want separate entries for "this, is" and "is, this"). We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.
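To make the "alphabetized at the tuple level" requirement concrete, here is a minimal sketch of an order-independent key. The helper name `pair_key` is hypothetical (it does not come from the question), and lowercasing is an assumption based on the example stats, which list "this" for "This".

```python
def pair_key(a, b):
    """Return one canonical key for a pair of words (hypothetical helper).

    Assumption: words are compared case-insensitively, as in the example stats.
    """
    # Sorting the two lowercased words gives the same tuple for either order.
    return tuple(sorted((a.lower(), b.lower())))


# Both orderings map to the same dictionary key.
print(pair_key("this", "is") == pair_key("is", "this"))
```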
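from collections import defaultdict
import itertools as it
import re

pairs = defaultdict(int)
# `lines` is any iterable of strings, e.g. an open file object
for line in lines:
    # lowercase, dedupe, and sort so each pair has a single alphabetized key
    words = sorted(set(re.findall(r'\w+', line.lower())))
    for pair in it.combinations(words, 2):
        pairs[pair] += 1

# flatten to (word1, word2, count) tuples
resultList = [pair + (occurrences,) for pair, occurrences in pairs.items()]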
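As a self-contained variant of the same idea, the sketch below wraps the counting in a function and runs it on the two example lines from the question. The function name `count_pairs` is hypothetical, and it assumes that words are matched with `\w+`, compared case-insensitively, and counted at most once per line.

```python
from collections import Counter
from itertools import combinations
import re

WORD_RE = re.compile(r"\w+")


def count_pairs(lines):
    """Count how many lines each unordered word pair appears in (sketch)."""
    counts = Counter()
    for line in lines:
        # Unique, lowercased words in sorted order, so every pair
        # comes out as an alphabetized tuple.
        words = sorted(set(WORD_RE.findall(line.lower())))
        counts.update(combinations(words, 2))
    return counts


sample = [
    "This is the white baseball.",
    "These guys have white baseball bats.",
]
counts = count_pairs(sample)
print(counts[("baseball", "white")])  # 2
print(counts[("is", "this")])         # 1
```

For a large file, an open file object can be passed directly (for example `with open(path) as f: counts = count_pairs(f)`, where `path` is a placeholder), so lines are read one at a time. Note that a line with n distinct words contributes n(n-1)/2 pairs, so very long lines dominate both time and memory.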