
Python - Most efficient way to find how often each possible pair of words occurs in the same line in a text file?

This particular problem is easy to solve, but I'm not sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!

What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?

For instance, if the text contained only the following two lines:

"This is the white baseball."
"These guys have white baseball bats."

You would end up collecting the following stats: (this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.

For the entry (baseball, white: 2), the value would be 2, since this pair of words occurs in the same line a total of 2 times.

Ideally, the stats should be placed in a dictionary whose keys are tuples with the two words in alphabetical order (i.e., you wouldn't want separate entries for "this, is" and "is, this"). We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.
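To illustrate the key requirement, here is a minimal sketch of an order-independent key. The helper name `pair_key` is hypothetical, and lowercasing the words is an assumption based on the example stats, which list "this" in lowercase.

```python
def pair_key(word_a, word_b):
    """Return an order-independent dictionary key for two words."""
    # Sorting the two words means ("this", "is") and ("is", "this")
    # both map to the same key, ("is", "this").
    return tuple(sorted((word_a.lower(), word_b.lower())))


counts = {}
for a, b in [("this", "is"), ("is", "this"), ("white", "baseball")]:
    key = pair_key(a, b)
    counts[key] = counts.get(key, 0) + 1

print(counts)  # {('is', 'this'): 2, ('baseball', 'white'): 1}
```

Both orderings of "this" and "is" land in a single entry, which is the behavior described above.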

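from collections import defaultdict
import itertools as it
import re

pairs = defaultdict(int)

# `lines` can be any iterable of strings, such as an open file object;
# the two example sentences from the question are used here.
lines = ["This is the white baseball.", "These guys have white baseball bats."]

for line in lines:
    # Lowercase, de-duplicate and sort the words so that each pair is counted
    # once per line and its tuple is always in alphabetical order.
    words = sorted(set(re.findall(r'\w+', line.lower())))
    for pair in it.combinations(words, 2):
        pairs[pair] += 1

# Flatten the dictionary into (word1, word2, count) tuples.
resultList = [pair + (occurrences,) for pair, occurrences in pairs.items()]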
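As an alternative sketch of the same idea, `collections.Counter` can take the place of the `defaultdict`, with the loop wrapped in a function that accepts any iterable of lines. The names `count_line_pairs` and `WORD_RE` are made up for illustration, and counting each pair at most once per line is an assumption about what the question intends.

```python
from collections import Counter
from itertools import combinations
import re

WORD_RE = re.compile(r"\w+")


def count_line_pairs(lines):
    """Count, for each unordered word pair, the lines containing both words."""
    pair_counts = Counter()
    for line in lines:
        # One sorted, de-duplicated word list per line: combinations() then
        # yields each pair once, already in alphabetical order.
        words = sorted(set(WORD_RE.findall(line.lower())))
        pair_counts.update(combinations(words, 2))
    return pair_counts


sample = [
    "This is the white baseball.",
    "These guys have white baseball bats.",
]
counts = count_line_pairs(sample)
print(counts[("baseball", "white")])  # 2
print(counts[("is", "this")])         # 1
```

For a large file, passing the open file object itself (for example `with open(path, encoding="utf-8") as fh: count_line_pairs(fh)`, where `path` is a placeholder) reads one line at a time instead of loading the whole file. The number of pairs per line still grows quadratically with the number of distinct words on that line, so very long lines dominate the running time, and the dictionary holds one entry per distinct pair seen.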

