
Find combinations of items in the same line of a text file and their frequency

I have a text file and two lists of terms.

file = "the workers have human rights, the women have rights, the people have to work."

list1 = ['workers', 'rights']
list2 = ['have', 'the']

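What I need is to find whether an item from list1 and an item from list2 occur in the same line of the file, and to count how often each such pair occurs across the whole text. I tried the following code, but it does not give the correct frequency.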

freq = 0
result = []
for line in file.splitlines():
    for i in list1:
        for x in list2:
            if i in line and x in line:
                freq += 1
                result.append((i, x, freq))
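
To see why the attempt above gives misleading numbers, it can help to run it on the sample sentence. The sketch below assumes list1 was meant to hold two separate strings, 'workers' and 'rights', and renames file to text (a name chosen here for illustration). Because the single counter freq is shared by every pair, each tuple records a running total instead of that pair's own frequency.

```python
# Sample data from the question; list1 is assumed to be two separate terms.
text = "the workers have human rights, the women have rights, the people have to work."
list1 = ['workers', 'rights']
list2 = ['have', 'the']

freq = 0
result = []
for line in text.splitlines():
    for i in list1:
        for x in list2:
            # "in" on a string is a substring test, not a whole-word test.
            if i in line and x in line:
                freq += 1  # one counter shared by all pairs
                result.append((i, x, freq))

print(result)
# [('workers', 'have', 1), ('workers', 'the', 2), ('rights', 'have', 3), ('rights', 'the', 4)]
```

The last element of each tuple only reflects the order in which the pairs were visited, so a separate counter per pair is needed.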

Do this:

import itertools

frequencies = {}
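# open_file stands for an already opened file object. Iterating over it yields
# lines, so .splitlines() is not needed; also avoid using file as a name.
for line in open_file:
    # Split on whitespace; punctuation stays attached to the words.
    line = line.strip().split()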
    list1_used = (x for x in list1 if x in line)
    list2_used = (x for x in list2 if x in line)
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] = frequencies.get(combination, 0) + 1

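This builds a dictionary with the frequency of each pair. For example, you might get something like {('rights', 'have'): 1, ('workers', 'have'): 1, ('rights', 'the'): 1, ('workers', 'the'): 1} if the line you gave is the only line in the file object. Note that .split() leaves punctuation attached, so a token such as 'rights,' only matches 'rights' once the punctuation has been stripped. If you want to take into account how many times a given word occurs, list1_used and list2_used become slightly more complicated: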

list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())
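
As a rough, self-contained sketch of both variants on the sample sentence: the helper name pair_frequencies and its count_repeats flag are hypothetical, and the regular expression used to strip punctuation is an assumption about what counts as a word. The repeated-term lists are built with a list comprehension here instead of the tuple sum, which should produce the same terms.

```python
import itertools
import re

def pair_frequencies(lines, list1, list2, count_repeats=False):
    """Count (list1 term, list2 term) pairs that share a line."""
    frequencies = {}
    for line in lines:
        # Assumption: a word is a run of letters/digits, so "rights," -> "rights".
        words = re.findall(r"\w+", line)
        if count_repeats:
            # Repeat each term once per occurrence in the line.
            used1 = [x for x in list1 for _ in range(words.count(x))]
            used2 = [y for y in list2 for _ in range(words.count(y))]
        else:
            # Keep each term at most once per line.
            used1 = [x for x in list1 if x in words]
            used2 = [y for y in list2 if y in words]
        for combination in itertools.product(used1, used2):
            frequencies[combination] = frequencies.get(combination, 0) + 1
    return frequencies

text = "the workers have human rights, the women have rights, the people have to work."
list1 = ['workers', 'rights']
list2 = ['have', 'the']

once = pair_frequencies(text.splitlines(), list1, list2)
repeated = pair_frequencies(text.splitlines(), list1, list2, count_repeats=True)
print(once)      # every pair counted once for this single line
print(repeated)  # pairs weighted by how often each word occurs
```

With count_repeats=True, a pair's count is the product of the two words' occurrences in the line, so ('rights', 'have') comes out as 2 * 3 = 6 for this sentence.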

It may be easier to use a defaultdict here:

from collections import defaultdict
import itertools

frequencies = defaultdict(int)
for line in open_file:
    line = line.strip().split()
    list1_used = ...
    list2_used = ...
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] += 1
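
Another option, not part of the answer above, is collections.Counter, whose update method increments a count for every item of an iterable. The sketch below is only an illustration: io.StringIO stands in for the opened file, and its two-line content is made up for the example.

```python
from collections import Counter
import io
import itertools

list1 = ['workers', 'rights']
list2 = ['have', 'the']

# io.StringIO stands in for an opened text file (hypothetical content).
open_file = io.StringIO("the workers have rights\nthe women have rights\n")

frequencies = Counter()
for line in open_file:
    words = line.split()
    list1_used = [x for x in list1 if x in words]
    list2_used = [y for y in list2 if y in words]
    # Counter.update adds 1 for each pair yielded by the product.
    frequencies.update(itertools.product(list1_used, list2_used))

# most_common() lists the pairs from most to least frequent.
print(frequencies.most_common())
```

Here 'rights' shares a line with 'have' and 'the' twice, while 'workers' does so once, and a pair that never occurs simply reads as 0 when looked up.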
