Finding combinations of items on the same line of a text file, and their frequencies

Question

I have the text of a file and two lists of terms.

file = "the workers have human rights, the women have rights, the people have to work."

list1 = ['workers', 'rights']
list2 = ['have', 'the']

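What I need is to check whether an item from list1 and an item from list2 occur on the same line of the file, and to count how often each such combination occurs across the whole file text. I tried the following code, but it does not give the correct frequencies.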

freq = 0
result = []
for line in file.splitlines():
    for i in list1:
        for x in list2:
            if i in line and x in line:
                freq += 1
                result.append((i, x, freq))

Answer 1

Do this:

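import itertools
import re

# open_file is assumed to be an already-opened text file object.
frequencies = {}
for line in open_file:  # You don't need .splitlines() to iterate, and you shouldn't use file as a name
    # Extract bare words so that "rights," still matches "rights".
    line = re.findall(r"\w+", line)
    list1_used = (x for x in list1 if x in line)
    list2_used = (x for x in list2 if x in line)
    # Count every (list1 term, list2 term) pair present on this line.
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] = frequencies.get(combination, 0) + 1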

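This creates a dictionary holding the frequency of each pair. For example, you would get something like {('rights', 'have'): 1, ('workers', 'have'): 1, ('rights', 'the'): 1, ('workers', 'the'): 1} if the line you gave were the only line in the file object (this assumes punctuation is separated from the words, so that "rights," is matched as "rights"). If you want to take into account how many times a given word occurs on the line, list1_used and list2_used become slightly more complicated: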

list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())
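
To make the effect of the multiplicity-aware version concrete, here is a minimal sketch run on the sample sentence from the question. The `re.findall` tokenisation and the names `text` and `words` are assumptions made for illustration; they are not part of the original answer.

```python
import re

list1 = ['workers', 'rights']
list2 = ['have', 'the']

text = "the workers have human rights, the women have rights, the people have to work."
# Assumption: a "word" is a run of letters, digits or underscores.
words = re.findall(r"\w+", text)

# Repeat each term once per occurrence on the line, then flatten into one tuple.
list1_used = sum(((x,) * words.count(x) for x in list1), ())
list2_used = sum(((y,) * words.count(y) for y in list2), ())

print(list1_used)
print(list2_used)
```

Because 'rights' occurs twice and 'have' three times, the product of the two tuples contains the pair ('rights', 'have') six times, so its frequency grows by six for this line rather than by one.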

It might be easier to use a defaultdict here:

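from collections import defaultdict
import itertools
import re

frequencies = defaultdict(int)  # missing pairs start at 0
for line in open_file:
    line = re.findall(r"\w+", line)  # bare words, so "rights," matches "rights"
    list1_used = ...  # either definition of list1_used from above
    list2_used = ...
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] += 1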
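
For reference, the pieces above can be put together into something that runs end to end. This is only a sketch under a few assumptions: the function name `pair_frequencies` is hypothetical, the input is any iterable of lines (an open file object would also work), words are extracted with `re.findall`, and the second sample line is made up to show counts accumulating across lines.

```python
import itertools
import re
from collections import defaultdict


def pair_frequencies(lines, list1, list2):
    """Count on how many lines each (list1 term, list2 term) pair co-occurs."""
    frequencies = defaultdict(int)
    for line in lines:
        # A set gives fast membership tests; each pair counts once per line.
        words = set(re.findall(r"\w+", line))
        list1_used = [x for x in list1 if x in words]
        list2_used = [y for y in list2 if y in words]
        for combination in itertools.product(list1_used, list2_used):
            frequencies[combination] += 1
    return frequencies


# The first line is the sample from the question; the second is hypothetical.
sample = [
    "the workers have human rights, the women have rights, the people have to work.",
    "the women have rights",
]
result = pair_frequencies(sample, ['workers', 'rights'], ['have', 'the'])
print(dict(result))
```

The pairs involving 'rights' are found on both lines, while the pairs involving 'workers' are found only on the first.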

Answer 1: score 2, accepted, 2016-03-14 12:41:31