Find combinations of items that occur in the same line of a text file, and their frequencies

Question

I have one text file and two lists of terms.

file = "the workers have human rights, the women have rights, the people have to work."

list1 = ['workers', 'rights']
list2 = ['have', 'the']

What I need is to find whether one item from list1 and one item from list2 occur in the same line of the file, and to calculate how often each such pair occurs across the whole file. I tried the following code, but it does not give the correct frequency.

freq = 0
result = []
for line in file.splitlines():
    for i in list1:
        for x in list2:
            if i in line and x in line:
                freq += 1
                result.append((i, x, freq))

Answer 1

Do this:

import itertools
import re

frequencies = {}
# open_file is an already opened file object. Iterating over it yields lines,
# so you don't need .splitlines(), and you shouldn't use file as a name.
for line in open_file:
    # Split the line into words, dropping punctuation such as the comma in "rights,"
    line = re.findall(r"\w+", line)
    list1_used = (x for x in list1 if x in line)
    list2_used = (x for x in list2 if x in line)
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] = frequencies.get(combination, 0) + 1

This will create a dictionary of the frequencies for each pair. For example, you might get something like {('rights', 'have'): 1, ('workers', 'have'): 1, ('rights', 'the'): 1, ('workers', 'the'): 1} if the line you gave were the only line in the file object. If you want to take into account how many times a given word shows up in a line, list1_used and list2_used become a little more complicated:

list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())
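As a rough illustration of what these two expressions produce, here is a small sketch run on the sentence from the question. It assumes the line has already been turned into a list of words with `re.findall` (my assumption, so that "rights," counts as "rights"); the variable `text` is only a stand-in for one line of the file.

```python
import re

list1 = ['workers', 'rights']
list2 = ['have', 'the']

# Stand-in for a single line read from the file (taken from the question).
text = "the workers have human rights, the women have rights, the people have to work."
line = re.findall(r"\w+", text)  # list of words, punctuation dropped

# Repeat each term once per occurrence in the line, then concatenate the tuples.
list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())

print(list1_used)  # ('workers', 'rights', 'rights')
print(list2_used)  # ('have', 'have', 'have', 'the', 'the', 'the')
```

Because each term is repeated once per occurrence, itertools.product then yields a pair for every combination of occurrences, so ('rights', 'have') would be counted 2 * 3 = 6 times for this line.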

It might be easier to use defaultdict here:

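from collections import defaultdict
import itertools
import re

frequencies = defaultdict(int)  # missing pairs start at 0
for line in open_file:
    line = re.findall(r"\w+", line)  # words without punctuation
    list1_used = ...  # either definition of list1_used from above
    list2_used = ...  # either definition of list2_used from above
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] += 1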
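Putting the pieces together, a minimal self-contained sketch might look like the following. The function name `pair_frequencies`, the use of `io.StringIO` in place of a real file, and the second sample line are all my own assumptions for illustration; this version counts each pair at most once per line.

```python
from collections import defaultdict
import io
import itertools
import re


def pair_frequencies(lines, list1, list2):
    """Count, per line, every (list1 item, list2 item) pair present in that line."""
    frequencies = defaultdict(int)
    for line in lines:
        words = set(re.findall(r"\w+", line))  # unique words, punctuation dropped
        list1_used = [x for x in list1 if x in words]
        list2_used = [y for y in list2 if y in words]
        for combination in itertools.product(list1_used, list2_used):
            frequencies[combination] += 1
    return dict(frequencies)


# io.StringIO behaves like an open text file; the second line is made up.
sample = io.StringIO(
    "the workers have human rights, the women have rights, the people have to work.\n"
    "the women have rights\n"
)
result = pair_frequencies(sample, ['workers', 'rights'], ['have', 'the'])
print(result)
# {('workers', 'have'): 1, ('workers', 'the'): 1, ('rights', 'have'): 2, ('rights', 'the'): 2}
```

With a real file you would pass the object returned by open(...) instead of the StringIO stand-in.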

Answer 1 (score 2, accepted, 2016-03-14 12:41:31)
