Find combinations of items on the same line of a text file and their frequency
I have a text file and two lists of terms.
file = "the workers have human rights, the women have rights, the people have to work."
list1 = ['workers', 'rights']
list2 = ['have', 'the']
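What I need is to find whether an item from list1 and an item from list2 occur on the same line of the file, and to count how often that happens across the whole text. I tried the following code, but it does not give the correct frequency.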
freq = 0
result = []
for line in file.splitlines():
    for i in list1:
        for x in list2:
            if i in line and x in line:
                freq += 1
                result.append((i, x, freq))
Do it like this:
import itertools
import string

frequencies = {}
# open_file is an already-open file object. You don't need .splitlines() to
# iterate over it, and you shouldn't use file as a name.
for line in open_file:
    # Split into words and strip surrounding punctuation so "rights," matches "rights"
    line = [word.strip(string.punctuation) for word in line.split()]
    list1_used = (x for x in list1 if x in line)
    list2_used = (x for x in list2 if x in line)
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] = frequencies.get(combination, 0) + 1
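This builds a dictionary of frequencies for each pair. For example, you would get something like {('rights', 'have'): 1, ('workers', 'have'): 1, ('rights', 'the'): 1, ('workers', 'the'): 1} if the line you gave were the only line in the file object. If you want to take into account how many times a given word occurs on the line, list1_used and list2_used become slightly more complicated: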
list1_used = sum((((x,) * line.count(x)) for x in list1), ())
list2_used = sum((((y,) * line.count(y)) for y in list2), ())
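To see what those two expressions produce, here is a minimal sketch run on the sample sentence from the question. The helper name repeated_terms is hypothetical, and stripping punctuation with string.punctuation is an assumption added so that "rights," is counted as "rights".

```python
import string

def repeated_terms(terms, tokens):
    """Return each term repeated once per occurrence in tokens, as one flat tuple."""
    # sum(..., ()) concatenates the per-term tuples, starting from an empty tuple.
    return sum(((t,) * tokens.count(t) for t in terms), ())

text = "the workers have human rights, the women have rights, the people have to work."
# Assumption: punctuation attached to a word should not prevent a match.
tokens = [w.strip(string.punctuation) for w in text.split()]

list1 = ['workers', 'rights']
list2 = ['have', 'the']

list1_used = repeated_terms(list1, tokens)
list2_used = repeated_terms(list2, tokens)
print(list1_used)  # ('workers', 'rights', 'rights')
print(list2_used)  # ('have', 'have', 'have', 'the', 'the', 'the')
```

Feeding these tuples to itertools.product would then count ('rights', 'have') six times for this sentence, since rights occurs twice and have three times.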
Using defaultdict might be easier here:
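from collections import defaultdict
import itertools
import string

frequencies = defaultdict(int)  # missing pairs start at 0
for line in open_file:
    # Split into words and strip surrounding punctuation so "rights," matches "rights"
    line = [word.strip(string.punctuation) for word in line.split()]
    list1_used = ...
    list2_used = ...
    for combination in itertools.product(list1_used, list2_used):
        frequencies[combination] += 1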
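Putting the pieces together, a complete version might look like the sketch below. It counts each pair at most once per line. The function name pair_frequencies is hypothetical, io.StringIO stands in for a real open file, and the sample sentence is split into three lines purely for the demonstration.

```python
from collections import defaultdict
import io
import itertools
import string

def pair_frequencies(lines, list1, list2):
    """Count, per line, every (list1 term, list2 term) pair that occurs together."""
    frequencies = defaultdict(int)
    for line in lines:
        # Assumption: strip surrounding punctuation so "rights," matches "rights".
        tokens = {w.strip(string.punctuation) for w in line.split()}
        list1_used = [x for x in list1 if x in tokens]
        list2_used = [x for x in list2 if x in tokens]
        for combination in itertools.product(list1_used, list2_used):
            frequencies[combination] += 1
    return dict(frequencies)

# io.StringIO behaves like an open text file when iterated line by line.
open_file = io.StringIO(
    "the workers have human rights,\n"
    "the women have rights,\n"
    "the people have to work.\n"
)
result = pair_frequencies(open_file, ['workers', 'rights'], ['have', 'the'])
print(result)
```

With this input, the pairs containing rights are counted twice (lines one and two) and the pairs containing workers once; the third line contributes nothing because it has no term from list1.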