Python 按分位数过滤较大的文本

Question

Assume I am process a very large text file, I have the following pseudocode假设我正在处理一个非常大的文本文件，我有以下伪代码

xx_valueList = []
lines=[]
with line in file: 
    xx_value = calc_xxValue(line)
    xx_valueList.append(xx_value)
    lines.append(lines)

# get_quantile_value is a function return the cutoff value with a specific quantile precent
cut_offvalue = get_quantile_value(xx_valueList, precent=0.05)
for line in lines: 
    if calc_xxValue(line) > cut_offvalue: 
         # do someting here

Note that the file is very large and may come from a pipe, so I don't want to read it twice.注意文件很大，可能来自一个pipe，不想看两遍。

We must read the entire file before we can get the cutoff to filter file我们必须先读取整个文件才能获得过滤文件的截断值

The above method can work, but it consumes too much memory, is there some algorithmic optimization that can improve efficiency and reduce memory consumption?上面的方法可以，但是memory的消耗太大了，有没有什么算法优化可以提高效率，减少memory的消耗？

Answer 1

xx_value_list = []
cut_offvalue = 0
with open(file, 'r') as f:
    for line in f:
        xx_value = calc_xxValue(line)
        xx_value_list.append(xx_value)
        if len(xx_value_list) % 100 == 0:
            cut_offvalue = get_quantile_value(xx_value_list, precent=0.05)
        if xx_value < cut_offvalue: 
            # do something here
            pass

Python 按分位数过滤较大的文本

问题描述

1 个解决方案

解决方案1
1 已采纳 2023-01-14 08:33:15

Python 按分位数过滤较大的文本

问题描述

1 个解决方案

解决方案1 1 已采纳 2023-01-14 08:33:15

解决方案1
1 已采纳 2023-01-14 08:33:15