Python: filter a large text file by quantile

Assume I am processing a very large text file, and I have the following pseudocode:

xx_valueList = []
lines = []
for line in file:
    xx_value = calc_xxValue(line)
    xx_valueList.append(xx_value)
    lines.append(line)

# get_quantile_value returns the cutoff value for the given quantile
cut_offvalue = get_quantile_value(xx_valueList, precent=0.05)
for line in lines:
    if calc_xxValue(line) > cut_offvalue:
        # do something here
        pass
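
The question does not show get_quantile_value. A minimal sketch of what such a helper might look like, assuming a nearest-rank definition of the quantile (other definitions interpolate between neighbours) and keeping the question's `precent` spelling for the keyword:

```python
import math

def get_quantile_value(values, precent=0.05):
    """Return the nearest-rank quantile of values (hypothetical helper).

    precent is a fraction between 0 and 1, spelled as in the question.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= precent <= 1:
        raise ValueError("precent must be between 0 and 1")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least precent of the data at or below it.
    rank = max(1, math.ceil(precent * len(ordered)))
    return ordered[rank - 1]
```

Sorting costs O(n log n) per call, so calling this repeatedly on a growing list gets expensive for large inputs.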

Note that the file is very large and may come from a pipe, so I don't want to read it twice.

We must read the entire file before we can compute the cutoff used to filter it.

The above method works, but it consumes too much memory. Is there an algorithmic optimization that can improve efficiency and reduce memory consumption?

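xx_value_list = []
cut_offvalue = 0
with open(file, 'r') as f:
    for line in f:
        xx_value = calc_xxValue(line)
        xx_value_list.append(xx_value)  # keep only the values, not the lines
        # re-estimate the cutoff every 100 lines from the values seen so far
        if len(xx_value_list) % 100 == 0:
            cut_offvalue = get_quantile_value(xx_value_list, precent=0.05)
        # the cutoff is a running estimate, so this filter is approximate
        if xx_value < cut_offvalue:
            # do something here
            pass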
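
If an exact cutoff is needed, one possible alternative is to keep only the numeric values in memory and spool the raw lines to a temporary file, which is replayed once the cutoff is known. The pipe is still read only once. This is a sketch, not a definitive solution: it assumes the values fit in memory as floats, that temporary disk space is available, and that every line except possibly the last ends with a newline. The names `filter_by_quantile` and `calc_value` are hypothetical (the latter stands in for calc_xxValue), and the nearest-rank cutoff is an assumed quantile definition.

```python
import heapq
import math
import tempfile
from array import array

def filter_by_quantile(stream, calc_value, precent=0.05):
    """Yield the lines of stream whose value exceeds the precent quantile.

    stream is any iterable of text lines, e.g. a file object or sys.stdin.
    calc_value is a hypothetical stand-in for calc_xxValue.
    """
    values = array("d")  # compact storage: 8 bytes per value
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="\n") as spool:
        # Single pass over the input: score each line and spool it to disk.
        for line in stream:
            values.append(calc_value(line))
            spool.write(line)
        if not values:
            return
        # Nearest-rank cutoff; nsmallest holds only `rank` values at a time.
        rank = max(1, math.ceil(precent * len(values)))
        cutoff = heapq.nsmallest(rank, values)[-1]
        # Replay the spooled lines; the i-th value belongs to the i-th line.
        spool.seek(0)
        for value, line in zip(values, spool):
            if value > cutoff:
                yield line
```

For example, `for line in filter_by_quantile(sys.stdin, calc_xxValue): ...` would consume a pipe in one pass. The trade-off is extra disk I/O in exchange for not holding the lines in memory.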


 