
Efficiently removing elements from a list

I have a huge list of objects (about 500k elements):

class Signal:
    def __init__(self, fq, t0, tf):
        self.fq = fq
        self.t0 = t0
        self.tf = tf

    def __eq__(self, s):
        """ == comparison method."""
        return self.fq == s.fq

    def __ne__(self, s):
        """ != comparison method."""
        return not self.__eq__(s)

t0, tf = 0, 200
signals = [[Signal(f1, t0, tf), Signal(f2, t0, tf), Signal(f3, t0, tf),
            Signal(f4, t0, tf), Signal(f5, t0, tf), Signal(f6, t0, tf)]
           for f1 in frequencies for f2 in frequencies for f3 in frequencies
           for f4 in frequencies for f5 in frequencies for f6 in frequencies]

My program maps the list and generates .pkl files with a specific name for each element of the list.

def file_namer(signals):
    frequencies = tuple([s.fq for s in signals])
    return "F{}.pkl".format(frequencies)

Some of the elements of the list have already been computed, or a permutation of the element has already been computed, so I would like to remove them before mapping.

import itertools
import os

folder = "folder_of_the_pkl"
files = os.listdir(folder)

def is_computed(files, s):
    possibilities = list()
    for elt in itertools.permutations(s):
        possibilities.append(file_namer(s))

    if any([name in files for name in possibilities]):
        return True
    else:
        return False

s_to_remove = list()
for s in signals:
    if is_computed(files, s):
        s_to_remove.append(s)

for elt in s_to_remove:
    signals.remove(elt)

That is what I came up with. It is not very efficient, and I'd be glad to see your suggestions for improving it!

Thanks!

NB: This is a fairly simplified version of my program. The objects are far heavier (10+ parameters).

I would suggest that you don't remove items from the list. Build another one instead:

signals_ = list()
for s in signals:
    if not is_computed(files, s):
        signals_.append(s)

signals = signals_
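Equivalently, as a single list comprehension (same logic, just more compact):

# Keep only the signals whose result file does not exist yet
signals = [s for s in signals if not is_computed(files, s)]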

Then I would look at your is_computed function and see if one can avoid building the list of possibilities:

def is_computed(files, s):
    for elt in itertools.permutations(s):
        name = file_namer(elt)
        if name in files:
            return True

    return False

The test name in files would be faster if files is a set.
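For example, a small sketch reusing the folder variable from the question:

import os

folder = "folder_of_the_pkl"      # same folder as in the question
files = set(os.listdir(folder))   # a set gives O(1) average membership tests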

Better still:

Parse each of the filenames such as this one:

>>> "F{}.pkl".format((1,2,3))
'F(1, 2, 3).pkl'

Back into the tuple:

>>> import ast
>>> ast.literal_eval('F(1, 2, 3).pkl'[1:].split('.')[0])
(1, 2, 3)

Then you can avoid the permutations call by sorting tuples like (1,2,3) and (2,3,1) into the same order:

>>> sorted((1,2,3)) == sorted((2,3,1))
True

Comparing the sorted versions makes the order of the frequencies irrelevant.

So, to extend this into an is_computed replacement: files is turned from ['F(1, 2, 3).pkl'] into {(1, 2, 3): 'F(1, 2, 3).pkl'}, and is_computed becomes:

import ast

# Index the existing filenames by their sorted frequency tuple
files = {tuple(sorted(ast.literal_eval(name[1:].split('.')[0]))): name
         for name in files}

def is_computed(files, signal):
    key = tuple(sorted(s.fq for s in signal))
    return key in files
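As a quick check of the permutation handling (my own example; it assumes a file named 'F(1, 2, 3).pkl' is present in the folder and that the frequencies are integers):

>>> is_computed(files, [Signal(2, 0, 200), Signal(3, 0, 200), Signal(1, 0, 200)])
True

The lookup succeeds even though the frequencies are given in a different order.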

First, if you have so many of these objects you may want to consider __slots__, namedtuples, pandas DataFrames or numpy ndarrays. This would reduce the cost of each item considerably, removing per-object dictionaries or even per-row object metadata.
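For instance, a minimal sketch of the __slots__ and namedtuple variants (attribute names taken from the question; the real objects would have more fields):

from collections import namedtuple

class Signal:
    # __slots__ removes the per-instance __dict__, saving memory per object
    __slots__ = ("fq", "t0", "tf")

    def __init__(self, fq, t0, tf):
        self.fq = fq
        self.t0 = t0
        self.tf = tf

# A namedtuple alternative: lightweight, immutable and hashable
SignalTuple = namedtuple("SignalTuple", ["fq", "t0", "tf"])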

Second, removing items from an array is a costly operation that involves moving every item after it. This applies to Python's lists when using del or remove; the latter is even worse because it has to find the item first, so you end up reading the whole array and rewriting part of it for every item you remove. At that point it's better to build a copy containing the items you keep. Another option is to replace the irrelevant items with a placeholder such as None, an operation that doesn't require moving other entries.
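A small sketch of the placeholder idea (my own illustration, not code from the question):

# Mark already-computed entries in place; no elements need to shift
for i, s in enumerate(signals):
    if is_computed(files, s):
        signals[i] = None

# Compact once at the end (or simply skip None while iterating later)
signals = [s for s in signals if s is not None]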

Third, it is frequently more efficient to not build your collections at all. Consider:

def is_computed(files, s):
    possibilities = list()
    for elt in itertools.permutations(s):
        possibilities.append(file_namer(s))

    if any([name in files for name in possibilities]):
        return True
    else:
        return False

In this code, you construct a (likely large) list named possibilities, grow it by consuming a permutations iterator in a for loop and calling file_namer for each item (not even passing that item!), then build another list of whether each possibility was already in files, and finally apply any() to that list for a result. That's at least two passes over the entire collection of possibilities for an answer that might have only needed to inspect one. I'm not sure the first loop even needs to exist, and the list comprehension should certainly be a generator expression to allow the any function to shortcut. So, assuming there are no side effects hidden in file_namer etc, we could simplify the entire function to:

def is_computed(files, s):
    return file_namer(s) in files

But if file_namer(s) should really be file_namer(elt), as I would expect, it should be:

def is_computed(files, s):
    return any(file_namer(elt) in files
               for elt in itertools.permutations(s))

Another concern, since we're looking at repeated in tests for files, is that we should probably make sure it's a set, dict or other type with quick membership tests. This would be the point where Dan D's suggestion of sorting instead of repeatedly generating permutations applies. For instance, you could have an index from the lowest-valued (sorted) permutation to the actual object, stored in a dictionary. If for some reason you can't make the keys hashable, you might be able to use binary searches if they're sortable.
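A rough sketch of that kind of index (the names index, query, key and sorted_keys are mine; the structure of signals is taken from the question):

import bisect

# Map each group's sorted frequency tuple to the group itself; every
# permutation of the same frequencies collapses onto one key.
index = {tuple(sorted(s.fq for s in group)): group for group in signals}

query = signals[0]                                  # any candidate group
key = tuple(sorted(s.fq for s in query))
already_seen = key in index                         # O(1) average with a dict

# Fallback if the keys were sortable but not hashable: binary search over a
# sorted list of keys.
sorted_keys = sorted(index)
i = bisect.bisect_left(sorted_keys, key)
already_seen = i < len(sorted_keys) and sorted_keys[i] == key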

That's what came to mind at the moment. I haven't read thoroughly.
