
Remove elements from lists based on condition

I have the following code:

from collections import defaultdict
import pandas as pd

THRESHOLD = 3 

item_counts = defaultdict(int)

df = {'col1':['1 2 3 4 5 6 7', '1 3 6 7','2 6 7']}
lines = pd.DataFrame(data=df)

print(lines)

for line in lines['col1']:
    for item in line.split():
        item_counts[item] += 1

print(item_counts)         
for line in lines['col1']:
    for item in line.split():
        if item_counts[item] < THRESHOLD:
            del item

print(lines)

My goal is to count every item and then remove the items whose count is below the threshold from my DataFrame. In this case, only 6 and 7 should be kept and everything else should be removed. The defaultdict counting works fine, but the deletion of items does not.

Do you know what I am doing wrong?

Using del is not the same as removing an element from a list. Consider the following example:

>>> x=1
>>> y=2
>>> lst = [x,y]
>>> del x
>>> print(lst)
[1, 2]
>>> lst.remove(x)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
NameError: name 'x' is not defined
>>> lst.remove(y)
>>> print(lst)
[1]
>>> print(y)
2

As you can see, using del on a variable that refers to an element of the list only deletes the variable (the name); the list is left as it was. remove does the opposite: it removes the element from the list but does not delete the variable.
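The same thing happens inside the loop from the question. A minimal sketch, using a made-up list named items for illustration: del item unbinds the loop variable only, so the list it came from is untouched.

```python
items = ['1', '2', '6', '7']  # hypothetical sample data

for item in items:
    if item in ('1', '2'):
        # This only removes the name `item` from the current scope;
        # the list still holds its own reference to the same string.
        del item

print(items)  # ['1', '2', '6', '7']
```

In the question there is a second reason nothing changes: line.split() builds a temporary list, so even a real removal from that list would not reach the string stored in the DataFrame.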

As for fixing the problem: you should not remove elements from a list while iterating over it.
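To see why, here is a small sketch with a hypothetical list named numbers. Removing an element shifts the remaining ones to the left, so the loop steps over the element that moved into the freed slot.

```python
numbers = [1, 2, 2, 3]  # hypothetical sample data

for n in numbers:
    if n == 2:
        # Mutates the very list the loop is walking over.
        numbers.remove(n)

# One of the two 2s survives because the loop skipped it.
print(numbers)  # [1, 2, 3]
```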

IMO the best fix is to use a list comprehension to build a new list containing only the wanted elements and replace the old one with it:

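for line in lines['col1']:
    # keep only the items that were counted at least THRESHOLD times
    # (this rebinds `line` only; store the result to change the DataFrame)
    line = [item for item in line.split() if item_counts[item] >= THRESHOLD]
    # line = ' '.join(line)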

PS: I added the commented-out line in case you want to turn the list back into a string.
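Putting the pieces together, one possible end-to-end sketch looks like this. The helper name filter_rare_items is made up for illustration, and it uses Counter instead of the defaultdict from the question. It returns new strings rather than modifying anything in place.

```python
from collections import Counter


def filter_rare_items(lines, threshold):
    """Return new strings keeping only items seen at least `threshold` times.

    `lines` must be re-iterable (a list or a pandas Series, for example),
    because it is walked once to count and once to filter.
    """
    counts = Counter(item for line in lines for item in line.split())
    return [
        ' '.join(item for item in line.split() if counts[item] >= threshold)
        for line in lines
    ]


sample = ['1 2 3 4 5 6 7', '1 3 6 7', '2 6 7']
filtered = filter_rare_items(sample, 3)
print(filtered)  # ['6 7', '6 7', '6 7']
```

With the DataFrame from the question, assigning the result back, as in lines['col1'] = filter_rare_items(lines['col1'], THRESHOLD), should replace the column, since a list of matching length can be assigned to a DataFrame column.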

If you don't need a DataFrame (I don't see why you would for this), you can do this:

from collections import Counter

THRESHOLD = 3
lines = {'col1':['1 2 3 4 5 6 7', '1 3 6 7','2 6 7']}

# make proper list of ints
z = {k: [[int(x) for x in v.split()] for v in vals] for k, vals in lines.items()}
print(z)
# {'col1': [[1, 2, 3, 4, 5, 6, 7], [1, 3, 6, 7], [2, 6, 7]]}

# count the items within each value of the dict
z = {k: Counter(x for vals in arr for x in vals) for k, arr in z.items()}
print(z)
# {'col1': Counter({6: 3, 7: 3, 1: 2, 2: 2, 3: 2, 4: 1, 5: 1})}

# select the items that are seen at least THRESHOLD times
z = {col: [k for k, v in cnt.items() if v >= THRESHOLD] for col, cnt in z.items()}
print(z)
# {'col1': [6, 7]}
