
Keep exact words in the list and remove others

I have a list a of lists of strings, and another list b that contains some strings. From each list in a, I want to keep only the strings that appear in list b and remove every string that does not.

For example:

list_a = [['a','a','a','b','b','b','g','b','b','b'],['c','we','c','c','c','c','c','a','b','a','b','a','b','a','b']]
list_b = ['a']

The result I expect is:

Get list_a like this: [['a','a','a'],['a','a','a','a']]

However, when I run my code:

data = [['a','a','a','b','g','b'],['we','c','a','b','a','a','b','a','b']]
keep_words = ['a']
for document in data:
    print('######')
    for word in document:
        print(word)
        if word in keep_words:
            document.remove(word)
            print(document)
print('#####')
print(data)

I get this result:

line 1:######
line 2:a
line 3:['a', 'a', 'b', 'g', 'b']
line 4:a
line 5:['a', 'b', 'g', 'b']
line 6:g
line 7:b
line 8:######
line 9:we
line 10:c
line 11:a
line 12:['we', 'c', 'b', 'a', 'a', 'b', 'a', 'b']
line 13:a
line 14:['we', 'c', 'b', 'a', 'b', 'a', 'b']
line 15:b
line 16:a
line 17:['we', 'c', 'b', 'b', 'a', 'b']
line 18:#####
line 19:[['a', 'b', 'g', 'b'], ['we', 'c', 'b', 'b', 'a', 'b']]

So I am confused: why does line 6 print the word 'g' rather than the word 'a'? Line 5 shows the list ['a', 'b', 'g', 'b'], so I expected the next iteration of the for loop to pick up the 'a' at the beginning of that list.

Could anyone tell me why this happens and how to solve my problem? Thank you very much!


Never remove elements from a list while iterating over it. Here is a solution to your problem that replaces each sub-list with the desired (filtered) result:

data = [['a','a','a','b','g','b'],['we','c','a','b','a','a','b','a','b']]
keep_words = ['a']

for i in range(len(data)):
  data[i] = [d for d in data[i] if d in keep_words] # only keep desired data

print(data) # ==> [['a', 'a', 'a'], ['a', 'a', 'a', 'a']]
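If you would rather keep the shape of the original for loop, one possible variation (a sketch, not the only way to do it) is to assign the filtered result back into each sub-list with slice assignment. The loop never calls remove(), so nothing shifts under the iterator. It also flips the test: the loop in the question removed the words that are in keep_words instead of keeping them.

```python
data = [['a', 'a', 'a', 'b', 'g', 'b'],
        ['we', 'c', 'a', 'b', 'a', 'a', 'b', 'a', 'b']]
keep_words = ['a']

for document in data:
    # Build the filtered list first, then overwrite the contents of the
    # same list object; the outer list `data` is never resized.
    document[:] = [word for word in document if word in keep_words]

print(data)  # [['a', 'a', 'a'], ['a', 'a', 'a', 'a']]
```

Because document[:] = ... changes the existing list objects, any other variable that refers to the same sub-lists sees the filtered contents too.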

As mentioned in the comments, if you mutate a list while iterating over it, you will run into this kind of side effect.
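To see why 'g' is printed on line 6 of the question's output, it can help to write the loop with an explicit position counter. The helper below, trace_remove_while_iterating, is a hypothetical name made up for this illustration. It is only a sketch of how a for loop walks a list by position, not the interpreter's actual code.

```python
def trace_remove_while_iterating(document, remove_words):
    """Mimic `for word in document: document.remove(word)` with an explicit index."""
    document = list(document)  # work on a copy so the caller's list is untouched
    visited = []               # words the loop actually looks at
    index = 0
    while index < len(document):
        word = document[index]
        visited.append(word)
        if word in remove_words:
            # remove() deletes the first match and shifts later items left
            document.remove(word)
        # the loop position advances either way, so a shifted item is skipped
        index += 1
    return visited, document


visited, remaining = trace_remove_while_iterating(
    ['a', 'a', 'a', 'b', 'g', 'b'], ['a'])
print(visited)    # ['a', 'a', 'g', 'b']
print(remaining)  # ['a', 'b', 'g', 'b']
```

After two removals the list is ['a', 'b', 'g', 'b'], but the loop is already at position 2, which now holds 'g'. The remaining 'a' sits at position 0, which the loop has already passed, so it is never visited.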

An alternative solution is to take advantage of Python's fast and readable list comprehensions:

In [33]: [[a for a in l if a in list_b] for l in list_a]
Out[33]: [['a', 'a', 'a'], ['a', 'a', 'a', 'a']]

Note that as list_b grows in size, you might want to use a set, which is much faster than a list for membership checks. A set also discards any duplicate entries:

In [52]: import random

In [73]: import string

In [74]: keep_s = set(['a', 'b', 'e'])

In [75]: keep_l = ['a', 'b', 'e']

# Create a random document -- random choice of 'a'-'f' between 1-100 times
In [78]: def rand_doc():
    ...:     return [random.choice(string.ascii_lowercase[:6]) for _ in range(random.randint(1,100))]
    ...:

# Create 1000 random documents
In [79]: docs = [rand_doc() for _ in range(1000)]

In [80]: %timeit [[word for word in doc if word in keep_l] for doc in docs]
4.39 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [81]: %timeit [[word for word in doc if word in keep_s] for doc in docs]
3.16 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
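Putting the two ideas together, here is a minimal sketch of a reusable helper. The name keep_only is an assumption for illustration and does not come from the answers above. It converts the keep list to a set once and then filters every document with a nested list comprehension.

```python
def keep_only(documents, keep_words):
    """Return new lists containing only the words found in keep_words."""
    keep = set(keep_words)  # build the set once; membership checks are then fast
    return [[word for word in doc if word in keep] for doc in documents]


list_a = [['a', 'a', 'a', 'b', 'b', 'b', 'g', 'b', 'b', 'b'],
          ['c', 'we', 'c', 'c', 'c', 'c', 'c', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']]
list_b = ['a']

result = keep_only(list_a, list_b)
print(result)  # [['a', 'a', 'a'], ['a', 'a', 'a', 'a']]
```

This version returns new lists and leaves list_a unchanged, which avoids the mutation problem altogether.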
