
Remove a list of strings from a column of lists of strings

It's hard to describe the problem, but I have a dataframe with a column of tokenized strings, and I want to remove the most common words from it. I built a list of the most common words and took its tail, but I don't know how to use this list to remove the words from the main column.

The column looks like this:

df['tokenized']

{'dog', 'cat', 'fish'}

{'car', 'dog', 'water'}

{'blue', 'red', 'green'}

Each row is a list of strings.

If the list of words I want to remove is {'dog', 'cat'}, the desired output is:

df['tokenized']

{'fish'}

{'car', 'water'}

{'blue', 'red', 'green'}

Any help with that?

You can do it this way:

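tokenized = [['dog', 'cat', 'fish'], ['car', 'dog', 'water'], ['blue', 'red', 'green']]
most_common_words = ['cat', 'dog']
for tokens in tokenized:
    for word in most_common_words:
        try:
            # list.remove() deletes only the first occurrence of word
            tokens.remove(word)
        except ValueError:
            # word is not in this row; nothing to remove
            pass
print(tokenized)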

# output:
# [['fish'], ['car', 'water'], ['blue', 'red', 'green']]
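One caveat with the loop above: list.remove() drops only the first match and changes the lists in place, so a row that contains the same word twice would keep the second copy. A minimal sketch of an alternative, assuming rows may contain duplicates (the extra 'dog' in the first row is made up for illustration), builds new lists with a comprehension instead:

```python
# Sample rows; the repeated 'dog' in the first row is a hypothetical case.
tokenized = [['dog', 'cat', 'fish', 'dog'], ['car', 'dog', 'water'], ['blue', 'red', 'green']]
most_common_words = {'cat', 'dog'}

# Build new lists rather than mutating the originals;
# every occurrence of a common word is filtered out.
filtered = [
    [word for word in tokens if word not in most_common_words]
    for tokens in tokenized
]
print(filtered)
```

The original rows in tokenized are left untouched, which may matter if the unfiltered tokens are needed later.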

Try this:

mcw = {'dog', 'cat'}
df['tokenized'] = df['tokenized'].apply(
    lambda lst: [word for word in lst if word not in mcw]
)

Use a set for the most common words rather than a list: a membership test on a set takes constant time on average, while the same test on a list has to scan the elements one by one.
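For anyone who wants to run this end to end, here is a small self-contained sketch. The DataFrame is rebuilt from the sample rows in the question, so its contents are an assumption about what the real data looks like:

```python
import pandas as pd

# Hypothetical sample data mirroring the rows shown in the question.
df = pd.DataFrame({
    'tokenized': [
        ['dog', 'cat', 'fish'],
        ['car', 'dog', 'water'],
        ['blue', 'red', 'green'],
    ]
})

# Words to drop, stored in a set for fast membership checks.
mcw = {'dog', 'cat'}

# Replace each row's list with a filtered copy.
df['tokenized'] = df['tokenized'].apply(
    lambda lst: [word for word in lst if word not in mcw]
)
print(df['tokenized'].tolist())
```

If the most common words come as a list, wrapping it in set(...) once before the apply call should be enough.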
