It's hard to describe the problem, but I have a dataframe with a column of tokenized strings, and I want to remove the most common words from it. I already built the list of most common words (I took the tail of the frequency counts), but I don't know how to use this list to remove those words from the main column.
The column looks like this:
df['tokenized']
{'dog', 'cat', 'fish'}
{'car', 'dog', 'water'}
{'blue', 'red', 'green'}
Each row is a list of strings.
If the words I want to remove are {'dog', 'cat'}, the desired output is:
df['tokenized']
{'fish'}
{'car', 'water'}
{'blue', 'red', 'green'}
Any help with that?
You can do it this way:

tokenized = [['dog', 'cat', 'fish'], ['car', 'dog', 'water'], ['blue', 'red', 'green']]
most_common_words = ['cat', 'dog']

for l in tokenized:
    for w in most_common_words:
        try:
            l.remove(w)  # remove() raises ValueError when w is absent
        except ValueError:
            pass

print(tokenized)
# output:
# [['fish'], ['car', 'water'], ['blue', 'red', 'green']]

Note that list.remove() only deletes the first occurrence, which is fine here because no word repeats within a row.
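If the rows really are Python sets, as the {'dog', 'cat', 'fish'} display in the question suggests, set difference does the same job in one operation per row (a sketch using the same sample data and names as above):

tokenized = [{'dog', 'cat', 'fish'}, {'car', 'dog', 'water'}, {'blue', 'red', 'green'}]
most_common_words = {'cat', 'dog'}

# Subtracting a set removes every unwanted word from each row at once.
tokenized = [s - most_common_words for s in tokenized]

print(tokenized)
# [{'fish'}, {'car', 'water'}, {'blue', 'red', 'green'}]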
Try this:

mcw = {'dog', 'cat'}
df['tokenized'] = df['tokenized'].apply(
    lambda lst: [word for word in lst if word not in mcw]
)

You should use a set for the most common words, not a list, because a membership test (word not in mcw) is much faster on a set.
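Since the question also mentions building the list of most common words in the first place, here is an end-to-end sketch that derives it from the column with collections.Counter and then applies the filter. The column name 'tokenized' is taken from the question; the cutoff of 2 words is hypothetical:

import pandas as pd
from collections import Counter

df = pd.DataFrame({
    "tokenized": [["dog", "cat", "fish"],
                  ["car", "dog", "water"],
                  ["blue", "red", "green"]],
})

# Count every word across all rows.
counts = Counter(word for row in df["tokenized"] for word in row)

# Take the N most frequent words as a set (hypothetical cutoff).
n_most_common = 2
mcw = {word for word, _ in counts.most_common(n_most_common)}

# Filter each row, preserving the word order within the row.
df["tokenized"] = df["tokenized"].apply(
    lambda lst: [word for word in lst if word not in mcw]
)

Adjust n_most_common (or filter counts by a minimum frequency instead) to match however you chose the "tail" of your word list.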