It's hard to describe the problem, but I have a dataframe with a column of tokenized strings, and I want to remove the most common words from it. I already built the list of most common words (I took the tail of the frequency counts), but I don't know how to use this list to remove those words from the main column.
The column looks like this:
df['tokenized']
{'dog', 'cat', 'fish'}
{'car', 'dog', 'water'}
{'blue', 'red', 'green'}
Each row is a list of strings.
If the words I want to remove are {'dog', 'cat'}, the desired output is:
df['tokenized']
{'fish'}
{'car', 'water'}
{'blue', 'red', 'green'}
Any help with that?
You can do it this way:

tokenized = [['dog', 'cat', 'fish'], ['car', 'dog', 'water'], ['blue', 'red', 'green']]
most_common_words = ['cat', 'dog']

for l in tokenized:
    for w in most_common_words:
        try:
            l.remove(w)  # remove() raises ValueError when w is absent
        except ValueError:
            pass

print(tokenized)
# output:
# [['fish'], ['car', 'water'], ['blue', 'red', 'green']]

Note that list.remove() only deletes the first occurrence, which is fine here because no word repeats within a row.
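If the rows really are Python sets, as the {'dog', 'cat', 'fish'} display in the question suggests, set difference does the same job in one operation per row (a sketch using the same sample data and names as above):

tokenized = [{'dog', 'cat', 'fish'}, {'car', 'dog', 'water'}, {'blue', 'red', 'green'}]
most_common_words = {'cat', 'dog'}

# Subtracting a set removes every unwanted word from each row at once.
tokenized = [s - most_common_words for s in tokenized]

print(tokenized)
# [{'fish'}, {'car', 'water'}, {'blue', 'red', 'green'}]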
Try this:

mcw = {'dog', 'cat'}
df['tokenized'] = df['tokenized'].apply(
    lambda lst: [word for word in lst if word not in mcw]
)

You should use a set for the most common words, not a list, because a membership test (word not in mcw) is much faster on a set.
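Since the question also mentions building the list of most common words in the first place, here is an end-to-end sketch that derives it from the column with collections.Counter and then applies the filter. The column name 'tokenized' is taken from the question; the cutoff of 2 words is hypothetical:

import pandas as pd
from collections import Counter

df = pd.DataFrame({
    "tokenized": [["dog", "cat", "fish"],
                  ["car", "dog", "water"],
                  ["blue", "red", "green"]],
})

# Count every word across all rows.
counts = Counter(word for row in df["tokenized"] for word in row)

# Take the N most frequent words as a set (hypothetical cutoff).
n_most_common = 2
mcw = {word for word, _ in counts.most_common(n_most_common)}

# Filter each row, preserving the word order within the row.
df["tokenized"] = df["tokenized"].apply(
    lambda lst: [word for word in lst if word not in mcw]
)

Adjust n_most_common (or filter counts by a minimum frequency instead) to match however you chose the "tail" of your word list.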