Removing Custom-Defined Words from List (Part II) - Python
This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python
I have a df as such:
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
And another df, which will act as the reference df (see more below):
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})
<OUT>
ID tag
1 flower
2 people
3 chocolate
I'm trying to remove values in a list in a df based on the values of another df. The output I wish to get is:
<OUT> df
PageNumber new_tags
175 flower people
576 chocolate
I've tried the inner join method (Filtering the dataframe based on the column value of another dataframe), but unfortunately had no luck.
So I have resorted to tokenizing all tags in both of the df columns, looping through each, and retaining only the values that appear in the reference df. Currently, it returns empty lists...
df['tokenised_new_tags'] = filtered_new["new_tags"].astype(str).apply(nltk.word_tokenize)
topic_words['tokenised_top_words']= topic_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in topic_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]
Any help is much appreciated - thanks!
How about this:
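def remove_custom_words(phrase, words_to_remove_list):
    # Split the phrase on spaces and drop every word found in words_to_remove_list
    return [elem for elem in phrase.split(' ') if elem not in words_to_remove_list]
# x[0] is the single string stored in each row's list
df['new_tags'] = df['new_tags'].apply(lambda x: remove_custom_words(x[0], top_words['tag'].to_list()))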
Basically, I am applying the remove_custom_words function to each row of the dataset. The function splits the phrase on spaces and filters out the words contained in top_words['tag'].
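Note that the desired output in the question keeps the words found in top_words rather than removing them, and drops rows left with no tags. Here is a minimal sketch of that variant, assuming the tag column holds one word per row (as in the printed top_words output). The names keep_allowed_words and allowed are hypothetical and do not come from the original posts. It also sidesteps a likely cause of the empty lists in the question: the in operator on a pandas Series tests the index labels, not the values.

```python
import pandas as pd

df = pd.DataFrame({
    'PageNumber': [175, 162, 576],
    'new_tags': [['flower architecture people'],
                 ['hair red bobbles'],
                 ['sweets chocolate shop']],
})
# Assumption: one reference word per row, matching the printed output.
top_words = pd.DataFrame({'ID': [1, 2, 3],
                          'tag': ['flower', 'people', 'chocolate']})

# Build a set from the column values for membership tests.
# `word in top_words['tag']` would check the index (0, 1, 2) instead.
allowed = set(top_words['tag'])


def keep_allowed_words(tags, allowed_words):
    """Hypothetical helper: keep only the words present in allowed_words."""
    return [word
            for phrase in tags          # each row holds a list of phrases
            for word in phrase.split()  # split each phrase on whitespace
            if word in allowed_words]


result = df.copy()
result['new_tags'] = result['new_tags'].apply(
    lambda tags: keep_allowed_words(tags, allowed))

# Drop rows where no tag survived, then join the remaining words.
result = result[result['new_tags'].apply(len) > 0].reset_index(drop=True)
result['new_tags'] = result['new_tags'].apply(' '.join)
print(result)
```

To remove the reference words instead, as in the answer above, the condition would become word not in allowed_words.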