Removing Custom-Defined Words from List (Part II) - Python

This is a continuation of my previous thread: Removing Custom-Defined Words from List - Python

I have a df like this:

import pandas as pd
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})

<OUT>
PageNumber   new_tags
   175       flower architecture people...
   162       hair red bobbles...
   576       sweets chocolate shop...

And another df, which will act as the reference df (see more below):

top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

<OUT>
   ID      tag
   1       flower
   2       people
   3       chocolate

I'm trying to filter the values in a list in one df based on the values in another df, keeping only the words that appear in the reference df. The output I want is:

<OUT> df
PageNumber   new_tags
   175       flower people
   576       chocolate

I've tried the inner join method from "Filtering the dataframe based on the column value of another dataframe", but unfortunately had no luck.

So I have resorted to tokenizing all the tags in both df columns, looping through each one, and retaining only the values that appear in the reference df. Currently, it returns empty lists...

df['tokenised_new_tags'] = filtered_new["new_tags"].astype(str).apply(nltk.word_tokenize)
topic_words['tokenised_top_words']= topic_words['tag'].astype(str).apply(nltk.word_tokenize)
df['top_word_tokens'] = [[t for t in tok_sent if t in topic_words['tokenised_top_words']] for tok_sent in df['tokenised_new_tags']]

Any help is much appreciated - thanks!

How about this:

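def remove_custom_words(phrase, words_to_remove_list):
    # Split the phrase on spaces and drop every word found in words_to_remove_list
    return [elem for elem in phrase.split(' ') if elem not in words_to_remove_list]


# Each cell is a one-element list, so x[0] is the space-separated tag string
df['new_tags'] = df['new_tags'].apply(lambda x: remove_custom_words(x[0], top_words['tag'].to_list()))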

Basically, I am applying the remove_custom_words function to each row of the dataset. It filters out the words contained in top_words['tag'].
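
The desired output in the question keeps only the reference words and drops rows that have none, which is the inverse of the removal shown here. Below is a minimal sketch of that variant, assuming top_words['tag'] holds one word per row; the helper name keep_reference_words is hypothetical. Note also that `in` applied to a pandas Series tests the index labels rather than the values, which would explain the empty lists in the question, so the sketch builds a plain set from the column first.

```python
import pandas as pd

# Sample data from the question: each cell is a one-element list
# holding a single space-separated string of tags
df = pd.DataFrame({
    'PageNumber': [175, 162, 576],
    'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']],
})
top_words = pd.DataFrame({'ID': [1, 2, 3], 'tag': ['flower', 'people', 'chocolate']})

# A set tests membership against the column's values (not its index)
reference = set(top_words['tag'])


def keep_reference_words(cell, reference_words):
    # Hypothetical helper: keep only the words that appear in reference_words
    return ' '.join(word for word in cell[0].split() if word in reference_words)


df['new_tags'] = df['new_tags'].apply(lambda cell: keep_reference_words(cell, reference))

# Drop rows left with no tags at all (PageNumber 162 in this sample)
df = df[df['new_tags'] != ''].reset_index(drop=True)
print(df)
```

With the sample data this should leave PageNumber 175 with "flower people" and 576 with "chocolate". Swapping `in` for `not in` inside the helper gives the removal behaviour instead.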
