
Trying to remove matching strings from a list

I have two lists of strings: one holds the words of a sentence and the other holds names. See the comments in the code for how they are formatted.

I am trying to go through a column in a DataFrame and remove all the names from each sentence.

The sentences appear to be unchanged after calling the function.

with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())
NAMES = [name.lower() for name in NAMES]

def remove_names(df, col, NAMES):
    for idx in range(df.shape[0]):
        print("\r", idx, df.shape[0], idx/df.shape[0], end="\r")
        # your list of texts
        texts=df[col][idx]
        #texts looks like
        #['explain', 'decided', 'make', 'coverage', 'area', 'rubbish', 'online', 'checker', 'correct', 'sky', 'account', 'connection']
        holder_list = []
        for word in texts:
            #NAMES looks like
            # ['pascha', 'lang', 'desaray', 'camielle', 'marquasha', 'trasha', 'shaquila',...
            for name in NAMES:
                if name == word or name == word + "'s":
                    continue
                else:
                    holder_list.append(word)
        df[col][idx] = holder_list.copy()
    return df[col]
df_norm['Full Text'] = remove_names(df_norm, 'Full Text', NAMES)

I updated your remove_names function:

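def remove_names(df_list, NAMES):
    # Keep only the words that are not in NAMES.
    new_list = [x for x in df_list if x not in NAMES]
    return new_list


# args is a tuple of extra positional arguments, hence the trailing comma.
df_norm['Full Text'] = df_norm['Full Text'].apply(remove_names, args=(NAMES,))

print(df_norm)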
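As for why the loop in the question does not produce the expected result, here is a minimal sketch with made-up data (the three names and three words below are assumptions for illustration). The inner loop appends the word once for every name that does not match it, so names are never fully dropped and ordinary words are duplicated. A single membership test per word avoids this.

```python
names = ['alice', 'bob', 'carol']   # hypothetical names
words = ['alice', 'gives', 'bob']   # hypothetical row

# Same nested-loop logic as in the question.
holder = []
for word in words:
    for name in names:
        if name == word or name == word + "'s":
            continue
        holder.append(word)  # runs once per non-matching name

print(holder)
# ['alice', 'alice', 'gives', 'gives', 'gives', 'bob', 'bob']

# One membership test per word instead of one comparison per name.
name_set = set(names)
fixed = [word for word in words if word not in name_set]
print(fixed)
# ['gives']
```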

If you want to get rid of the remove_names function altogether, you can use a lambda function, which updates the column in one line of code instead:

df_norm['Full Text'] = df_norm['Full Text'].apply(lambda df_list: [x for x in df_list if x not in NAMES])



Note:

The code above assumes that your df_norm['Full Text'] column looks something like this:

Full Text
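The sample rows are not reproduced here. As a rough sketch only, assuming each cell holds a list of lowercase words as described in the question, a column of that shape could be built and cleaned as follows. The two rows and the two-name NAMES set are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: each cell is a list of lowercase words.
df_norm = pd.DataFrame({
    'Full Text': [
        ['explain', 'pascha', 'decided', 'make', 'coverage'],
        ['lang', 'online', 'checker', 'correct'],
    ]
})
NAMES = {'pascha', 'lang'}  # assumed stand-in for the names file

df_norm['Full Text'] = df_norm['Full Text'].apply(
    lambda df_list: [x for x in df_list if x not in NAMES]
)

print(df_norm['Full Text'].tolist())
# [['explain', 'decided', 'make', 'coverage'], ['online', 'checker', 'correct']]
```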



Since you repeatedly need to test whether a word is in NAMES, you should make NAMES a set rather than a list. Testing membership in a set is much faster than testing membership in a list.
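To get a feel for the difference, here is a small timing sketch. The 10,000 generated names are placeholders, and the exact timings depend on your machine, so no numbers are quoted here.

```python
import timeit

# Placeholder data: 10,000 made-up names.
names_list = [f"name{i}" for i in range(10_000)]
names_set = set(names_list)
word = "explain"  # a word that is not a name, the worst case for a list

# A list is scanned element by element; a set uses a hash lookup.
list_time = timeit.timeit(lambda: word in names_list, number=200)
set_time = timeit.timeit(lambda: word in names_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

Both containers give the same answer for every lookup; only the cost per lookup differs.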

You can use pandas' apply to apply a function to every element of a DataFrame column.

If a row of your dataframe is a list of words, you can implement the function to apply to every row like this:

def remove_names(list_of_words, set_of_names):
    return [word for word in list_of_words if word not in set_of_names]

# TEST:
print( remove_names(['Alice', 'gives', 'Bob', 'an', 'apple'], {'Alice', 'Bob'}) )
# ['gives', 'an', 'apple']
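The loop in the question also tried to catch possessive forms such as "alice's". One possible way to cover that, assuming the names are stored in lowercase, is sketched below. The helper is_name and the function remove_names_loose are hypothetical names introduced for this example only.

```python
def is_name(word, set_of_names):
    # Hypothetical helper: compare in lowercase and ignore a trailing "'s".
    base = word.lower()
    if base.endswith("'s"):
        base = base[:-2]
    return base in set_of_names


def remove_names_loose(list_of_words, set_of_names):
    # Same idea as remove_names, with the looser matching above.
    return [word for word in list_of_words if not is_name(word, set_of_names)]


result = remove_names_loose(["Alice's", 'apple', 'goes', 'to', 'bob'], {'alice', 'bob'})
print(result)
# ['apple', 'goes', 'to']
```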

If a row of your dataframe is a sentence, i.e. a single string of space-separated words, you can implement the function to apply to every row like this:

def remove_names(sentence, set_of_names):
    return ' '.join(word for word in sentence.split() if word not in set_of_names)

# TEST:
print( remove_names('Alice gives Bob an apple', {'Alice', 'Bob'}) )
# 'gives an apple'

And then apply it to a column of the dataframe:

import pandas as pd

df = pd.DataFrame({'id':[47, 28], 'sentence': ['Alice gives Bob an apple', 'An apple is given to Alice']})
df['nonames'] = df['sentence'].apply(remove_names, args=({'Alice', 'Bob'},))

print(df)
#    id                    sentence               nonames
# 0  47    Alice gives Bob an apple        gives an apple
# 1  28  An apple is given to Alice  An apple is given to


 