Removing custom words from a list - Python

Question

I have a dataframe column that looks like:

I'm looking into removing special characters. I'm hoping to collect the tags (as a list of lists) so that I can append the column to an existing df.

This is what I have so far, but it doesn't seem to work. Regex in particular is causing me a lot of pain, as it always raises "expected string or bytes-like object".

df = pd.read_csv('flickr_tags_participation_inequality_omit.csv')
#df.dropna(inplace=True) and tokenise
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)

filter_words = ['.',',',':',';','?','@','-','...','!','=', 'edinburgh', 'ecosse', 'écosse', 'scotland']
filtered = [i for i in tokens if i not in filter_words]
#filtered = [re.sub("[.,!?:;-=...@#_]", '', w) for w in tokens]
#the above line didn't work


tokenised_tags= []
for i in filtered:
    tokenised_tags.append(i) #this turns the single lists of tags into lists of lists
print(tokenised_tags)

The above code doesn't remove the custom-defined stopwords.

Any help is very much appreciated! Thanks!

Answer 1

You need to use:

df['filtered'] = df['tags'].apply(lambda x: [t for t in nltk.word_tokenize(x) if t not in filter_words])

Note that nltk.word_tokenize(x) outputs a list of strings, so you can apply a regular list comprehension to it.
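To see why the code in the question leaves the stopwords in place, here is a minimal sketch using only the standard library. The `rows` list is a hand-written stand-in for what `df["tags"].apply(nltk.word_tokenize)` would produce (the sample tags are made up), and `filter_set`, `unfiltered`, and `regex_error` are names introduced just for this illustration. It should behave the same way on the real tokenised column, but treat it as a sketch rather than a drop-in fix.

```python
import re

# Stand-in for df["tags"].apply(nltk.word_tokenize): one list of tokens per row.
# These sample tags are invented for the illustration.
rows = [
    ["edinburgh", ",", "castle", "scotland"],
    ["écosse", "!", "loch", "ness"],
]

filter_words = ['.', ',', ':', ';', '?', '@', '-', '...', '!', '=',
                'edinburgh', 'ecosse', 'écosse', 'scotland']
filter_set = set(filter_words)  # a set makes each membership test cheap

# What the question does: each `row` is a whole list of tokens, and a list
# is never equal to any single filter word, so every row is kept unchanged.
unfiltered = [row for row in rows if row not in filter_words]

# What the answer does: filter the tokens inside each row, keeping one
# list per row (a list of lists).
filtered = [[tok for tok in row if tok not in filter_set] for row in rows]

# re.sub works on a single string. Passing a whole row (a list) is what
# triggers the "expected string or bytes-like object" TypeError.
try:
    re.sub(r"[.,!?]", "", rows[0])
    regex_error = None
except TypeError as exc:
    regex_error = str(exc)

print(unfiltered)
print(filtered)
print(regex_error)
```

If you still want the regex route, apply `re.sub` to each token rather than to the row, and put the hyphen last in the character class (for example `[.,!?:;=@#_-]`), since `;-=` in the middle of a class is read as a character range.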

Answer 1 (accepted), 2022-04-30 21:01:15
