
Python - Remove Special Characters from list

I have a list of words and I want to remove all special characters and numbers. Here is what I came up with:

INPUT: #convert all words to lowercase

words = [word.lower() for word in words]
print(words[:100])

OUTPUT:

['rt', '@', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'the', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']

INPUT:

words_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", i) for i in words]

print(words_cleaned[:100])

OUTPUT:

I end up with an empty list: []

What I need is for tokens like '@' to be removed, and for a token like '@test' to turn into 'test'. Any ideas?

If you want to remove all non-letter characters, try:

words = ["".join(filter(lambda c: c.isalpha(), word)) for word in words]
print(words)
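Note that stripping characters this way leaves an empty string behind for tokens such as '@' that contain no letters at all. A minimal sketch of one way to drop those as well, assuming a hypothetical helper named `keep_letters` (not part of the original answer):

```python
def keep_letters(words):
    """Strip non-letter characters from each word and drop words that end up empty."""
    # str.isalpha is Unicode-aware, so accented letters are kept too.
    stripped = ("".join(filter(str.isalpha, word)) for word in words)
    return [word for word in stripped if word]

sample = ["rt", "@", "@test", "elections.", "5:", "mail-in"]
cleaned = keep_letters(sample)
print(cleaned)  # ['rt', 'test', 'elections', 'mailin']
```

Here '@' and '5:' disappear from the list entirely, while '@test' becomes 'test'.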

You can use built-in shortcuts rather than having to specify all of the special characters. Here's a way to remove everything but letters:

import re

inp = ['rt', '@', 'dark', 'money', 'has', 'played', 'a', '#significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&amp;', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']

outp = [re.sub(r"[^A-Za-z]+", '', s) for s in inp]

print(outp)

Result:

['rt', '', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections', 'httpsezq', 'rt', '', 'notice', 'women', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve', 'resist', 'hysteria', 'amp', 'think', 'rt', '', 'oct', '', 'last', 'day', 'to', 'register', 'to', 'vote', 'oct', '', 'early', 'voting', 'starts', 'oct', '', 'last', 'day', 'to', 'request', 'a', 'mailin', 'ballot', 'nov', '', 'election', 'day', 'rt', '']

The ^ character at the start of a [] pair means "match every character NOT mentioned in the set that follows". \w means "word characters" (letters, digits, and the underscore), so [^\w]+ says "match everything but word characters", while [^A-Za-z]+ says "match everything but ASCII letters". The nice thing about using a regular expression is that you can get arbitrarily precise as to just which characters you want to include or exclude.
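To make the difference between the two negated sets concrete, here is a small sketch that runs both over the same tokens. The variable names and the sample tokens are illustrative only:

```python
import re

sample = ["@test", "oct_5:", "mail-in", "https://e85zq"]

non_word = re.compile(r"[^\w]+")        # anything except letters, digits, underscore
non_letter = re.compile(r"[^A-Za-z]+")  # anything except ASCII letters

word_chars_only = [non_word.sub("", s) for s in sample]
letters_only = [non_letter.sub("", s) for s in sample]

print(word_chars_only)  # ['test', 'oct_5', 'mailin', 'httpse85zq']
print(letters_only)     # ['test', 'oct', 'mailin', 'httpsezq']
```

The \w version keeps digits and underscores, so it is not enough if numbers must go as well.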

No need to slice the result with [:100] to print it. Just print it as is, like I do. I assume that by using 100, you want to make sure you go to the end of the list. The better way to do that is to just leave that component blank. So [:] means "take a slice that is the full list", and [5:] means "take from the 6th element to the end of the list".
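A quick illustration of those slice forms on a short list (the list `items` is made up for the example):

```python
items = ["a", "b", "c", "d", "e", "f", "g"]

whole_via_big_stop = items[:100]  # a stop index past the end is allowed
whole_copy = items[:]             # shallow copy of the whole list
tail = items[5:]                  # from the 6th element to the end
head = items[:3]                  # the first three elements

print(whole_via_big_stop)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(whole_copy)          # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(tail)                # ['f', 'g']
print(head)                # ['a', 'b', 'c']
```

Slicing with [:100] is still useful when the goal is to cap the output at the first 100 elements of a longer list.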

UPDATE: I just noticed that you said you don't want numbers in the result, so I guess you just want letters. I changed the expression to do that. This is what's nice about a regular expression: you can tweak just what gets replaced without adding extra calls, loops, etc., simply by changing a string value.
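As a rough sketch of that idea, the pattern can be passed in as a plain string so that only the string changes between runs. The function name `clean_words` and its default are assumptions for illustration, and the function also drops tokens that become empty, which the code above does not do:

```python
import re

def clean_words(words, pattern=r"[^A-Za-z]+"):
    """Remove every match of `pattern` from each word, then drop empty results."""
    regex = re.compile(pattern)
    cleaned = (regex.sub("", word) for word in words)
    return [word for word in cleaned if word]

sample = ["rt", "@", "@test", "5:", "mail-in", "vote."]

letters = clean_words(sample)                                  # letters only
letters_digits = clean_words(sample, r"[^A-Za-z0-9]+")         # letters and digits
letters_digits_hyphen = clean_words(sample, r"[^A-Za-z0-9-]+") # also keep hyphens

print(letters)                # ['rt', 'test', 'mailin', 'vote']
print(letters_digits)         # ['rt', 'test', '5', 'mailin', 'vote']
print(letters_digits_hyphen)  # ['rt', 'test', '5', 'mail-in', 'vote']
```

Only the pattern argument differs between the three calls; the loop and the filtering stay the same.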
