Removing words in text files containing a character or string of letters with Python
I have a few lines of text and want to remove any word that contains a special character or a fixed given string (in Python).
Example:
in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']
# remove any word with {'amp', ':'}
out_lines = ['this is',
             'that is bad',
             'is a word']
I know how to remove words that appear in a given list, but not how to remove words that merely contain a special character or a given substring. Please let me know if anything is unclear and I'll add more information.
This is what I have for removing selected words:
def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)
This matches your expected output:
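def remove_stop_words(lines):
    stop_words = ['amp', ':']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for x in range(0, len(tmp)):
            for st_w in stop_words:
                # substring test, so 'amp' also matches 'example'
                if st_w in tmp[x]:
                    tmp[x] = ''
        # skip the blanked-out words so no stray spaces are left behind
        results.append(" ".join(w for w in tmp if w))
    return results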
in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

def remove_words(in_list, bad_list):
    out_list = []
    for line in in_list:
        # keep a word only if it contains none of the bad phrases
        words = ' '.join([word for word in line.split() if not any([phrase in word for phrase in bad_list])])
        out_list.append(words)
    return out_list

out_lines = remove_words(in_lines, ['amp', ':'])
print(out_lines)
Strange as it sounds, the expression
word for word in line.split() if not any([phrase in word for phrase in bad_list])
does all the hard work here at once. For a single word, the inner list comprehension creates a list of True/False values, one for each phrase in the "bad" list. The any function condenses this temporary list into a single True/False value again, and if this is False then the word can safely be copied into the line-based output list.
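To make that intermediate step visible, here is a small sketch that prints the temporary True/False list for each word of a single line. The names line, flags and kept are only illustrative and are not part of the function above.

```python
bad_list = ['amp', ':']
line = 'that example is bad'

kept = []
for word in line.split():
    # One boolean per phrase: does this phrase occur inside the word?
    flags = [phrase in word for phrase in bad_list]
    print(word, flags, any(flags))
    # Keep the word only when no phrase matched
    if not any(flags):
        kept.append(word)

result = ' '.join(kept)
print(result)
```

Only 'example' produces a True entry (for 'amp'), so it is the only word dropped from this line.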
As an example, the result of removing all words containing an a looks like this:
>>> remove_words(in_lines, ['a'])
['this is go:od', 'is', 'is word']
(It is possible to remove the for line in .. line as well. At that point, readability really starts to suffer, though.)
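For illustration, a version without the explicit for line in .. loop might look like the sketch below. The name remove_words_compact is hypothetical. The sketch also passes generator expressions instead of lists to join and any, which lets any stop at the first matching phrase; the result should be the same as with the loop-based function.

```python
def remove_words_compact(in_list, bad_list):
    # Outer comprehension: one output string per input line.
    # Inner generator: keep words containing none of the bad phrases.
    return [
        ' '.join(word for word in line.split()
                 if not any(phrase in word for phrase in bad_list))
        for line in in_list
    ]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']
out_lines = remove_words_compact(in_lines, ['amp', ':'])
print(out_lines)
```

Whether this is an improvement is a matter of taste; the nested form is shorter but harder to step through in a debugger.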