
Removing words in text files containing a character or string of letters with Python

I have a few lines of text and want to remove any word that contains a special character or a given fixed string (in Python).

Example:

in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

# remove any word with {'amp', ':'}
out_lines = ['this is', 
             'that is bad', 
             'is a word']

I know how to remove words that appear in a given list, but I cannot remove words that merely contain a special character or a few given letters. Please let me know if anything is unclear and I'll add more information.

This is what I have for removing selected words:

def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)

This matches your expected output:

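def remove_stop_words(lines):
    stop_words = ['amp', ':']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for x in range(0, len(tmp)):
            for st_w in stop_words:
                # Blank out any word that contains a stop string
                if st_w in tmp[x]:
                    tmp[x] = ''
        # Drop the blanked words so no extra spaces are left behind
        results.append(" ".join(w for w in tmp if w))
    return results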
in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

def remove_words(in_list, bad_list):
    out_list = []
    for line in in_list:
        words = ' '.join([word for word in line.split() if not any([phrase in word for phrase in bad_list]) ])
        out_list.append(words)
    return out_list

out_lines = remove_words(in_lines, ['amp', ':'])
print(out_lines)

Strange as it sounds, the expression

word for word in line.split() if not any([phrase in word for phrase in bad_list])

does all the hard work here at once. For a single word, it builds a list of True / False values, one for each phrase in the "bad" list. The any function then condenses this temporary list into a single True / False value, and if that value is False, the word can safely be kept in the output for that line.
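To make the intermediate values visible, here is a small sketch that evaluates the same test one word at a time. The helper name flags_for is made up for this illustration and is not part of the answer's code.

```python
def flags_for(word, bad_list):
    # One True/False entry per phrase: is the phrase a substring of the word?
    return [phrase in word for phrase in bad_list]

bad_list = ['amp', ':']
for word in ['go:od', 'example', 'word']:
    flags = flags_for(word, bad_list)
    # any() collapses the list into a single True/False value
    print(word, flags, any(flags))
# go:od [False, True] True
# example [True, False] True
# word [False, False] False
```

Only the last word produces False from any, so it is the only one of the three that would be kept.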

As an example, the result of removing all words containing an a looks like this:

>>> remove_words(in_lines, ['a'])
['this is go:od', 'is', 'is word']

(It is possible to remove the for line in ... loop as well, by folding it into the comprehension. At that point, readability really starts to suffer, though.)
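For the curious, one way the fully nested form might look is sketched below. The name remove_words_nested is hypothetical, and any is given a generator instead of a list, which yields the same result here.

```python
def remove_words_nested(in_list, bad_list):
    # Outer comprehension walks the lines; the inner one filters the words
    return [' '.join(word for word in line.split()
                     if not any(phrase in word for phrase in bad_list))
            for line in in_list]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

print(remove_words_nested(in_lines, ['amp', ':']))
# ['this is', 'that is bad', 'is a word']
```

It returns the same list as remove_words, but the loop version is easier to read and to debug.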
