Removing words in text files containing a character or string of letters with Python

I have a few lines of text and want to remove any word that contains a special character or a given fixed string (in Python).

Example:

in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

# remove any word with {'amp', ':'}
out_lines = ['this is', 
             'that is bad', 
             'is a word']

I know how to remove words that appear in a given list, but I cannot work out how to remove words that merely contain a special character or a given sequence of letters. Please let me know if anything is unclear and I'll add more information.

This is what I have for removing selected words:

def remove_stop_words(lines):
    stop_words = ['am', 'is', 'are']
    results = []
    for text in lines:
        tmp = text.split(' ')
        for stop_word in stop_words:
            for x in range(0, len(tmp)):
                if tmp[x] == stop_word:
                    tmp[x] = ''
        results.append(" ".join(tmp))
    return results

out_lines = remove_stop_words(in_lines)

This matches your expected output:

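def remove_stop_words(lines):
  stop_words = ['amp', ':']
  results = []
  for text in lines:
    kept = []
    for word in text.split(' '):
      # Keep the word only if it contains none of the stop strings.
      # Dropping the word (instead of blanking it) avoids stray spaces.
      if not any(st_w in word for st_w in stop_words):
        kept.append(word)
    results.append(" ".join(kept))
  return results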
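The difference from the code in the question comes down to the comparison: == only matches a whole word, while in matches a substring anywhere inside the word. A minimal sketch of the two tests side by side (the helper names matches_exactly and contains_any are made up for illustration and appear nowhere in the original code):

```python
# Hypothetical helpers, named here only for illustration.
def matches_exactly(word, patterns):
    # True only if the whole word equals one of the patterns.
    return word in patterns

def contains_any(word, patterns):
    # True if any pattern occurs anywhere inside the word.
    return any(p in word for p in patterns)

patterns = ['amp', ':']
for word in ['go:od', 'example', 'amp', 'bad']:
    print(word, matches_exactly(word, patterns), contains_any(word, patterns))

# Prints:
# go:od False True
# example False True
# amp True True
# bad False False
```

Only the substring test catches go:od and example, which is what the expected output in the question requires.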
in_lines = ['this is go:od', 
            'that example is bad', 
            'amp is a word']

def remove_words(in_list, bad_list):
    out_list = []
    for line in in_list:
        words = ' '.join([word for word in line.split() if not any([phrase in word for phrase in bad_list]) ])
        out_list.append(words)
    return out_list

out_lines = remove_words(in_lines, ['amp', ':'])
print(out_lines)

Strange as it sounds, the expression

word for word in line.split() if not any([phrase in word for phrase in bad_list])

does all the hard work at once. For a single word, the inner list comprehension produces one True / False value per phrase in the "bad" list, telling whether that phrase occurs in the word. The any function condenses this temporary list into a single True / False value, and only if that value is False is the word copied into the output for the current line.
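To see the intermediate values, the check can be unrolled for a single word. This is only an illustrative sketch; the function name explain and the variables flags and keep are assumptions made here, not part of the answer's code.

```python
def explain(word, bad_list):
    # One True/False entry per phrase: does the phrase occur in the word?
    flags = [phrase in word for phrase in bad_list]
    # any() collapses the list to a single value; the word is kept
    # only when no phrase matched.
    keep = not any(flags)
    return flags, keep

bad_list = ['amp', ':']
for word in ['example', 'is', 'go:od']:
    print(word, *explain(word, bad_list))

# Prints:
# example [True, False] False
# is [False, False] True
# go:od [False, True] False
```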

As an example, the result of removing all words containing the letter a looks like this:

>>> remove_words(in_lines, ['a'])
['this is go:od', 'is', 'is word']

(It is possible to fold the for line in ... loop into the comprehension as well. At that point, readability really starts to suffer, though.)
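For reference, a sketch of that fully nested form might look like the following. The name remove_words_oneliner is made up here; the behaviour is intended to match remove_words, and generator expressions are used in place of the temporary lists.

```python
def remove_words_oneliner(in_list, bad_list):
    # Outer comprehension: one output string per input line.
    # Inner generator: keep only words containing none of the phrases.
    return [
        ' '.join(word for word in line.split()
                 if not any(phrase in word for phrase in bad_list))
        for line in in_list
    ]

in_lines = ['this is go:od',
            'that example is bad',
            'amp is a word']

print(remove_words_oneliner(in_lines, ['amp', ':']))
# ['this is', 'that is bad', 'is a word']
```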
