
Removing a word that contains symbols such as "#", "@", or ":" in Python

I have just started learning Python this semester, and we were given some revision exercises. However, I am stuck on one of the questions. The text file we were given contains tweets from the 2016 US elections. A sample is below:

I wish they would show out takes of Dick Cheney #GOPdebates
Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us. 
It seems like Ben Carson REALLY doesn't want to be there. #GOPdebates
RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate

The question requires me to write a Python program that reads from the file tweets.txt, where each line contains one tweet. For each tweet, the program should remove any word that is less than 8 characters long, and also any word that contains a hash (#), at (@), or colon (:) character. What I have now:

for line in open("tweets.txt"):
  aline=line.strip()
  words=aline.split()
  length=len(words)
  remove=['#','@',':']
  for char in words:
    if "#" in char:
      char=''
    if "@" in char:
      char=''
    if ":" in char:
      char=''

This did not work: the resulting list still contains words with @, # or :. Any help is appreciated. Thank you!

Assigning char = '' in the loop does not change or remove the actual element in the list (a word, despite the name char); it just assigns a different value to the variable char.
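To make the difference concrete, here is a small sketch using one of the sample tweets. The names after_rebinding and after_index_assignment are only there for illustration; the point is that rebinding the loop variable leaves the list alone, while assigning through an index changes it.

```python
words = ["RT", "@ColorOfChange:", "Or", "better", "said:", "#KKKorGOP"]

# Rebinding the loop variable only changes what the name `word` refers to.
for word in words:
    if "#" in word:
        word = ""
after_rebinding = list(words)  # snapshot: "#KKKorGOP" is still in the list

# Assigning through the index replaces the element in the list itself.
for i, word in enumerate(words):
    if "#" in word:
        words[i] = ""
after_index_assignment = list(words)  # snapshot: last element is now ""
```

Even with index assignment the list keeps an empty string in that position, so filtering into a new list is usually the cleaner route.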

Instead, you might use a list comprehension / generator expression for filtering the words that satisfy the conditions.

>>> tweet = "Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us."
>>> [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
['Candidates', 'remained']

Optionally, use ' '.join(...) to join the remaining words back into a "sentence", although the result might not make much sense.
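Put together for the whole file, the idea might look like the sketch below. The names BANNED, MIN_LENGTH, filter_tweet and filter_lines are hypothetical helpers, not part of the exercise, and words are split on whitespace only, so attached punctuation counts toward a word's length.

```python
BANNED = "#@:"   # characters that disqualify a word
MIN_LENGTH = 8   # words shorter than this are dropped


def filter_tweet(tweet):
    """Return the words that are long enough and contain no banned character."""
    return [
        w for w in tweet.split()
        if len(w) >= MIN_LENGTH and not any(c in w for c in BANNED)
    ]


def filter_lines(lines):
    """Yield each tweet's remaining words joined back into one string."""
    for line in lines:
        yield " ".join(filter_tweet(line))


# Usage with the file from the question:
# with open("tweets.txt") as tweets:
#     for cleaned in filter_lines(tweets):
#         print(cleaned)
```

Because filter_lines accepts any iterable of strings, it can be tried on a plain list of tweets before pointing it at the file.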

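You can use a regular expression. Substituting only the symbols themselves would leave the rest of the word behind, so match the whole word that contains one:

import re
# Remove every whitespace-delimited word containing '#', '@' or ':'.
# Leftover extra spaces are ignored by a later tweet.split().
tweet = re.sub(r'\S*[#@:]\S*', '', tweet)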

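The code below opens the file (it's usually better to use "with open" when working with files), loops through all the lines and removes the '#', '@' and ':' characters using translate. It then removes the words with fewer than 8 characters, giving you the output new_line. Note that translate strips only the symbols themselves, so a word such as @HillaryClinton is kept as HillaryClinton.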

with open('tweets.txt') as rf:
    for sentence in rf:
        line = sentence.strip()
        line = line.translate({ord(i): None for i in '#@:'})
        line = line.split()
        new_line = [ word for word in line if len(word) >= 8 ]
        print(new_line)

It's not the most succinct way, and there are definitely better ways to do it, but it's probably a bit easier to read and understand, seeing as you've just started learning, like me.
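If the words containing a symbol should be dropped entirely, as the exercise asks, one option is to keep translate but use it as a test: a word that translate leaves unchanged has none of the three symbols. This is only a sketch, and the names TABLE and clean_line are made up for illustration.

```python
# Translation table that deletes '#', '@' and ':' from a string.
TABLE = {ord(c): None for c in "#@:"}


def clean_line(sentence):
    """Keep words of 8+ characters that contain none of the symbols."""
    words = sentence.strip().split()
    # translate() returns an equal string only when nothing was deleted,
    # i.e. when the word contains no '#', '@' or ':'.
    return [w for w in words if len(w) >= 8 and w.translate(TABLE) == w]


# Usage with the file from the question:
# with open("tweets.txt") as rf:
#     for sentence in rf:
#         print(clean_line(sentence))
```

The length check is applied to the original word, so a word is never kept just because stripping a symbol happened to leave eight or more characters.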
