I have just started learning Python this semester, and we were given some revision exercises. However, I am stuck on one of the questions. The text file we were given contains tweets from the 2016 US elections. A sample is below:
I wish they would show out takes of Dick Cheney #GOPdebates
Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us.
It seems like Ben Carson REALLY doesn't want to be there. #GOPdebates
RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate
The question requires me to write a Python program that reads from the file tweets.txt. Remember that each line contains one tweet. For each tweet, the program should remove any word that is less than 8 characters long, and also any word that contains a hash (#), at (@), or colon (:) character. What I have now:
for line in open("tweets.txt"):
    aline=line.strip()
    words=aline.split()
    length=len(words)
    remove=['#','@',':']
    for char in words:
        if "#" in char:
            char=''
        if "@" in char:
            char=''
        if ":" in char:
            char=''
This did not work: the resulting list still contains words with @, # or :. Any help appreciated! Thank you!
Assigning char='' in the loop does not change or remove the actual char (actually a word) in the list; it just assigns a different value to the variable char.
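To make the difference concrete, here is a minimal sketch using a hypothetical three-word list (the names words, unchanged and kept are only for illustration). It shows that rebinding the loop variable leaves the list as it was, while appending to a new list actually filters it:

```python
# Hypothetical sample list, not taken from tweets.txt.
words = ["@HillaryClinton", "Candidates", "#GOPdebate-but"]

# Rebinding the loop variable only changes what the name `word` refers to;
# the list element itself is untouched.
for word in words:
    if "@" in word:
        word = ""
unchanged = list(words)

# Building a new list keeps only the words that pass the test.
kept = []
for word in words:
    if "@" not in word and "#" not in word:
        kept.append(word)

print(unchanged)
print(kept)
```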
Instead, you might use a list comprehension / generator expression for filtering the words that satisfy the conditions.
>>> tweet = "Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us."
>>> [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
['Candidates', 'remained']
Optionally, use ' '.join(...) to join the remaining words back to a "sentence", although that might not make too much sense.
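Applied to the whole file, the same filter could be wrapped in a small function. This is only a sketch: the names filter_tweet and filter_file and the parameters min_length and banned are made up for illustration, and the file is assumed to hold one tweet per line as the exercise states.

```python
def filter_tweet(tweet, min_length=8, banned="#@:"):
    """Return the words of one tweet that pass both filters."""
    return [
        word
        for word in tweet.split()
        # Keep a word only if it is long enough and has no banned character.
        if len(word) >= min_length and not any(ch in word for ch in banned)
    ]


def filter_file(path):
    """Return one filtered word list per line of the file at `path`."""
    with open(path) as tweets:
        return [filter_tweet(line) for line in tweets]


sample = ("Candidates went after @HillaryClinton 32 times in the "
          "#GOPdebate-but remained silent about the issues that affect us.")
print(filter_tweet(sample))
```

Calling filter_file("tweets.txt") would then give one list of surviving words per tweet.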
Use this code (here tweet is a string holding one tweet):
import re
tweet = re.sub(r'#', '', tweet)
tweet = re.sub(r'@', '', tweet)
tweet = re.sub(r':', '', tweet)
The code below opens the file (it's usually better to use "with open" when working with files), loops through all the lines and removes the characters '#@:' using translate. It then removes the words with fewer than 8 characters, giving you the output "new_line".
with open('tweets.txt') as rf:
    for sentence in rf:
        line = sentence.strip()
        line = line.translate({ord(i): None for i in '#@:'})
        line = line.split()
        new_line = [ word for word in line if len(word) >= 8 ]
        print(new_line)
It's not the most succinct way, and there are definitely better ways to do it, but it's probably a bit easier to read and understand, seeing as you've just started learning, like me.
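One thing to watch for: translate removes the characters themselves but keeps the rest of the word, so "@HillaryClinton" becomes "HillaryClinton" and survives the length filter. If the exercise means that such words should be dropped entirely, the two behaviours differ. A small sketch of the comparison, using a shortened sample line (the names strip_result and drop_result are only for illustration):

```python
line = "Candidates went after @HillaryClinton 32 times"

# Variant 1: strip the characters first, then filter by length.
stripped = line.translate({ord(c): None for c in "#@:"}).split()
strip_result = [w for w in stripped if len(w) >= 8]

# Variant 2: drop every word that contains one of the characters.
drop_result = [
    w for w in line.split()
    if len(w) >= 8 and not any(c in w for c in "#@:")
]

print(strip_result)
print(drop_result)
```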