Removing a word that contains symbols such as "#", "@", or ":" in Python

I just started learning Python this semester, and we were given some revision exercises. However, I am stuck on one of the questions. The text file we were given contains tweets from the 2016 US elections. A sample is below:

I wish they would show out takes of Dick Cheney #GOPdebates
Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us. 
It seems like Ben Carson REALLY doesn't want to be there. #GOPdebates
RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate

The question requires me to write a Python program that reads from the file tweets.txt. Remember that each line contains one tweet. For each tweet, your program should remove any word that is less than 8 characters long, and also any word that contains a hash (#), at (@), or colon (:) character. What I have now:

for line in open("tweets.txt"):
  aline=line.strip()
  words=aline.split()
  length=len(words)
  remove=['#','@',':']
  for char in words:
    if "#" in char:
      char=''
    if "@" in char:
      char=''
    if ":" in char:
      char=''

This did not work: the resulting list still contains words with @, # or :. Any help appreciated! Thank you!

Assigning char='' in the loop does not change or remove the actual element in the list (which is really a word, not a character); it just assigns a different value to the variable char.
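To make the difference concrete, here is a minimal sketch. The word list is one of the sample tweets already split, and the names unchanged and blanked are only illustrative. It contrasts rebinding the loop variable with assigning through an index:

```python
# One sample tweet, already split into words.
words = ["RT", "@ColorOfChange:", "Or", "better", "said:", "#KKKorGOP", "#GOPDebate"]

# Rebinding the loop variable only changes the local name, never the list.
for word in words:
    if "#" in word:
        word = ""

unchanged = list(words)  # a copy: the hashtag words are all still there

# Assigning through an index, by contrast, does modify the list.
blanked = list(words)
for i, word in enumerate(blanked):
    if "#" in word:
        blanked[i] = ""

print(unchanged)
print(blanked)
```

Even the index version leaves empty strings behind rather than removing the words, so filtering into a new list is usually the cleaner route.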

Instead, you might use a list comprehension or generator expression to keep only the words that satisfy the conditions.

>>> tweet = "Candidates went after @HillaryClinton 32 times in the #GOPdebate-but remained silent about the issues that affect us."
>>> [w for w in tweet.split() if not any(c in w for c in "#@:") and len(w) >= 8]
['Candidates', 'remained']

Optionally, use ' '.join(...) to join the remaining words back into a "sentence", although the result might not make much sense.
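Putting the pieces together, a possible sketch of the per-tweet step might look like the following. The function name filter_tweet and its parameters min_length and banned are hypothetical, chosen here for illustration, and the sample lines stand in for the contents of tweets.txt:

```python
def filter_tweet(tweet, min_length=8, banned="#@:"):
    """Drop words shorter than min_length and words containing a banned character."""
    kept = [
        word
        for word in tweet.split()
        if len(word) >= min_length and not any(ch in word for ch in banned)
    ]
    return " ".join(kept)


# Stand-in for the lines of tweets.txt (two of the sample tweets).
sample_lines = [
    "I wish they would show out takes of Dick Cheney #GOPdebates\n",
    "Candidates went after @HillaryClinton 32 times in the #GOPdebate-but "
    "remained silent about the issues that affect us. \n",
]

for line in sample_lines:
    print(repr(filter_tweet(line)))
```

Because iterating over an open file also yields one line at a time, the same loop should work with `with open("tweets.txt") as f: for line in f: ...` in place of sample_lines.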

Use re.sub to strip the symbols from the tweet:

import re
tweet=re.sub(r'#', '',tweet )
tweet=re.sub(r'@', '',tweet )
tweet=re.sub(r':', '',tweet )
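Note that the substitutions above delete only the symbol characters themselves, so a word like #GOPDebate stays in the tweet as GOPDebate. If the goal is to drop the whole word, one possible variation (a sketch, with the hypothetical names SYMBOL_WORD and drop_symbol_words) matches every whitespace-delimited token that contains one of the symbols:

```python
import re

# Any run of non-whitespace characters that contains '#', '@' or ':'.
SYMBOL_WORD = re.compile(r"\S*[#@:]\S*")


def drop_symbol_words(tweet):
    """Remove every word containing '#', '@' or ':' and tidy the spacing."""
    without_words = SYMBOL_WORD.sub("", tweet)
    # split() + join() collapses the gaps left by the removed words.
    return " ".join(without_words.split())


tweet = "RT @ColorOfChange: Or better said: #KKKorGOP #GOPDebate"
print(drop_symbol_words(tweet))  # RT Or better
```

This handles only the symbol rule; the length rule (fewer than 8 characters) would still need a separate filter.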

The code below opens the file (it's usually better to use "with open" when working with files), loops through all the lines, and removes the characters '#@:' using translate. It then removes the words with fewer than 8 characters, giving you the output "new_line". Note that translate strips only the symbols, not the words that contain them, so #GOPdebates is kept as GOPdebates.

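with open('tweets.txt') as rf:
    for sentence in rf:
        line = sentence.strip()
        # Delete the characters '#', '@' and ':' wherever they appear
        line = line.translate({ord(i): None for i in '#@:'})
        line = line.split()
        # Keep only the words that are at least 8 characters long
        new_line = [word for word in line if len(word) >= 8]
        print(new_line)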

It's not the most succinct way, and there are definitely better ways to do it, but it's probably a bit easier to read and understand, seeing as you've just started learning, like me.
