
Code not removing desired values from dictionary

Linked: Removing escaped entities from a String in Python

My code reads in a big CSV file of tweets and parses it into two dictionaries (depending on the sentiment of the tweets). I then create a new dictionary and unescape everything using the HTML parser before using the translate() method to remove all punctuation from the text.
Finally, I am trying to keep only words whose length is at least 3.
This is my code:

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    text = HTMLParser.HTMLParser().unescape(text.decode('ascii'))
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    shortenedText = [e.lower() and e.translate(remove_punctuation_map) for e in text.split() if len(e) >= 3 and not e.startswith(('http', '@')) ]
    print shortenedText

What I'm finding, however, is that whilst most of what I want is being done, I am still getting words of length two (but not of length one), and I'm getting a few blank entries in my dictionary.
For example:

(: !!!!!! - so I wrote something last week
* enough said *
.... Do I need to say it?

Produces:

[u'', u'wrote', u'something', u'last', u'week']
[u'enough', u'said']
[u'', u'need', u'even', u'say', u'it']

What's wrong with my code? How can I remove all words less than length two including blank entries?

I think your problem is that when you test whether len(e) >= 3, e still contains punctuation, so "it?" is not filtered out. Maybe do it in two steps? Clean e of punctuation, then filter for size?
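
The effect of checking the length before stripping can be reproduced in isolation. This is a minimal sketch in Python 3 (an assumption; the question's code is Python 2), using a few tokens taken from the example tweets. The names `tokens` and `passes_filter` are made up for illustration.

```python
import string

# Same mapping as in the question: every punctuation code point maps to None.
remove_punctuation_map = {ord(char): None for char in string.punctuation}

# A few tokens taken from the example tweets in the question.
tokens = ["(:", "!!!!!!", "it?", "*", "...."]

for token in tokens:
    stripped = token.translate(remove_punctuation_map)
    # The question's filter tests len(token), i.e. the length BEFORE stripping.
    passes_filter = len(token) >= 3
    print(repr(token), len(token), passes_filter, repr(stripped), len(stripped))
```

Tokens such as "!!!!!!" and "...." pass the length test and then collapse to an empty string, which would explain the blank entries, while "it?" passes with length 3 and ends up as the two-letter word "it".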

Something like this two-step version:

cleanedText = [e.translate(remove_punctuation_map).lower() for e in text.split() if not e.startswith(('http', '@')) ]
shortenedText = [e for e in cleanedText if len(e) >= 3]
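
For readers on Python 3, the same two-step idea might look roughly like the sketch below. It is not the asker's code: `clean_tweet` and `min_length` are hypothetical names, and `html.unescape` is assumed here as the Python 3 stand-in for `HTMLParser.HTMLParser().unescape`.

```python
import html
import string

# Map every punctuation code point to None so str.translate deletes it.
REMOVE_PUNCTUATION_MAP = {ord(char): None for char in string.punctuation}


def clean_tweet(text, min_length=3):
    """Hypothetical helper: return the cleaned words of one tweet."""
    text = html.unescape(text)
    # Step 1: skip links and mentions, strip punctuation, lower-case.
    cleaned = [
        word.translate(REMOVE_PUNCTUATION_MAP).lower()
        for word in text.split()
        if not word.startswith(("http", "@"))
    ]
    # Step 2: filter on the length AFTER cleaning, which also drops
    # the empty strings left behind by punctuation-only tokens.
    return [word for word in cleaned if len(word) >= min_length]


samples = [
    "(: !!!!!! - so I wrote something last week",
    "* enough said *",
    ".... Do I need to say it?",
]
for sample in samples:
    print(clean_tweet(sample))
```

Because the length test runs on the cleaned words, neither blank entries nor two-letter leftovers such as "it" should survive.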


 