
Decoding and Encoding in Python

I have some text that I am trying to decode and encode in Python:

import html.parser

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm apple.DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"
tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')

I entered the original tweet on one line in Spyder (Python 3.6).

I get the following message:

AttributeError: 'str' object has no attribute 'decode'

Is there an alternative way to rewrite this code for Python 3.6?

In Python 3, your original_tweet is a str, that is, a sequence of Unicode code points, and it contains a Unicode emoji. A str has no decode() method (only bytes objects do), which is why you get the AttributeError. Because Unicode, with well over 100,000 characters, is a superset of the 128 ASCII characters, you cannot convert an arbitrary Unicode string into an ASCII string without losing data.
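The error from the question can be reproduced in a few lines. The following is a minimal sketch, with illustrative variable names that are not part of the original code, showing which type offers which method and that a strict ASCII encode fails on the emoji:

```python
text = "happy \U0001F642"            # a str: a sequence of Unicode code points

# str objects can be encoded to bytes, but they have no decode() method.
print(hasattr(text, "encode"))       # True
print(hasattr(text, "decode"))       # False

# bytes objects are the ones that can be decoded back into a str.
data = text.encode("utf-8")
print(type(data).__name__)           # bytes
print(data.decode("utf-8") == text)  # True

# Encoding to ASCII in the default strict mode fails on the emoji.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                     # False
```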

However, if you can live with some data loss (i.e. dropping the emoji), then you can try the following (see also the related questions on this topic):

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm ..."

# Encode the Unicode string into a sequence of UTF-8 bytes.
original_tweet_bytes = original_tweet.encode("utf-8")

# Decode those bytes into a string containing only ASCII characters,
# silently dropping every byte that is not valid ASCII. Pass errors="strict"
# to find the failing characters instead; errors="replace" is also worth
# reading up on.
original_tweet_ascii = original_tweet_bytes.decode("ascii", errors="ignore")

Or as a simple one-liner:

tweet = original_tweet.encode("utf-8").decode("ascii", errors="ignore")
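To see what the error handlers do in practice, here is a small comparison on a shortened sample string (the sample and the variable names are illustrative assumptions, not taken from the question). It also shows that encoding straight to ASCII with errors="ignore" should give the same result as the round trip through UTF-8 bytes:

```python
sample = "you’re happy \U0001F642"

# Round trip through UTF-8 bytes: every non-ASCII byte is dropped on decode.
via_bytes = sample.encode("utf-8").decode("ascii", errors="ignore")

# Encoding straight to ASCII drops each non-ASCII character instead.
direct = sample.encode("ascii", errors="ignore").decode("ascii")

# errors="replace" leaves one "?" per character that could not be encoded.
replaced = sample.encode("ascii", errors="replace").decode("ascii")

print(repr(via_bytes))   # 'youre happy '
print(repr(direct))      # 'youre happy '
print(repr(replaced))    # 'you?re happy ?'
```

Whether to ignore or replace depends on whether you want a visible marker where characters were lost.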

Note that this does not convert the HTML entities &lt; and &amp;, which you may have to address separately. You can do that using a proper HTML parser (e.g. lxml), or use a simple string replacement:

tweet = tweet.replace("&lt;", "<").replace("&amp;", "&")

Or, as of Python 3.4, you can use html.unescape() like so:

tweet = html.unescape(tweet)
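Putting both steps together, a possible helper could look like the sketch below. The function name clean_tweet is hypothetical, and the sample string is a shortened version of the tweet. Unescaping first is a deliberate choice here: numeric entities such as &#8217; turn into non-ASCII characters, which the ASCII step then removes as well.

```python
import html

def clean_tweet(raw):
    """Unescape HTML entities, then drop every non-ASCII character."""
    unescaped = html.unescape(raw)
    return unescaped.encode("ascii", errors="ignore").decode("ascii")

raw = "I luv my &lt;3 iphone &amp; you’re awsm \U0001F642"
print(repr(clean_tweet(raw)))   # 'I luv my <3 iphone & youre awsm '
```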

See also this question on how to handle HTML entities in strings.

Addendum. The Unidecode package for Python seems to provide useful functionality for this, too, although in its current version it does not handle emojis.
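If installing a third-party package is not an option, the standard library offers something in a similar spirit for accented letters. Note that this is not Unidecode's API but a different technique (NFKD normalization via unicodedata), and the helper name to_ascii is hypothetical. Characters without an ASCII decomposition, such as emojis or the curly apostrophe, are still dropped:

```python
import unicodedata

def to_ascii(text):
    """Split accented letters via NFKD, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("café naïve"))   # cafe naive
```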
