
Decoding and Encoding in Python

I have some text that I am trying to decode and encode in Python:

import html.parser

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm apple.DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"
tweet = original_tweet.decode("utf8").encode('ascii', 'ignore')

I entered the original tweet on one line in Spyder (Python 3.6).

I get the following error:

AttributeError: 'str' object has no attribute 'decode'

Is there an alternative way to rewrite this code for Python 3.6?

In Python 3, every str is already a sequence of Unicode code points, so it has no decode() method; only bytes objects can be decoded. Your original_tweet string contains non-ASCII characters, including an emoji. Because ASCII covers only 128 characters while Unicode defines well over 100,000, you cannot convert an arbitrary Unicode string into ASCII without losing data.
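As a minimal sketch of why the original call fails, the snippet below shows that a Python 3 str can only be encoded and a bytes object can only be decoded. The sample text is made up for illustration and is not the tweet from the question.

```python
# Hypothetical sample text: "café" followed by a space and an emoji.
text = "caf\u00e9 \U0001F642"

# A str is a sequence of Unicode code points: it has encode() but no decode().
print(hasattr(text, "encode"))
print(hasattr(text, "decode"))

# Encoding produces a bytes object, which is the type that has decode().
data = text.encode("utf-8")
print(type(data).__name__)
print(hasattr(data, "decode"))

# Decoding with the same codec restores the original string.
round_trip = data.decode("utf-8")
print(round_trip == text)
```

This is why original_tweet.decode("utf8") raises AttributeError in Python 3: the object is already decoded text.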

However, if you can live with some data loss (i.e. dropping the emoji and any other non-ASCII characters), you can try the following (see also the related questions on this topic):

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm ..."

# Encode the Unicode string as UTF-8, which yields a bytes object.
original_tweet_bytes = original_tweet.encode("utf-8")

# Decode those bytes as ASCII. errors="ignore" silently drops every byte that
# is not valid ASCII. Use errors="strict" to raise an exception on the first
# such byte instead, or errors="replace" to substitute U+FFFD for each one.
original_tweet_ascii = original_tweet_bytes.decode("ascii", errors="ignore")

Or as a simple one-liner:

tweet = original_tweet.encode("utf-8").decode("ascii", errors="ignore")
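To make the effect of the errors argument concrete, here is a small sketch that runs the same conversion with each handler. The sample string is hypothetical. It also shows an equivalent route that encodes straight to ASCII, which is not part of the answer above but gives the same result with errors="ignore".

```python
# Hypothetical sample containing an emoji and a curly apostrophe.
sample = "happy \U0001F642 you\u2019re"
utf8_bytes = sample.encode("utf-8")

# errors="ignore": every non-ASCII byte is dropped.
ignored = utf8_bytes.decode("ascii", errors="ignore")
print(repr(ignored))

# errors="replace": each undecodable byte becomes U+FFFD.
replaced = utf8_bytes.decode("ascii", errors="replace")
print(repr(replaced))

# errors="strict" (the default): the first non-ASCII byte raises an exception.
try:
    utf8_bytes.decode("ascii", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict failed at byte offset", exc.start)

# Equivalent route: encode directly to ASCII, then decode back to str.
direct = sample.encode("ascii", errors="ignore").decode("ascii")
print(direct == ignored)

# When encoding, errors="replace" writes one "?" per unencodable character.
question_marks = sample.encode("ascii", errors="replace").decode("ascii")
print(repr(question_marks))
```

The direct route avoids the detour through UTF-8 bytes, and its "replace" variant stays within ASCII, which the decoding variant does not.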

Note that this does not convert the HTML entities &lt; and &amp;, which you may have to handle separately. You can do that with a proper HTML parser (e.g. lxml) or with a simple string replacement:

tweet = tweet.replace("&lt;", "<").replace("&amp;", "&")

Or, as of Python 3.4, you can use html.unescape() like so:

import html
tweet = html.unescape(tweet)
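Putting the two steps together, a sketch of the whole cleanup might look like the following. The input string is a shortened, hypothetical version of the tweet, and the variable names are chosen for illustration only.

```python
import html

# Hypothetical input: HTML entities, a curly apostrophe, and an emoji.
raw = "I luv my &lt;3 iphone &amp; you\u2019re awsm \U0001F642"

# Step 1: drop everything that is not ASCII.
ascii_only = raw.encode("ascii", errors="ignore").decode("ascii")

# Step 2: turn HTML entities back into the characters they stand for.
clean = html.unescape(ascii_only).strip()
print(clean)

# Caveat: unescaping can itself produce non-ASCII characters.
accented = html.unescape("&eacute;")
print(accented.isascii())
```

If the result must be strictly ASCII, it may be safer to unescape first and strip non-ASCII characters afterwards, because entities such as &eacute; expand to non-ASCII characters.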

See also this question on how to handle HTML entities in strings.

Addendum. The Unidecode package for Python seems to provide useful functionality for this, too, although in its current version it does not handle emojis.
