Decoding HTML-encoded strings in Python

I have the following string...

"Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

I need to turn it into this string...

Scam, hoax, or the real deal, he's gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

This is pretty standard HTML encoding, and I can't for the life of me figure out how to convert it in Python.

I found this: GitHub

It's very close to working; however, it does not output an apostrophe but some odd Unicode character instead.

Here is an example of the output from the GitHub script...

Scam, hoax, or the real deal, heâs gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

What you're trying to do is called "HTML entity decoding", and it's covered in a number of past Stack Overflow questions.

Here's a code snippet using the Beautiful Soup HTML parsing library to decode your example:

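#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written for Python 2 and BeautifulSoup 3 (the legacy "BeautifulSoup" package).
from BeautifulSoup import BeautifulSoup

string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
# convertEntities makes the parser turn HTML entities into Unicode characters.
s = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s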

Here's the output:

Scam, hoax, or the real deal, he's gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
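
If you're on Python 3.4 or later, the standard library's html.unescape does the same decoding without a third-party parser. Here's a minimal sketch, not a drop-in replacement for the snippet above. The entity `&#8217;` in the shortened sample string is an assumption about what the original text contained; it stands for U+2019, the curly right single quotation mark. The last step shows one plausible way the odd "â" output can arise: the UTF-8 bytes of that character get read back as Latin-1.

```python
import html

# Assumption: the source text carried the numeric entity for U+2019
# (RIGHT SINGLE QUOTATION MARK). The sample is shortened for illustration.
encoded = "he&#8217;s gonna work his way to the bottom of the sordid tale"

# html.unescape converts named and numeric character references to Unicode.
decoded = html.unescape(encoded)
print(decoded)

# The decoded character is a curly apostrophe, not ASCII "'".
# Replace it explicitly if a plain apostrophe is what you need.
plain = decoded.replace("\u2019", "'")
print(plain)

# Reproducing the mojibake: encode as UTF-8, then wrongly decode as Latin-1.
# U+2019 becomes three characters, the first of which is "â".
garbled = decoded.encode("utf-8").decode("latin-1")
print(repr(garbled))
```

So the decoding itself may already be correct in the GitHub script's output; the garbling would then come from the terminal or file being read with the wrong encoding.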
