Python BeautifulSoup Ampersand issue Mac vs. Linux Ubuntu

I've read that BeautifulSoup has problems with bare ampersands (&), which are not strictly valid HTML but are still interpreted correctly by most browsers. Weirdly, though, I'm getting different behaviour on a Mac system and on an Ubuntu system, both using bs4 version 4.3.2:

import bs4

html = '<td>S&P500</td>'
s = bs4.BeautifulSoup(html)

On the Ubuntu system s is equal to:

<td>S&amp;P500;</td>

Notice the added semicolon at the end, which is a real problem.

On the Mac system:

<html><head></head><body>S&amp;P500</body></html>

Never mind the html/head/body tags, I can deal with that, but notice S&P 500 is correctly interpreted this time, without the added ";".

Any idea what's going on? How can I write cross-platform code without resorting to an ugly hack? Thanks a lot.

First, I can't reproduce the Mac results using Python 2.7.1 and beautifulsoup4 4.3.2; that is, I am getting the extra semicolon on all systems.

The easy fix is a) use strictly valid HTML, or b) add a space after the ampersand. Chances are you can't change the source, and if you could parse out and replace these in python you wouldn't be needing BeautifulSoup ;)
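If you do want to try option a) on input you can't control, one possible pre-processing step is sketched below. It is only an illustration using the standard library: `escape_bare_ampersands` is a hypothetical helper, not part of bs4, and it assumes that any "&" not starting a numeric reference or a known named entity ending in ";" should become "&amp;". It works on the raw string, so it will also rewrite ampersands inside script blocks, which may not be what you want.

```python
import re
from html.entities import name2codepoint

# "&" optionally followed by something that looks like a complete reference:
# a decimal or hex numeric reference, or a name terminated by ";".
_REF = re.compile(r"&(#\d+;|#[xX][0-9a-fA-F]+;|[a-zA-Z][a-zA-Z0-9]*;)?")

def escape_bare_ampersands(html):
    """Hypothetical helper: escape every '&' that is not a recognised reference."""
    def fix(match):
        ref = match.group(1)
        if ref is None:
            # Lone ampersand, e.g. the one in "S&P500".
            return "&amp;"
        if ref.startswith("#") or ref[:-1] in name2codepoint:
            # Numeric reference or known entity name: leave untouched.
            return match.group(0)
        # Looks like an entity but the name is unknown, e.g. "&P500;".
        return "&amp;" + ref
    return _REF.sub(fix, html)

print(escape_bare_ampersands('<td>S&P500</td>'))    # <td>S&amp;P500</td>
print(escape_bare_ampersands('<td>AT&amp;T</td>'))  # <td>AT&amp;T</td>
```

The cleaned string can then be handed to BeautifulSoup as usual.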

So the problem is that the BeautifulSoupHTMLParser first converts &P500 to &P500; because it assumes P500 is an entity name and you just forgot the semicolon.

Then later it reparses the string and finds &P500;. Now it doesn't recognize P500 as a valid name, so it converts the & to &amp; without touching the rest, which leaves the stray semicolon behind.
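The hand-off described above can be seen with nothing but the standard library. The following is a minimal sketch, not bs4's actual code: `EntityEcho` and `extract_text` are hypothetical names, and the `add_semicolon` flag simply switches between the "&%s;" and "&%s" fallbacks for unknown names. It assumes Python 3, where `convert_charrefs=False` has to be passed for `handle_entityref` to be called at all.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint

class EntityEcho(HTMLParser):
    """Hypothetical parser that records entity names and rebuilds the text."""

    def __init__(self, add_semicolon):
        # convert_charrefs=False makes the parser report entity references
        # through handle_entityref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.add_semicolon = add_semicolon
        self.names = []
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, name):
        self.names.append(name)
        if name in name2codepoint:
            # Known entity such as "amp": emit the real character.
            self.text.append(chr(name2codepoint[name]))
        elif self.add_semicolon:
            # Unknown name, assume the semicolon was forgotten.
            self.text.append("&%s;" % name)
        else:
            # Unknown name, put back exactly what was consumed.
            self.text.append("&%s" % name)

def extract_text(html, add_semicolon):
    parser = EntityEcho(add_semicolon)
    parser.feed(html)
    parser.close()
    return parser.names, "".join(parser.text)

print(extract_text('<td>S&P500</td>', add_semicolon=True))   # (['P500'], 'S&P500;')
print(extract_text('<td>S&P500</td>', add_semicolon=False))  # (['P500'], 'S&P500')
```

The parser reports P500 as an entity name even though the source has no semicolon, so whether a ";" shows up in the result depends entirely on what the handler writes back.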

Here is a stupid monkeypatch, only to demonstrate my point. I don't know the inner workings of BeautifulSoup well enough to propose a proper solution.

from bs4 import BeautifulSoup
from bs4.builder._htmlparser import BeautifulSoupHTMLParser
from bs4.dammit import EntitySubstitution

def handle_entityref(self, name):
    character = EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # Previously was
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

html = '<td>S&P500</td>'

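# Name the parser explicitly: the patch below only affects html.parser,
# and bs4 otherwise picks whichever parser happens to be installed.

# Pre monkeypatching
# <td>S&amp;P500;</td>
print(BeautifulSoup(html, "html.parser"))

BeautifulSoupHTMLParser.handle_entityref = handle_entityref

# Post monkeypatching
# <td>S&amp;P500</td>
print(BeautifulSoup(html, "html.parser"))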

Hopefully someone more versed in bs4 can give you a proper solution, good luck.
