
BeautifulSoup soup.prettify() gives strange output

I'm trying to parse a website so I can use the data later in my Django project. I'm using urllib2 and BeautifulSoup4 for this, but I can't get the result I want: the output of the BeautifulSoup object is garbled. Other pages I tried parse normally, so I assumed the problem was this particular page. However, when a friend ran the same code against it, he got normal output. I can't figure out what the problem is.

This is the website I'm going to parse.

This is an example of the weird output after the command "soup.prettify()":

t   d       B   G   C   O   L   O   R   =   "   #   9   9   0   4   0   4   "       w   i   d   t   h   =   "   3   "   &gt;   i   m   g       S   R   C   =   "   1   p   .   g   i   f   "       A   L   T       B   O   R   D   E   R   =   "   0   "       h   e   i   g   h   t   =   "   1   "       w   i   d   t   h   =   "   3   "   &gt;   /   t   d   &gt;   \n           /   t   r   &gt;   \n           t   r   &gt;   \n                   t   d       c   o   l   s   p   a   n   =   "   3   "       B   G   C   O   L   O   R   =   "   #   9   9   0   4   0   4   "       w   i   d   t   h   =   "   6   0   0   "       h   e   i   g   h   t   =   "   3   "   &gt;   i   m   g       s   r   c   =   "   1   p   .   g   i   f   "       w   i   d   t   h   =   "   6   0   0   "   \n                   h   e   i   g   h   t   =   "   1   "   &gt;   /   t   d   &gt;   \n           /   t   r   &gt;   \n   /   t   a   b   l   e   &gt;   \n   /   c   e   n   t   e   r   &gt;   /   d   i   v   &gt;   \n   \n   p   &gt;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   &amp;n   b   s   p   ;   /   p   &gt;   \n   /   b   o   d   y   &gt;   \n   /   h   t   m   l   &gt;\n  </p>\n </body>\n</html>'

Here is a minimal example that does work for me, including the snippet of html that you have a problem with. It's hard to tell without your code, but my guess is you did something like ' '.join(A.split()) somewhere.

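import urllib2
import bs4

url = "http://kafemud.bilkent.edu.tr/monu_tr.html"
# Fetch the page and keep the raw bytes exactly as served
response = urllib2.urlopen(url)
raw = response.read()
# No parser is named, so BeautifulSoup picks the best one installed
soup = bs4.BeautifulSoup(raw)

# prettify() returns unicode; encode it before printing (Python 2)
print soup.prettify().encode('utf-8')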

Giving:

....
<td bgcolor="#990404" width="3">
       <img alt="" border="0" src="1p.gif" width="3"/>
      </td>
      <td bgcolor="#FFFFFF" valign="TOP">
       <div align="left">
        <table align="left" border="0" cellpadding="10" cellspacing="0" valign="TOP" width="594">
         <tr>
          <td align="left" valign="top">
           <table align="left" border="0" cellpadding="0" cellspacing="0" class="icerik" width="574">
....

Possibly you and your friend are using different parsers. BeautifulSoup picks the parser it considers "best", so it prefers lxml for speed reasons if lxml is installed. With a recent version of Python (and therefore the latest version of the built-in parser), some cases are handled better by BeautifulSoup(text, 'html.parser'); this is the case, for example, when the text content contains unescaped < characters (instead of &lt;).
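To rule out a parser difference between two machines, the parser can be named explicitly. The snippet below is a minimal sketch in Python 3 syntax, assuming bs4 is installed; the sample markup is made up for illustration and is not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the page from the question.
html = "<html><body><p>hello</p></body></html>"

# Naming the parser makes the result independent of whether lxml
# happens to be installed on a given machine.
soup = BeautifulSoup(html, "html.parser")

text = soup.p.get_text()
print(text)
```

If both of you pass the same parser name and still see different output, the parser is probably not the cause.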

It looks like your HTML is arriving in an encoding that BeautifulSoup isn't expecting. My guess is that the page is in UTF-16 and BeautifulSoup is reading it as UTF-8, which would explain the gap after every character: UTF-16 stores each ASCII character with an extra zero byte. Python offers the .encode and .decode methods for converting between text and bytes in a given encoding. Decoding the raw bytes as UTF-16 before handing them to BeautifulSoup, something like

myXmlStr.decode("utf-16")

would probably solve your problem if the issue is the encoding of the incoming page. I'm new to Beautiful Soup myself, but a quick Google search suggests that if the problem is the encoding of the output, prettify accepts an encoding parameter:

soup.prettify("utf-16")

Without more information I can't give you a clearer answer - but hopefully this points you in a helpful direction.
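The UTF-16 guess above can be checked with the standard library alone. This is a sketch in Python 3 syntax; the sample string is hypothetical and only stands in for the real page, and the server is simulated by encoding that string locally.

```python
# Hypothetical sample markup standing in for the real page.
html = '<td width="3"></td>'

# Simulate a server sending the page as UTF-16 (with a byte order mark).
raw = html.encode("utf-16")

# Decoding with a single-byte codec leaves a NUL character next to every
# ASCII character; many terminals render those as blanks.
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("utf-16")

print(repr(wrong))
print(right)
```

If the bytes from the real page behave the same way, passing from_encoding="utf-16" to the BeautifulSoup constructor is another way to hand that information to the parser.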
