
Parsing a long HTML page with BeautifulSoup fails, leaving half the output unparsed

I am using the following script to parse the fund prices of a particular fund:

import pandas as pd
from bs4 import BeautifulSoup
from ghost import Ghost
ghost = Ghost()
# Load the fund details page in the headless browser.
page,resources = ghost.open('http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundDetailsOV&pri_fund_code=U44217')
# Run the page's own JavaScript functions and wait for each reload.
page,resources = ghost.evaluate("agree()", expect_loading=True)
page,resources = ghost.evaluate("MM_changeview('eINVCFundPriceDividend')", expect_loading=True)
# ghost.capture_to("hangseng.png")
# No parser is named here, so BeautifulSoup picks one itself.
soup = BeautifulSoup(page.content)
soup

The top half of the output in soup looks fine, but then the tags all turn into uppercase and BeautifulSoup fails to parse them, as shown below:

<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
</tr>
 T R   V A L I G N = " t o p "   a l i g n = " c e n t e r " &gt; 
 T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " &gt; F O N T   C L A S S = " C O N T E N T " &gt; 2 1 - 0 7 - 2 0 1 4 / F O N T &gt; / T D &gt; T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " &gt; F O N T   C L A S S = " C O N T E N T " &gt; 1 0 . 9 6 0 0 0 / F O N T &gt; / T D &gt; T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " &gt; F O N T   C L A S S = " C O N T E N T " &gt; 1 1 . 4 0 0 0 0 / F O N T &gt; / T D &gt; T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " &gt; F O N T   C L A S S = " C O N T E N T " &gt; 1 0 . 9 6 0 0 0 / F O N T &gt; / T D &gt; 
 / T R &gt; 

You can see that the output turns into garbage after the row dated 22-07-2014.

What happened?

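I found the solution in the related question "beautifulsoup spaced output": name the parser explicitly when building the soup from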

page.content
soup = BeautifulSoup(page.content,'html.parser')

Now it works perfectly.
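For reference, here is a minimal sketch of the same fix on a small, self-contained sample. The `SAMPLE_HTML` string and the `parse_price_rows` helper are hypothetical names made up for illustration. The sample only imitates the two rows quoted in the question (one lowercase, one uppercase); it is not the real `page.content`, and it does not reproduce the original failure. It only shows that with `'html.parser'` named explicitly, rows written with uppercase tags are read the same way as lowercase ones.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the two rows quoted in the question.
# The real page.content is much longer.
SAMPLE_HTML = """
<table>
<tr valign="top" align="center">
<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
</tr>
<TR VALIGN="top" align="center">
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">21-07-2014</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">11.40000</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>
</TR>
</table>
"""


def parse_price_rows(html):
    """Return the text of the LightGrey cells, one list per table row."""
    # Name the parser explicitly instead of letting BeautifulSoup choose one.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True)
                 for td in tr.find_all("td", class_="LightGrey")]
        if cells:  # skip rows without price cells
            rows.append(cells)
    return rows


rows = parse_price_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Since `pandas` is already imported in the original script, the resulting list of rows could then be passed to `pd.DataFrame(rows)` if a table is wanted.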
