Parsing a long HTML page with BeautifulSoup fails with half-parsed output
I am using the following script to parse the fund prices for a particular fund:
import pandas as pd
from bs4 import BeautifulSoup
from ghost import Ghost

# Load the page in a headless browser, because the data is reached through JavaScript calls.
ghost = Ghost()
page, resources = ghost.open('http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundDetailsOV&pri_fund_code=U44217')
# Call the page's own JavaScript functions: agree(), then switch the view to eINVCFundPriceDividend.
page, resources = ghost.evaluate("agree()", expect_loading=True)
page, resources = ghost.evaluate("MM_changeview('eINVCFundPriceDividend')", expect_loading=True)
# ghost.capture_to("hangseng.png")
soup = BeautifulSoup(page.content)
soup
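The first half of the soup output is fine, but after that the tags all turn uppercase, with a space between every character, and BeautifulSoup cannot parse them, as shown below: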
<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
</tr>
T R V A L I G N = " t o p " a l i g n = " c e n t e r " >
T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " > F O N T C L A S S = " C O N T E N T " > 2 1 - 0 7 - 2 0 1 4 / F O N T > / T D > T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " > F O N T C L A S S = " C O N T E N T " > 1 0 . 9 6 0 0 0 / F O N T > / T D > T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " > F O N T C L A S S = " C O N T E N T " > 1 1 . 4 0 0 0 0 / F O N T > / T D > T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " > F O N T C L A S S = " C O N T E N T " > 1 0 . 9 6 0 0 0 / F O N T > / T D >
/ T R >
As you can see, the output turns into garbage after the 22-07-2014 row.
What happened?
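I found the solution in a question about BeautifulSoup output with spaces between the characters: name the parser explicitly when building the soup from page.content: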
soup = BeautifulSoup(page.content,'html.parser')
Now it works perfectly.
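For anyone who wants to go one step further and pull the values out of the table, here is a minimal sketch of the same fix applied to a single row copied from the output above. The helper name extract_rows, the SAMPLE_HTML constant, and the wrapping <table> element are assumptions made for illustration; the real page likely has nested tables, so the row selection would probably need to be narrowed.

```python
from bs4 import BeautifulSoup

# One row copied from the output shown in the question.
# The surrounding <table> is added here only so the snippet is well-formed.
SAMPLE_HTML = (
    '<table><tr valign="top" align="center">'
    '<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td>'
    '<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>'
    '<td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td>'
    '<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>'
    '</tr></table>'
)


def extract_rows(html):
    """Hypothetical helper: return the CONTENT cell texts, grouped by table row."""
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [
            font.get_text(strip=True)
            for font in tr.find_all("font", class_="CONTENT")
        ]
        if cells:  # skip rows that carry no CONTENT cells
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

With the live page, the same function could presumably be called as extract_rows(page.content), and the resulting list of rows could then be handed to pandas.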