Python: parsing an HTML table generated by JavaScript

Question

I'm trying to scrape a table from the NYSE website ( http://www1.nyse.com/about/listed/IPO_Index.html ) into a pandas DataFrame. To do so, I have a setup like this:

import pandas
import requests
from bs4 import BeautifulSoup


def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    tables = soup.findAll('table')
    test = pandas.read_html(str(tables))

    return test  # list of DataFrames, one per <table> found

However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. In my browser's developer tools, the table looks like any other HTML table, with the usual tags. However, viewing the page source shows something like this instead:

<script language="JavaScript">

.
.
.

<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;

    if(year.length != 0) 
    {   

    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase"); 
    document.write ("</span></td></tr></table>"); 
    }  
</script>

Therefore, when my HTML parser looks for the table tag, all it can find is the if condition, with no proper tags below it that would indicate content. How can I scrape this table? Is there a tag I can search for instead of table that will reveal the content? Because the code is not in traditional HTML table form, how do I read it in with pandas? Do I have to parse the data manually?

Answer 1

In this case, you need something to run that JavaScript code for you.
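To see why a plain parser comes back empty-handed, here is a minimal sketch using only the standard library. The sample page and the TagCounter helper are made up for illustration; they are not the real NYSE source. A static HTML parser treats the body of a script element as text, so a table that only exists inside a document.write call is never reported as a tag.

```python
from html.parser import HTMLParser

# Made-up page for illustration: the only <table> is emitted by document.write.
PAGE = """<html><body>
<script>
document.write("<table><tr><td>2014 IPO Showcase</td></tr></table>");
</script>
</body></html>"""


class TagCounter(HTMLParser):
    """Count the start tags that a static parser actually sees."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE)
counter.close()

# The script tag is seen, but the table inside it is just script text.
print(counter.counts)
print("table" in counter.counts)
```

The table only becomes real markup once a JavaScript engine executes the script, which is what a browser does.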

One option here would be to use selenium:

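from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium drives a real browser, so the page's JavaScript actually runs.
driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# Grab the parent element of the nested table that holds the IPO rows.
table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

# read_html returns a list of DataFrames; take the first one.
df = pd.read_html(StringIO(table_html))[0]
print(df)

driver.quit()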

prints:

                                                    0        1          2   3
0                                                Name   Symbol        NaT NaN
1                       Performance Sports Group Ltd.      PSG 2014-06-20 NaN
2                           Century Communities, Inc.      CCS 2014-06-18 NaN
3                        Foresight Energy Partners LP     FELP 2014-06-18 NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...     LEMF 2014-01-08 NaN
80  EGShares TCW EM Short Term Investment Grade Bo...     SEMF 2014-01-08 NaN

[81 rows x 4 columns]
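If starting a browser is not practical, the array literal shown in the question could in principle be parsed by hand instead. The following is only a sketch of that alternative, not part of the selenium approach above. It assumes the page keeps the `var year = [[...]] ;` layout from the question and that the literal happens to be valid JSON. The sample source, the parse_year_array helper, and the column names are hypothetical, chosen for illustration.

```python
import json
import re
from datetime import datetime

# Sample source shortened from the snippet in the question (assumption: the
# real page uses the same `var year = [[...]] ;` layout).
SAMPLE = '''
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
</script>
'''

# Hypothetical column names, guessed from the look of the values.
COLUMNS = ("symbol", "name", "date", "link")


def parse_year_array(source):
    """Pull the `year` array literal out of page source as a list of dicts."""
    match = re.search(r"var\s+year\s*=\s*(\[\[.*?\]\])\s*;", source, re.DOTALL)
    if match is None:
        raise ValueError("no `var year = [[...]]` literal found")
    rows = json.loads(match.group(1))  # only works while the literal is valid JSON
    records = []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        # Collapse repeated spaces ("22 May  2014") before parsing the date.
        cleaned = " ".join(record["date"].split())
        record["date"] = datetime.strptime(cleaned, "%d %b %Y").date()
        records.append(record)
    return records


records = parse_year_array(SAMPLE)
for record in records:
    print(record)
```

The resulting list of dicts can be passed straight to pandas.DataFrame(records). This is more fragile than running the JavaScript: it breaks as soon as the literal contains anything JSON does not allow, such as single-quoted strings or trailing commas.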

Answer 1 (3, accepted, 2014-07-31 15:18:29)
