Web Scraping HTML Table Using Python
I think I'm really close, so any help would be appreciated. I'm trying to scrape the Index and Value data from the table titled "Stock Market Activity" on the NASDAQ homepage:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_index_prices(NASDAQ_URL):
    html = urlopen(NASDAQ_URL).read()
    soup = BeautifulSoup(html, "lxml")
    # Look up the first table with class "genTable thin" and walk its body rows
    for row in soup('table', {'class': 'genTable thin'})[0].tbody('tr'):
        tds = row('td')
        print("Index: %s, Value: %s" % (tds[0].text, tds[1].text))

get_index_prices('http://www.nasdaq.com/')
The error reads:
list index out of range
This table is rendered by JavaScript. If you look at the page source, before the JavaScript runs, the table looks like this:
<div id="HomeIndexTable" class="genTable thin">
<table id="indexTable" class="floatL marginB5px">
<thead>
<tr>
<th>Index</th>
<th>Value</th>
<th>Change Net / %</th>
</tr>
</thead>
<script type="text/javascript">
//<![CDATA[
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100","4648.83","-21.93","0.47","","4681.23","4648.83");
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 PMI","4675.49","4.73","0.10","","4681.98","4675.49");
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 AHI","4647.33","-1.50","0.03","","4670.76","4647.26");
nasdaqHomeIndexChart.storeIndexInfo("Russell 1000","1153.55","-4.85","0.42","","1161.51","1153.54");
nasdaqHomeIndexChart.storeIndexInfo("Russell 2000","1161.86","-3.76","0.32","","1167.65","1159.66");
nasdaqHomeIndexChart.storeIndexInfo("FTSE All-World ex-US*","271.15","-0.23","0.08","","272.33","271.13");
nasdaqHomeIndexChart.storeIndexInfo("FTSE RAFI 1000*","9045.08","-34.52","0.38","","9109.74","9044.91");
//]]>
nasdaqHomeIndexChart.displayIndexes();
</script>
</table>
</div>
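To see why the original lookup fails on this markup, here is a minimal offline check using only the standard library. It is a sketch: RAW_HTML is a trimmed copy of the source quoted above (the live page may differ), and TagAudit is a hypothetical helper name. It shows that the class "genTable thin" sits on the wrapping div rather than on the table, and that the static source contains no td cells at all, so a search for a table with that class returns an empty list and indexing it with [0] fails.

```python
from html.parser import HTMLParser

# Trimmed copy of the page source quoted above (assumption: the live page may differ).
RAW_HTML = """
<div id="HomeIndexTable" class="genTable thin">
<table id="indexTable" class="floatL marginB5px">
<thead><tr><th>Index</th><th>Value</th><th>Change Net / %</th></tr></thead>
<script type="text/javascript">
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
nasdaqHomeIndexChart.displayIndexes();
</script>
</table>
</div>
"""


class TagAudit(HTMLParser):
    """Hypothetical helper: record each <table>'s class and count <td> cells."""

    def __init__(self):
        super().__init__()
        self.table_classes = []
        self.td_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs
            self.table_classes.append(dict(attrs).get("class", ""))
        elif tag == "td":
            self.td_count += 1


audit = TagAudit()
audit.feed(RAW_HTML)

# Tables whose class attribute is exactly what the question searched for
matching_tables = [c for c in audit.table_classes if c == "genTable thin"]

print("table classes:", audit.table_classes)
print("tables with class 'genTable thin':", len(matching_tables))
print("td cells in static source:", audit.td_count)
```

Searching for the table by id="indexTable" would find it, but the rows still would not be there until the script has run.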
You can use Selenium for scraping, because Selenium can execute JavaScript.
I would go with Selenium, as below:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()

def get_index_prices(NASDAQ_URL):
    driver.get(NASDAQ_URL)
    # Wait until the JavaScript-built table is present in the DOM
    WebDriverWait(driver, 1000).until(EC.presence_of_all_elements_located((By.XPATH, "//table [@id='indexTable']")))
    table = driver.find_element(By.XPATH, "//table [@id='indexTable']")
    # Skip the header row, then read the name and value for each remaining row
    for row in table.find_elements(By.TAG_NAME, 'tr')[1:]:
        company = row.find_element(By.XPATH, ".//following::*[2]")
        value = row.find_element(By.XPATH, ".//following::*[3]")
        print("Index {0:<30} Value {1} ".format(company.text, value.text))
    driver.quit()

get_index_prices('http://www.nasdaq.com/')
It prints:
Index NASDAQ Value 5053.75
Index NASDAQ-100 (NDX) Value 4648.83
Index Pre-Market (NDX) Value 4675.49
Index After Hours (NDX) Value 4647.33
Index DJIA Value 17663.54
Index S&P 500 Value 2079.36
Index Russell 2000 Value 1161.86
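If starting a browser is too heavy, a different technique (not what the answer above does) is to read the values straight out of the storeIndexInfo(...) calls that appear in the raw page source. The following is only a sketch under assumptions: RAW_HTML is a short excerpt copied from the source shown earlier, parse_index_info is a hypothetical helper name, and it assumes the first two quoted arguments are the index name and value, as the quoted data suggests. The names come out as they appear in the script ("NASDAQ-100" rather than "NASDAQ-100 (NDX)"), since the display labels are produced by the page's own JavaScript.

```python
import re

# Excerpt of the raw page source quoted earlier (assumption: the live page may differ).
RAW_HTML = '''
<script type="text/javascript">
//<![CDATA[
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
//]]>
nasdaqHomeIndexChart.displayIndexes();
</script>
'''

# Capture the argument list of every storeIndexInfo(...) call
CALL_RE = re.compile(r'storeIndexInfo\((.*?)\);')
# Pull out each double-quoted argument; commas inside quotes are left intact
ARG_RE = re.compile(r'"([^"]*)"')


def parse_index_info(html):
    """Return (name, value) pairs from storeIndexInfo calls in the given HTML."""
    rows = []
    for args in CALL_RE.findall(html):
        fields = ARG_RE.findall(args)
        if len(fields) >= 2:
            # Assumption: field 0 is the index name, field 1 is its value
            rows.append((fields[0], fields[1]))
    return rows


rows = parse_index_info(RAW_HTML)
for name, value in rows:
    print("Index {0:<30} Value {1}".format(name, value))
```

For the live site, the html argument would be the text fetched with urlopen, and the pattern would need adjusting if the page's script changes.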