Web Scraping HTML Table Using Python

I think I'm really close, so any help would be appreciated. I'm trying to scrape the Index and Value data from the table titled "Stock Market Activity" on the NASDAQ homepage:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_index_prices(NASDAQ_URL):
    html = urlopen(NASDAQ_URL).read()
    soup = BeautifulSoup(html, "lxml")
    # Take the first table with class "genTable thin" and walk its body rows.
    for row in soup('table', {'class': 'genTable thin'})[0].tbody('tr'):
        tds = row('td')
        print("Index: %s, Value: %s" % (tds[0].text, tds[1].text))


get_index_prices('http://www.nasdaq.com/')

The error reads:

list index out of range

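This table is rendered by JavaScript. In the raw page source, before any JavaScript runs, the rows are not there yet, and the class "genTable thin" sits on the enclosing div rather than on the table, so soup('table', {'class': 'genTable thin'}) returns an empty list and [0] raises the error. The source looks like this: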

<div id="HomeIndexTable" class="genTable thin">
    <table id="indexTable" class="floatL marginB5px">
        <thead>
        <tr>
            <th>Index</th>
            <th>Value</th>
            <th>Change Net / %</th>
        </tr>
        </thead>
        <script type="text/javascript">
            //<![CDATA[

                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
                nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
                nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100","4648.83","-21.93","0.47","","4681.23","4648.83");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 PMI","4675.49","4.73","0.10","","4681.98","4675.49");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 AHI","4647.33","-1.50","0.03","","4670.76","4647.26");
                nasdaqHomeIndexChart.storeIndexInfo("Russell 1000","1153.55","-4.85","0.42","","1161.51","1153.54");
                nasdaqHomeIndexChart.storeIndexInfo("Russell 2000","1161.86","-3.76","0.32","","1167.65","1159.66");
                nasdaqHomeIndexChart.storeIndexInfo("FTSE All-World ex-US*","271.15","-0.23","0.08","","272.33","271.13");
                nasdaqHomeIndexChart.storeIndexInfo("FTSE RAFI 1000*","9045.08","-34.52","0.38","","9109.74","9044.91");
            //]]>
            nasdaqHomeIndexChart.displayIndexes();
        </script>
    </table>
</div>
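
Since the values already appear in the raw source as arguments to storeIndexInfo, one option that avoids a browser is to pull them out with a regular expression. This is only a minimal sketch, assuming the page still embeds the data in this form. parse_index_info is a hypothetical helper name, and the sample string is copied from two of the calls in the source above.

```python
import re

# Sample copied from the page source shown above (two of the ten calls).
SAMPLE_HTML = '''
<script type="text/javascript">
    nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
    nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
</script>
'''

# Matches one storeIndexInfo(...) call and captures its argument list.
CALL_RE = re.compile(r'storeIndexInfo\((.*?)\);')
# Matches each double-quoted argument inside a captured argument list.
ARG_RE = re.compile(r'"([^"]*)"')


def parse_index_info(html):
    """Return (index_name, value) pairs found in storeIndexInfo calls.

    Hypothetical helper: assumes the first argument is the index name
    and the second is its value, as in the source shown above.
    """
    pairs = []
    for call in CALL_RE.finditer(html):
        args = ARG_RE.findall(call.group(1))
        if len(args) >= 2:
            pairs.append((args[0], args[1]))
    return pairs


for name, value in parse_index_info(SAMPLE_HTML):
    print("Index: %s, Value: %s" % (name, value))
```

To try this against the live page, the decoded result of urlopen(...).read() could be passed in place of SAMPLE_HTML. A page without these calls simply yields an empty list rather than an exception.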

You can use Selenium for scraping: it drives a real browser, so it can execute the JavaScript that fills in the table.

I would go with Selenium, as below:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0
from selenium.webdriver.common.by import By


driver = webdriver.Firefox()

def get_index_prices(NASDAQ_URL):
    driver.get(NASDAQ_URL)
    # Wait (up to 1000 seconds) for the index table to be present in the DOM.
    WebDriverWait(driver, 1000).until(EC.presence_of_all_elements_located((By.XPATH, "//table[@id='indexTable']")))
    table = driver.find_element(By.XPATH, "//table[@id='indexTable']")
    # Skip the header row; each remaining row holds one index.
    for row in table.find_elements(By.TAG_NAME, 'tr')[1:]:
        company = row.find_element(By.XPATH, ".//following::*[2]")
        value = row.find_element(By.XPATH, ".//following::*[3]")
        print("Index  {0:<30} Value  {1} ".format(company.text, value.text))
    driver.quit()


get_index_prices('http://www.nasdaq.com/')

It prints:

Index  NASDAQ                         Value  5053.75 
Index  NASDAQ-100 (NDX)               Value  4648.83 
Index  Pre-Market (NDX)               Value  4675.49 
Index  After Hours (NDX)              Value  4647.33 
Index  DJIA                           Value  17663.54 
Index  S&P 500                        Value  2079.36 
Index  Russell 2000                   Value  1161.86 


 