How to scrape multiple pages with an unchanging URL - python

I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/

So far I've used Selenium to execute the JavaScript and scrape the table. However, my code currently only gets the first page. Is there a way to access the other 17 pages? When I click on "next page" the URL does not change, so I cannot just iterate over a different URL each time.

Below is my code so far:

from selenium import webdriver
import lxml
from bs4 import BeautifulSoup
import time

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d={}
    f = open('east.txt','a')
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0,25)]
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource, 'lxml')
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
    for i in lst:
        if i!=2:
            hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print(h)
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()]=[y.get_text() for y in cells]
    for key in d:
        ret=[]
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print(s)

if __name__ == "__main__":  
    scrape()

Alternatively, if I use webdriver.Chrome() instead of PhantomJS, can I click "next" in the browser and have the Python code run on each new page after every click?

This is not a trivial page to interact with: you need Explicit Waits to wait until the "loading" indicators become invisible.

Here is a complete, working implementation that you can use as a starting point:

# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print([cell.text for cell in row.find_elements_by_tag_name("td")])


# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))


while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print("Page #" + page_number)

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: replace this fixed pause with a proper wait

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print("------")

The idea is to run an endless loop that exits only when the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the next link, and wait until the "loading" spinner on top of the grid becomes invisible.
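The loop described above can also be sketched without a fixed sleep, by polling until the first table row actually changes after the click. This is only a minimal sketch, not part of the original answer: wait_until, scrape_all_pages and the four callbacks are hypothetical names, and it is written in plain Python so it runs without a browser. In a real Selenium script, WebDriverWait(...).until(...) plays the role of wait_until, and the callbacks would wrap the driver calls shown above.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


def scrape_all_pages(read_rows, has_next, go_next, first_row_id, timeout=10.0):
    """Collect rows from every page of a table that reloads in place.

    All four arguments are hypothetical callbacks supplied by the caller:
      read_rows()    -> list of rows on the current page
      has_next()     -> False once the "next" link is disabled
      go_next()      -> clicks the "next" link
      first_row_id() -> something identifying the first row, or None while loading
    """
    rows = []
    while True:
        rows.extend(read_rows())
        if not has_next():
            return rows
        before = first_row_id()
        go_next()
        # Wait for the table content to change instead of sleeping blindly.
        wait_until(lambda: first_row_id() not in (None, before), timeout=timeout)
```

Waiting for the content to change avoids the case where the "loading" indicator has not yet appeared when the invisibility check runs, which is presumably what the fixed pause works around.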

I found another way to do this, in C# with ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and make chromedriver.exe available to it.

In your code, you can navigate to the URL like this:

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // find your element using FindElementByXPath
    // var element = driver.FindElementByXPath(--Xpath--).Text;
}

Finding the XPath is easy: download a scraper or XPath extension for Chrome from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath of the next button and use it in a loop in your code to navigate through the pages. Hope this helps.
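To show what such an XPath for the next button can look like, here is a small Python sketch that runs without a browser. The pager markup is invented for illustration and is not copied from the site; only the link text "下一页" and the "nolink" class come from the Python answer above. It uses the standard library's ElementTree, which supports a limited XPath subset (the [.='text'] predicate needs Python 3.7 or later). In Selenium the equivalent expression would be //a[.='下一页'].

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup, invented for illustration; the real page may differ.
PAGER_MIDDLE = """
<div id="pager">
  <a href="#">上一页</a>
  <a href="#">2</a>
  <a href="#">下一页</a>
</div>
"""

PAGER_LAST = """
<div id="pager">
  <a href="#">上一页</a>
  <a href="#">18</a>
  <a class="nolink">下一页</a>
</div>
"""

# XPath for "an <a> element whose text is exactly 下一页 (next page)".
NEXT_XPATH = ".//a[.='下一页']"


def find_next_link(pager_xml):
    """Return the 'next page' link element, or None if there is none."""
    root = ET.fromstring(pager_xml.strip())
    return root.find(NEXT_XPATH)


def has_next_page(pager_xml):
    """True if a 'next page' link exists and is not marked as disabled."""
    link = find_next_link(pager_xml)
    if link is None:
        return False
    return "nolink" not in link.get("class", "").split()
```

With these samples, has_next_page(PAGER_MIDDLE) is true and has_next_page(PAGER_LAST) is false, which is the same exit condition the Python loop above relies on.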
