How to scrape multiple pages with an unchanging URL - python

I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/

So far I've used Selenium to execute the JavaScript and scrape the table. However, my code currently only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click on "next page" the URL does not change, so I cannot simply iterate over a different URL each time.

Below is my code so far:

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = range(25)  # column indices; column 2 is skipped below
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource, 'lxml')  # name the parser explicitly
    # build the header line from the last header row
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
    for i in lst:
        if i != 2:
            hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    # collect each data row, keyed by its first cell
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d[key][i])
        s = '|'.join(ret)
        print s
    driver.quit()  # stop the PhantomJS process

if __name__ == "__main__":
    scrape()

Or, if I used webdriver.Chrome() instead of PhantomJS, would it be possible to click "next" through the browser and then have the Python run on the new page after each click?

This is not a trivial page to interact with, and it requires the use of Explicit Waits to wait for the invisibility of the "loading" indicators.

Here is a complete and working implementation that you may use as a starting point:

# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]


# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))


while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"

The idea is to have an endless loop that we exit only when the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the "next" link, and wait for the invisibility of the "loading" spinner that appears on top of the grid.
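As for the webdriver.Chrome() part of the question: the same loop works with a real browser, and only the driver construction changes. A minimal sketch, assuming chromedriver is installed and on your PATH:

from selenium import webdriver

# swap Chrome in for PhantomJS; the rest of the loop above stays the same
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://data.eastmoney.com/xg/xg/")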

I found another way to do this in C# using ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and reference chromedriver.exe.

In your code, you can navigate to the URL using:

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // find your element by its XPath
    // var element = driver.FindElementByXPath("--Xpath--").Text;
}

Finding the XPath is easy: just install a scraper or XPath extension in Chrome from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath of the "next" button and use it in your code to page through the results in a loop, as in the sketch below. Hope this helps.
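Here is a minimal C# sketch of that loop. The table id and the "next" link markers are assumptions carried over from the Python answer above, not verified against the page:

using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class EastmoneyScraper
{
    static void Main()
    {
        using (var driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://data.eastmoney.com/xg/xg/");
            while (true)
            {
                // print every row of the current page (table id taken from the Python answer)
                foreach (var row in driver.FindElements(By.XPath("//table[@id='dt_1']//tr")))
                {
                    Console.WriteLine(row.Text);
                }

                // the "next" link gets a "nolink" class on the last page
                var next = driver.FindElement(By.LinkText("下一页"));
                if (next.GetAttribute("class").Contains("nolink"))
                    break;

                next.Click();
                Thread.Sleep(2000); // crude wait, mirroring the Python answer's sleep
            }
        }
    }
}

In production you would replace the Thread.Sleep with a WebDriverWait on the loading indicator, as the Python answer does.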

