How to scrape multiple pages with an unchanging URL - python

I'm trying to scrape this website: http://data.eastmoney.com/xg/xg/

So far I've used Selenium to execute the JavaScript and scrape the table. However, my code currently only gets me the first page. I was wondering if there's a way to access the other 17 pages, because when I click on "next page" the URL does not change, so I cannot simply iterate over a different URL each time.

Below is my code so far:

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = range(25)  # column indices; column 2 is skipped below
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource, 'lxml')  # name the parser explicitly
    # build the header line from the last header row
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
    for i in lst:
        if i != 2:
            hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    # collect each data row, keyed by its first cell
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d[key][i])
        s = '|'.join(ret)
        print s
    driver.quit()  # stop the PhantomJS process

if __name__ == "__main__":
    scrape()

Or, if I used webdriver.Chrome() instead of PhantomJS, would it be possible to click "next" through the browser and then have the Python run on the new page after each click?

This is not a trivial page to interact with, and it requires the use of Explicit Waits to wait for the invisibility of the "loading" indicators.

Here is a complete and working implementation that you may use as a starting point:

# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]


# initial wait for results
WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))


while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"

The idea is to have an endless loop that we exit only when the "Next Page" link becomes disabled (no more pages available). On every iteration, get the table results (printed to the console for the sake of the example), click the "next" link, and wait for the invisibility of the "loading" spinner that appears on top of the grid.
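As for the webdriver.Chrome() part of the question: the same loop works with a real browser, and only the driver construction changes. A minimal sketch, assuming chromedriver is installed and on your PATH:

from selenium import webdriver

# swap Chrome in for PhantomJS; the rest of the loop above stays the same
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://data.eastmoney.com/xg/xg/")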

I found another way to do this in C# using ChromeDriver and Selenium. All you have to do is add the Selenium references to your project and reference chromedriver.exe.

In your code, you can navigate to the URL using:

using (var driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl(pathofurl);
    // find your element by its XPath
    // var element = driver.FindElementByXPath("--Xpath--").Text;
}

Finding the XPath is easy: just install a scraper or XPath extension in Chrome from the Chrome Web Store. Once you get the hang of XPaths for elements, you can find the XPath of the "next" button and use it in your code to page through the results in a loop, as in the sketch below. Hope this helps.
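Here is a minimal C# sketch of that loop. The table id and the "next" link markers are assumptions carried over from the Python answer above, not verified against the page:

using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class EastmoneyScraper
{
    static void Main()
    {
        using (var driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://data.eastmoney.com/xg/xg/");
            while (true)
            {
                // print every row of the current page (table id taken from the Python answer)
                foreach (var row in driver.FindElements(By.XPath("//table[@id='dt_1']//tr")))
                {
                    Console.WriteLine(row.Text);
                }

                // the "next" link gets a "nolink" class on the last page
                var next = driver.FindElement(By.LinkText("下一页"));
                if (next.GetAttribute("class").Contains("nolink"))
                    break;

                next.Click();
                Thread.Sleep(2000); // crude wait, mirroring the Python answer's sleep
            }
        }
    }
}

In production you would replace the Thread.Sleep with a WebDriverWait on the loading indicator, as the Python answer does.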

