
How to extract data from dynamic websites like Flipkart using selenium and Scrapy?

Flipkart.com shows only 15 to 20 results on the first page and loads more results as you scroll. Scrapy extracts the results of the first page successfully, but not those of the following pages. I tried using Selenium for this, but without success. Here is my code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from flipkart.items import FlipkartItem
from scrapy.spider import BaseSpider
from selenium import webdriver

class FlipkartSpider(BaseSpider):
    name = "flip1"
    allowed_domains = ["flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/beauty-and-personal-care/personal-care-appliances/hair-dryers/pr?sid=t06,79s,mh8&otracker=nmenu_sub_electronics_0_Hair%20Dryers"
]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        sel = Selector(response)
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('//div[@id="show-more-results"]')
            try:

                sites = sel.select('//div[@class="gd-col gu12 browse-product fk-inf-scroll-item"] | //div[@class="pu-details lastUnit"]')
                for site in sites:
                    item = FlipkartItem()
                    item['title'] = site.select('div//a[@class="lu-title"]/text() | div[1]/a/text()').extract()
                    item['price'] = site.select('div//div[@class="pu-price"]/div/text() | div//div[@class="pu-final"]/span/text()').extract()
                    yield item
                next.wait_for_page_to_load("30")
            except:
                break
            self.driver.close()

My items.py is:

import scrapy
class FlipkartItem(scrapy.Item):
    title=scrapy.Field()
    price=scrapy.Field()

And the following output I get is only for 15 items:

[{"price": ["Rs. 599"], "title": ["\n Citron Elegant 1400 W HD001 Hair Dryer (Pink)\n "]},
{"price": ["Rs. 799"], "title": ["\n Citron Vogue 1800 W HD002 Hair Dryer (White)\n "]},
{"price": ["Rs. 645"], "title": ["\n Philips HP8100/00 Hair Dryer (Blue)\n "]},
{"price": ["Rs. 944"], "title": ["\n Philips HP8111/00 Hair Dryer\n "]},
{"price": ["Rs. 171"], "title": ["\n Nova Professional With 2 Speed NV-1290 Hair Dryer (Pink...\n "]},
{"price": ["Rs. 175"], "title": ["\n Nova NHD 2840 Hair Dryer\n "]},
{"price": ["Rs. 775"], "title": ["\n Philips HP 8112 Hair Dryer\n "]},
{"price": ["Rs. 1,925"], "title": ["\n Philips HP8643/00 Miss Fresher's Pack Hair Straightener...\n "]},
{"price": ["Rs. 144"], "title": ["\n Nova Foldable N-658 Hair Dryer (White, Pink)\n "]},
{"price": ["Rs. 1,055"], "title": ["\n Philips HP8100/46 Hair Dryer\n "]},
{"price": ["Rs. 849"], "title": ["\n Panasonic EH-ND12-P62B Hair Dryer (Pink)\n "]},
{"price": ["Rs. 760"], "title": ["\n Panasonic EH-ND11 Hair Dryer (White)\n "]},
{"price": ["Rs. 1,049"], "title": ["\n Panasonic EH-ND13-V Hair Dryer (Violet)\n "]},
{"price": ["Rs. 1,554"], "title": ["\n Philips 1600 W HP4940 Hair Dryer (White & Light Pink)\n "]},
{"price": ["Rs. 2,008"], "title": ["\n Philips Kerashine HP8216/00 Hair Dryer\n "]}]
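As an aside, the scraped titles carry stray whitespace and the prices are strings like "Rs. 1,925". A small post-processing step (a sketch, not part of the original spider; `clean_item` is a made-up helper name) can normalize them:

```python
def clean_item(item):
    """Strip whitespace from the title and parse an 'Rs. 1,925'-style price into an int."""
    title = item["title"][0].strip()
    # Remove the 'Rs.' prefix and thousands separators before converting.
    price = int(item["price"][0].replace("Rs.", "").replace(",", "").strip())
    return {"title": title, "price": price}

cleaned = clean_item({"price": ["Rs. 1,925"],
                      "title": ["\n Philips HP8643/00 Miss Fresher's Pack Hair Straightener...\n "]})
print(cleaned)
```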

You have to force the webdriver to load more results. In order to interact with the other results, the webdriver needs to scroll the page until the elements appear.

The code for scrolling is:

driver.execute_script("window.scrollTo(0, %d);" % location['y'])

To decide where to scroll, you can find an element in the lower part of the page (for example the footer) and keep scrolling towards it. To get the coordinates of that element you can use the WebElement property `location`:

driver = webdriver.Firefox()
down = driver.find_element_by_xpath("//someXpath")
location = down.location
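Putting the two pieces together (a sketch; `build_scroll_script` is an illustrative helper, not part of the original answer): `location` is a plain dict with `'x'` and `'y'` keys, so the JavaScript call can be built from it as a string before handing it to the driver:

```python
def build_scroll_script(location):
    """Build the JavaScript scroll call from a WebElement's .location dict."""
    return "window.scrollTo(0, %d);" % location["y"]

# Selenium's element.location returns a dict such as {'x': 0, 'y': 4231}.
script = build_scroll_script({"x": 0, "y": 4231})
print(script)  # window.scrollTo(0, 4231);
```

With a real driver this string would be passed to `driver.execute_script(script)`, repeating until no new items appear.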

You can use JavaScript to scroll down the page.

The following code will scroll the page by 10000 pixels in both the x and y directions. Since 10000 is a large number, it takes you to the bottom of the page. Once you are at the bottom, Flipkart fires an AJAX request to load more items.

window.scrollBy(10000,10000);

I am not sure how we can do that in Scrapy alone, but using Selenium it is easy. Here is the code (Java bindings; in Python the equivalent is `driver.execute_script("window.scrollBy(10000,10000);")`):

((JavascriptExecutor) driver).executeScript("window.scrollBy(10000,10000);");

I managed it differently; see my code below for reference. It works fine for the complete site.

import time

from scrapy.spider import BaseSpider
from scrapy.http import TextResponse
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

from flipkart.items import FlipkartItem


class FlipkartSpider(BaseSpider):
    name = "flip1"
    allowed_domains = ["flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/tablets/pr?sid=tyy%2Chry&q=mobile&ref=b8b64676-065a-445c-a6a1-bc964d5ff938"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # Keep scrolling (and clicking "show more results") until the
        # "no more results" marker becomes visible.
        while True:
            self.driver.execute_script("window.scrollTo(10000000,10000000)")
            self.driver.set_page_load_timeout(10000)
            try:
                show = self.driver.find_element_by_xpath('//div[@id="show-more-results"]').value_of_css_property('display')
                if show == "block":
                    self.driver.find_element_by_xpath('//div[@id="show-more-results"]').click()
                no_more = self.driver.find_element_by_xpath('//*[@id="no-more-results" and @class="dont-show"]').value_of_css_property('display')
                if no_more == "block":
                    break
                time.sleep(5)
                self.driver.execute_script("window.scrollTo(10000000,10000000)")
                self.driver.set_page_load_timeout(10000)
            except NoSuchElementException:
                break
        # Hand the fully loaded page back to Scrapy for extraction.
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        sites = response.xpath('//div[@class="gd-col gu12 browse-product fk-inf-scroll-item"] | //div[@class="pu-details lastUnit"] | //div[@class="pu-visual-section"]')
        for site in sites:
            item = FlipkartItem()
            item['title'] = site.xpath('div//a[@class="lu-title"]/text() | div[1]/a/text()').extract()
            item['price'] = site.xpath('div//div[@class="pu-price"]/div/text() | div//div[@class="pu-final"]/span/text()').extract()
            item['rating'] = site.xpath('div[@class="pu-rating"]/div/@title').extract()
            item['image'] = site.xpath('a/img/@src').extract()
            item['link'] = site.xpath('a/@href').extract()
            yield item
        self.driver.close()
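The key trick in the answer above is wrapping `driver.page_source` in a `TextResponse` so Scrapy's XPath selectors run over the fully loaded page. The extraction step itself can be illustrated without Scrapy or a browser (a standard-library-only sketch; the HTML fragment is made up to mimic the `lu-title` markup, not real Flipkart output):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text content of every <a class="lu-title"> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "lu-title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

# Made-up fragment standing in for driver.page_source.
page_source = '''
<div class="pu-details lastUnit">
  <a class="lu-title"> Philips HP8100/00 Hair Dryer (Blue) </a>
</div>
<div class="pu-details lastUnit">
  <a class="lu-title"> Nova NHD 2840 Hair Dryer </a>
</div>
'''
p = TitleExtractor()
p.feed(page_source)
print(p.titles)
```

In the real spider this role is played by `response.xpath(...)` on the `TextResponse`, which is far more convenient than hand-rolled parsing; the sketch only shows what that step is doing.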
