
Scrape websites using scrapy

I am trying to scrape a website with scrapy, but I am having a problem scraping all the products from this site because it uses endless scrolling...

I can scrape the data below for only 52 items, but there are 3824 items.

hxs.select("//span[@class='itm-Catbrand strong']").extract()
hxs.select("//span[@class='itm-price ']").extract()
hxs.select("//span[@class='itm-title']").extract()

If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the whole item list, but it won't filter further. How do I scrape all the items?

I have tried this, but I get the same result. Where am I wrong?

def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body
    for n in [2,3,4,5,6]:            
    req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                      headers = {"Referer": "http://www.jabong.com/men/shoes/",
                                 "X-Requested-With": response.header['X-Requested-With']})
    return req 

As you have guessed, this website uses JavaScript to load more items when you scroll the page.

Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the JavaScript included in the page performs the following requests to load more items:

GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...

The web server responds with documents of the following type:

<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
  <div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>

                                    <div class="itm-qlInsert tooltip-qlist  highlightStar"
                     onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog');
                                             return false;" >
                                              <div class="starHrMsg">
                         <span class="starHrMsgArrow">&nbsp;</span>
                         Save for later                         </div>
                                        </div>
                <a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html" 
                                    onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
                    <span class="lazyImage">
                        <span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
                            <noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
                        </span>                            
                    </span>

                                            <span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>                       
                                            <span class="itm-Catbrand strong">Phosphorus</span>
                    <span class="itm-title">
                                                                                Black Moccasins                        </span>

These documents contain more items.
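
To illustrate what can be pulled out of such a fragment, here is a minimal sketch that collects the brand and title text from it. It uses only Python's standard-library HTMLParser so that it runs without scrapy installed; the class name ItemFieldParser, the helper extract_fields and the shortened sample string are hypothetical and only for illustration. Inside a spider, the XPath expressions from the question would normally be used on the response instead.

```python
from html.parser import HTMLParser


class ItemFieldParser(HTMLParser):
    """Collect the text of <span> elements carrying a wanted CSS class."""

    # CSS class (as seen in the fragment above) -> field name (hypothetical)
    WANTED = {"itm-Catbrand": "brand", "itm-title": "title"}

    def __init__(self):
        super().__init__()
        self.fields = {"brand": [], "title": []}
        self._current = None  # field currently being read, if any
        self._depth = 0       # nested <span> depth inside the wanted span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in self.WANTED.items():
            if css_class in classes:
                self._current, self._depth, self._buffer = field, 0, []
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth:
            self._depth -= 1
            return
        # Collapse the heavy whitespace the server puts around the text.
        text = " ".join("".join(self._buffer).split())
        self.fields[self._current].append(text)
        self._current = None


def extract_fields(fragment):
    parser = ItemFieldParser()
    parser.feed(fragment)
    parser.close()
    return parser.fields


# Shortened copy of the fragment shown above (assumption: same structure).
sample = """
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
  <span class="itm-Catbrand strong">Phosphorus</span>
  <span class="itm-title">
        Black Moccasins      </span>
</li>
"""
fields = extract_fields(sample)
```

Matching on individual class names rather than the whole attribute string keeps the lookup working when the server adds extra classes or trailing spaces, as in class="itm-price ".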

So, to get the full list of items, you will have to return Request objects from the parse method of your Spider (see the Spider class documentation) to tell scrapy that it should load more data:

import re
from scrapy.http import Request

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"

def parse(self, response):
    # ... Extract items in the page using extractors

    # n is the number of the next "page" to parse. You can get it from
    # response.url by extracting the number at the end and adding 1
    # (the first page has no ?page= parameter, so the next one is 2).
    match = re.search(r"[?&]page=(\d+)", response.url)
    n = int(match.group(1)) + 1 if match else 2

    # It is VERY IMPORTANT to set the Referer and X-Requested-With headers
    # here because that's how the website detects if the request was made
    # by javascript or directly by following a link.
    req = Request(url=BASE_URL + "?page=" + str(n),
                  headers={"Referer": BASE_URL,
                           "X-Requested-With": "XMLHttpRequest"})
    return req  # and your items
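
If you would rather know in advance how many pages to request, the numbers from the question give a rough estimate. The sketch below assumes that every page returns 52 items, like the first one did; that is an assumption, not something the site confirms. The helper name page_urls is hypothetical.

```python
import math

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def page_urls(total_items, items_per_page, base_url=BASE_URL):
    """Return the ?page=N URLs needed after the first page.

    Assumes every page holds items_per_page items (unverified).
    """
    last_page = math.ceil(total_items / items_per_page)
    return [f"{base_url}?page={n}" for n in range(2, last_page + 1)]


# 3824 items at 52 per page: ceil(3824 / 52) = 74 pages in total,
# so pages 2 to 74 remain after the initial page.
urls = page_urls(3824, 52)
```

A spider could create one Request per URL with the same two headers as above. Following the pages one at a time and stopping when a response contains no items is the safer option if the page size turns out to vary.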

Oh, and by the way (in case you want to test), you can't just load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website will redirect you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/ ) if the X-Requested-With header is different from XMLHttpRequest.
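
To test outside the browser, one option is to build the request by hand with the two headers set. This is a minimal sketch using Python's standard-library urllib; the helper name build_ajax_request is hypothetical, and the block only constructs the request without sending it.

```python
from urllib.request import Request


def build_ajax_request(page_url, referer):
    """Build a GET request that looks like the page's own JavaScript call."""
    return Request(
        page_url,
        headers={
            "Referer": referer,
            "X-Requested-With": "XMLHttpRequest",
        },
    )


req = build_ajax_request(
    "http://www.website-your-are-crawling.com/men/shoes/?page=2",
    "http://www.website-your-are-crawling.com/men/shoes/",
)
# To actually send it and inspect the returned fragment:
#   from urllib.request import urlopen
#   body = urlopen(req).read()
```

If the response is the list fragment rather than the full page, the headers were accepted.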
