
Scrapy DEBUG: Crawled (200)

I'm trying to scrape a web page using Scrapy and XPath selectors. I've tested my XPath selectors in Chrome. However, my spider seems to crawl zero pages and scrape 0 items. What can I do to correct this? I get the following output when crawling:

$ scrapy crawl stack
2015-08-24 21:11:55 [scrapy] INFO: Scrapy 1.0.3 started (bot: stack)
2015-08-24 21:11:55 [scrapy] INFO: Optional features available: ssl, http11
2015-08-24 21:11:55 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'BOT_NAME': 'stack'}
2015-08-24 21:11:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-24 21:11:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 21:11:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 21:11:56 [scrapy] INFO: Enabled item pipelines:
2015-08-24 21:11:56 [scrapy] INFO: Spider opened
2015-08-24 21:11:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-24 21:11:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-24 21:11:56 [scrapy] DEBUG: Crawled (200) <GET http://www.cofman.com/search.php?country=DK#areaid=100001&areatxt=Danmark&country=DK&zoom=6&startDate=2015-08-29&endDate=2015-09-05&fuzzy=false> (referer: None)
2015-08-24 21:11:56 [scrapy] DEBUG: Scraped from <200 http://www.cofman.com/search.php?country=DK>
{'by': [], 'husnr': [], 'periode': [], 'pris': []}
2015-08-24 21:11:56 [scrapy] INFO: Closing spider (finished)
2015-08-24 21:11:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6059,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 24, 19, 11, 56, 875000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 8, 24, 19, 11, 56, 390000)}
2015-08-24 21:11:56 [scrapy] INFO: Spider closed (finished)

This is my spider:

from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["cofman.com"]
    start_urls = [
        "http://www.cofman.com/search.php?country=DK#areaid=100001&areatxt=Danmark&country=DK&zoom=6&startDate=2015-08-29&endDate=2015-09-05&fuzzy=false",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//*[@id="content"]/div[4]')

        for question in questions:
            item = StackItem()
            item['husnr'] = question.xpath(
            '//*[@id="resultListning"]/div/div/div[1]/a/small').extract()
            item['pris'] = question.xpath(
            '//*[@id="resultListning"]/div/div/div[5]/div/div[1]//*/span[@class="formatted_price"]').extract()
            item['by'] = question.xpath(
            '//*[@id="resultListning"]/div/div/div[1]/a/text()').extract()
            item['periode'] = question.xpath(
            '//*[@id="mapNavigation"]/table/tbody/tr/td[1]/div/text()').extract()
            yield item

And lastly my items.py:

from scrapy.item import Item, Field


class StackItem(Item):
    husnr = Field()
    pris = Field()
    by = Field()
    periode = Field()

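Scrapy is working fine: the log shows that the request returned a 200 response and that one item was scraped, just with every field empty. The problem is that the page you are trying to scrape fetches its content via JavaScript after the initial HTML has loaded. Scrapy only downloads that initial HTML and does not execute JavaScript, so it never receives the content you want to scrape. Your XPath expressions match in Chrome because Chrome has already run the page's scripts. Note also that everything after the # in your start URL is a fragment, which is never sent to the server; that is why the log reports the response as coming from http://www.cofman.com/search.php?country=DK.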

>>> Selector(response).xpath('//div[@id="resultListning"]').extract()
[u'<div id="resultListning"></div>']
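A quick way to see which part of the start URL actually reaches the server is to split off the fragment. This is only a minimal sketch using the standard library; it assumes the fragment is laid out like a query string, which appears to be the case for the URL in the question, and the variable names are mine.

```python
from urllib.parse import parse_qs, urldefrag

# The start URL from the question, split across lines for readability.
START_URL = (
    "http://www.cofman.com/search.php?country=DK"
    "#areaid=100001&areatxt=Danmark&country=DK&zoom=6"
    "&startDate=2015-08-29&endDate=2015-09-05&fuzzy=false"
)

# Only `requested_url` is sent over HTTP; the fragment stays on the client,
# where the page's JavaScript can read it.
requested_url, fragment = urldefrag(START_URL)

# Assumption: the fragment is formatted like a query string, so parse_qs can read it.
search_params = {key: values[0] for key, values in parse_qs(fragment).items()}

print(requested_url)
print(search_params)
```

The search parameters (area, dates and so on) all live in the fragment, so they are presumably consumed by the page's scripts rather than by search.php itself.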

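You'll need to either find out where the page is getting its data from (for example, by watching the requests it makes in the network tab of your browser's developer tools) and request that source directly, or use one of the various methods of rendering JavaScript, such as driving a real or headless browser.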
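If the page turns out to load its results from a JSON endpoint, the first option could look roughly like the sketch below. Everything specific here is an assumption for illustration: the endpoint URL, the "results" key and the per-record field names are hypothetical and must be replaced with whatever the real request and response look like in your browser's network tab. The sample body is made-up placeholder data, not output from the site.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the request URL seen in the network tab.
DATA_ENDPOINT = "http://example.com/search-data"


def build_data_url(params):
    """Build the URL for the (assumed) data endpoint from the search parameters."""
    return DATA_ENDPOINT + "?" + urlencode(params)


def parse_results(body):
    """Turn an (assumed) JSON response body into dicts shaped like StackItem."""
    payload = json.loads(body)
    return [
        {
            "husnr": record.get("husnr"),
            "pris": record.get("pris"),
            "by": record.get("by"),
            "periode": record.get("periode"),
        }
        # Assumption: the records sit under a top-level "results" key.
        for record in payload.get("results", [])
    ]


# Made-up placeholder payload, used only to exercise parse_results.
sample_body = (
    '{"results": [{"husnr": "X1", "pris": "0", '
    '"by": "Placeholder", "periode": "2015-08-29 to 2015-09-05"}]}'
)

data_url = build_data_url({"country": "DK", "startDate": "2015-08-29"})
items = parse_results(sample_body)
print(data_url)
print(items)
```

In the spider, start_urls would then point at the data URL instead of search.php, and parse would pass the response body to a function like parse_results and yield one item per record.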


 