
python scrapy ignoring start_url

I am trying to scrape a website, but the start_url I am using is not crawled. I put in the line print(response) to see what was happening, and the output I get is below (it seems that the start_url is ignored in favour of https://www.purplebricks.com/search?ref=header ):

c:\Users\andrew\Documents\Big Data Project\Data Collectors\PurpleBricks\Purplebricks>scrapy crawl purplebricks
2017-02-05 22:18:31 [scrapy] INFO: Scrapy 1.0.5 started (bot: Purplebricks)
2017-02-05 22:18:31 [scrapy] INFO: Optional features available: ssl, http11
2017-02-05 22:18:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Purplebricks.spiders', 'SPIDER_MODULES': ['Purplebricks.spiders'], 'BOT_NAME': 'Purplebricks'}
2017-02-05 22:18:31 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-02-05 22:18:31 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-02-05 22:18:31 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-02-05 22:18:31 [scrapy] INFO: Enabled item pipelines:
2017-02-05 22:18:31 [scrapy] INFO: Spider opened
2017-02-05 22:18:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-05 22:18:31 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-05 22:18:32 [scrapy] DEBUG: Crawled (200) <GET https://www.purplebricks.com/search?ref=header#?q&location=se3&page=1&latitude=51.4688273310245&longitude=0.0176656312203414&searchType=ForSale&sortBy=1> (referer: None)
<200 https://www.purplebricks.com/search?ref=header>
2017-02-05 22:18:32 [scrapy] INFO: Closing spider (finished)
2017-02-05 22:18:32 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 13413,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 5, 22, 18, 32, 326000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 2, 5, 22, 18, 31, 913000)}
2017-02-05 22:18:32 [scrapy] INFO: Spider closed (finished)

The code I am using is below. I can sort of get it to work with Selenium, though I need to force the driver to pull up the webpage - even with Selenium it will not open the start_url (a rough sketch of that workaround follows the spider code).

import scrapy

from Purplebricks.items import PurplebricksItem

class purplebricksSpider(scrapy.Spider):
    name = "purplebricks"
    allowed_domains = ["purplebricks.com"]
    start_urls = ["https://www.purplebricks.com/search?ref=header#?q&location=se3&page=1&latitude=51.4688273310245&longitude=0.0176656312203414&searchType=ForSale&sortBy=1",
                  ]

    def parse(self, response):
        print(response)
        for sel in response.xpath('//*[@class="row properties"]'):
            item = PurplebricksItem()
            prices = sel.xpath('//*/p/span[@data-bind="formatCurrency: property.marketPrice, roundDecimalsTo: 0"]/text()').extract()
            prices = [price.strip() for price in prices]
            property_ids = sel.xpath('//*[@class="title"]/div[@class="primary"]/a/@href').re(r'(?<=property-for-sale\/)(.*?)(?=\/#)')
            property_ids = [property_id.strip() for property_id in property_ids]
            addresses = sel.xpath('//*[@class="title"]/div[@class="primary"]/a/p[@class="type"]/text()').extract()
            addresses = [address.strip() for address in addresses]
            descriptions = sel.xpath('//*[@class="title"]/div[@class="primary"]/a/p[@class="address"]/text()').extract()
            descriptions = [description.strip() for description in descriptions]

            result = zip(prices, property_ids, addresses, descriptions)
            for price, property_id, address, description in result:
                item['price'] = price
                item['property_id'] = property_id
                item['description'] = description
                item['address'] = address
                yield item
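For reference, a minimal sketch of the Selenium workaround mentioned above - this is an assumption of how it might look rather than the code actually used; it simply forces the driver to load the full URL (fragment included, since a real browser keeps it and runs the page's JavaScript) and feeds the rendered HTML to the same kind of XPath extraction:

from selenium import webdriver
from scrapy.selector import Selector

FULL_URL = ("https://www.purplebricks.com/search?ref=header#?q&location=se3&page=1"
            "&latitude=51.4688273310245&longitude=0.0176656312203414"
            "&searchType=ForSale&sortBy=1")

driver = webdriver.Firefox()
driver.get(FULL_URL)  # the browser keeps the '#...' part, so the client-side search runs
sel = Selector(text=driver.page_source)
prices = sel.xpath('//*/p/span[@data-bind="formatCurrency: property.marketPrice, roundDecimalsTo: 0"]/text()').extract()
driver.quit()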

Any views as to what is occurring here?

Remove the # from the url. It has special meaning: the characters to its right normally describe a location on the current page, and Scrapy does not check what is on the right side of the # - it strips it all off before making the request.
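A minimal sketch of that change, assuming the parameters that currently sit after #? can be passed as an ordinary query string (whether purplebricks.com honours them server-side is an assumption, since the original page fills in the results with JavaScript):

import scrapy

class PurplebricksSpider(scrapy.Spider):
    name = "purplebricks"
    allowed_domains = ["purplebricks.com"]
    # Hypothetical rewrite: everything after '#?' moved into the real query
    # string, so nothing is lost when the fragment is dropped.
    start_urls = [
        "https://www.purplebricks.com/search"
        "?location=se3&page=1"
        "&latitude=51.4688273310245&longitude=0.0176656312203414"
        "&searchType=ForSale&sortBy=1",
    ]

    def parse(self, response):
        print(response)  # the response URL now keeps the full query string

If the listings are only rendered client-side, the HTML returned for this URL may still not contain the property markup, in which case the Selenium approach (or the site's underlying data endpoint) would be needed.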
