
Scrapy LinkExtractor doesn't extract the correct URL

I am crawling a website with Scrapy. My start_url is a search results page with many pages of results. When I use LinkExtractor, it adds extra content to the URLs I want, so only the start_url is crawled successfully and every other, polluted URL returns a 404:

2015-12-15 20:38:43 [scrapy] INFO: Spider opened
2015-12-15 20:38:43 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped     0 items (at 0 items/min)
2015-12-15 20:38:43 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 20:38:44 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-15 20:38:50 [scrapy] DEBUG: Crawled (404) <GET http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-15 20:38:50 [scrapy] DEBUG: Ignoring response <404 http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.htmlkw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++>: HTTP status code is not handled or not allowed
...
2015-12-15 20:39:18 [scrapy] INFO: Closing spider (finished)
2015-12-15 20:39:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2578,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 57627,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 15, 12, 39, 18, 70000),
 'log_count/DEBUG': 12,
 'log_count/INFO': 7,
 'log_count/WARNING': 2,
 'request_depth_max': 1,
 'response_received_count': 6,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2015, 12, 15, 12, 38, 43, 693000)}
2015-12-15 20:39:18 [scrapy] INFO: Spider closed (finished)

I want:

http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93 

instead of:

http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++

I don't know what is causing this. Can anyone help me?

start_urls = [
    'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93',
]

rules = [
    # Rule(LinkExtractor(allow=(r'task.zhubajie.com/success/p\d+\.html',)), callback='parse_item', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]')), callback='parse_item', follow=True)
]

Edit: I tried using process_value like this:

Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'), process_value=lambda x: x.strip()), callback='parse_item', follow=True)

And this:

def process_0(value):
    m = re.search('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20', value)
    if m:
        # this fails: m is a re.Match object, which has no strip() method,
        # and str.strip(chars) removes a set of characters from both ends
        # of a string, not a prefix substring
        return m.strip('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20')

Neither of them works; both produce the same log and request the same wrong URLs.

All the links in the paginator contain a lot of whitespace ( http://screencloud.net/v/qQLW ). You can preprocess the scraped values with the following code before the requests are made:

# coding: utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def process_value(v):
    # the hrefs are padded with whitespace, so after joining, the real URL
    # is the last whitespace-separated token of the extracted value
    v1 = v.split()[-1]
    if v1.startswith('http'):
        v = v1
    return v


class MySpider(CrawlSpider):
    name = 'spider'
    start_urls = [
        'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
    ]
    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'),
                           process_value=process_value), follow=True)
    ]
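Before wiring it into the spider, process_value can be sanity-checked on its own; the padded value below is a hypothetical stand-in for the polluted, already-joined links shown in the question's log:

# assumes the process_value function defined above is in scope
polluted = ('http://task.zhubajie.com/success/                    '
            'http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93')
clean = 'http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'

print(process_value(polluted))  # prints only the p2.html URL
print(process_value(clean))     # unchanged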

Spider output:

2015-12-18 10:35:37 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-18 10:35:37 [scrapy] INFO: Optional features available: ssl, http11
2015-12-18 10:35:37 [scrapy] INFO: Overridden settings: {}
2015-12-18 10:35:37 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-18 10:35:37 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-18 10:35:37 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-18 10:35:37 [scrapy] INFO: Enabled item pipelines: 
2015-12-18 10:35:37 [scrapy] INFO: Spider opened
2015-12-18 10:35:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-18 10:35:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-18 10:35:38 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p4.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p6.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Filtered duplicate request: <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:47 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p5.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] INFO: Closing spider (finished)
2015-12-18 10:35:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2380,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 7,
 'downloader/response_bytes': 196525,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 7,
 'dupefilter/filtered': 36,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 18, 7, 35, 54, 945271),
 'log_count/DEBUG': 9,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 7,
 'scheduler/dequeued': 7,
 'scheduler/dequeued/memory': 7,
 'scheduler/enqueued': 7,
 'scheduler/enqueued/memory': 7,
 'start_time': datetime.datetime(2015, 12, 18, 7, 35, 37, 907281)}
2015-12-18 10:35:54 [scrapy] INFO: Spider closed (finished)
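As for why the earlier lambda x: x.strip() attempt changed nothing: the link extractor appears to join the raw href against the page URL before calling process_value, so by then the whitespace sits in the middle of the value rather than at its ends, where strip() cannot reach it. Below is a minimal sketch of that effect, using plain string concatenation as a stand-in for the old urljoin behaviour (the padded href is hypothetical; recent Python versions strip such leading whitespace in urllib.parse, so urljoin itself may no longer reproduce this):

# sketch: how a whitespace-padded href turns into the polluted URL
base_dir = 'http://task.zhubajie.com/success/'
# hypothetical padded href, as suggested by the screenshot above
href = ('                    '
        'http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
        '                ')

# with the leading spaces the href is not recognised as absolute, so it is
# resolved as a relative path under the base directory; Scrapy then escapes
# the spaces, giving the %20.../++++ URLs from the log
joined = base_dir + href

print(joined.strip())      # still polluted: the spaces sit mid-string
print(joined.split()[-1])  # recovers the real URL, like process_value above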

LinkExtractor documentation
