Scrapy: ValueError('Missing scheme in request url: %s' % self._url)

Question

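I am trying to scrape data from a web page. The page is simply a bulleted list of 2500 URLs. Scrapy fetches that list, then visits each URL and extracts some data...

Here is my code: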

from bs4 import BeautifulSoup
from scrapy import Request, Selector
from scrapy.spiders import CrawlSpider

# NewsFields is the asker's own Item class, defined elsewhere in the project.


class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # Page metadata is read from <meta name="dk..."> tags.
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            # Collapse the text of the content panel into a single line.
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        # Queue every link found in the bulleted list.
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

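The problem is that the code above scrapes only about 215 of the 2500 articles. It then shuts down with this error...

ValueError('Missing scheme in request url: %s' % self._url)

I have no idea what is causing this error.

Any help is much appreciated.

Thanks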

Answer 1

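Update 01/2019

Nowadays Scrapy's Response instances have a very convenient method, response.follow, which generates a Request from a given URL (absolute or relative, or a Link object produced by a LinkExtractor), using response.url as the base: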

yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)

Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow

The code below looks like the problem:

 for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
     yield Request(url, callback=self.parse)

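If any of the URLs is not fully qualified, e.g. it looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you will get this error. To make sure the requests you yield are valid, you can use urljoin: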

    yield Request(response.urljoin(url), callback=self.parse)
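To see why only some links fail, it can help to try the resolution outside Scrapy. The sketch below uses the standard library's urllib.parse.urljoin, which, as far as I know, is what response.urljoin builds on. The helper name resolve and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.example.org/uat-area/scrapy/all-news-listing"


def resolve(base_url, href):
    """Return (has_scheme, absolute_url) for one href taken from the page."""
    has_scheme = bool(urlparse(href).scheme)
    return has_scheme, urljoin(base_url, href)


# Made-up hrefs: one absolute, one root-relative, one path-relative.
for href in ["http://www.example.org/news/a", "/news/b", "news/c"]:
    has_scheme, absolute = resolve(BASE_URL, href)
    print(f"{href!r}: scheme present={has_scheme}, resolved={absolute}")
```

Only the first href carries a scheme, so it is the only one Request would accept unchanged; the other two need to be joined with the page URL first.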

The more idiomatic Scrapy approach is to use LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html

Solution 1
6 · Accepted · 2017-02-03 15:21:45
