Scraping with Scrapy in Python returns only the first record

Question

I'm new to Scrapy and to Python in general. This is my first attempt at scraping a website.

import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    allowed_domains = ['https://www.hamburg.de']
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']
    custom_settings = {
        'FEED_EXPORT_FORMAT': 'utf-8'
    }

    def parse(self, response):
        items = response.xpath("//div[starts-with(@class, 'item')]")
        for item in items:
            business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
            address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
            address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
            phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()

            yield {
                'Business Name': business_name,
                'Address1': address1,
                'Address2': address2,
                'Phone Number': phone
            }
        
        next_page_url = 'https://www.hamburg.de' + response.xpath("//li[@class='next']/a/@href").get()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

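The code runs, and the page I'm scraping has 20 records. The spider yields 20 items, but every one of them is the first record, so I never get the 20 distinct records. There is probably a small mistake in the code, but I haven't been able to find it so far.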

** As for pagination, I also tried putting this inside the for block, but it didn't work either:

next_page_url = response.xpath("//li[@class='next']/@href").get()
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
    yield scrapy.Request(url=next_page_url, callback=self.parse)

This is the debug output:

{'Business Name': ' A & Z Kfz Meisterbetrieb GmbH ', 'Address1': ' Anckelmannstraße 13', 'Address2': ' 20537 Hamburg (Borgfelde) ', 'Phone Number': '040 / 236 882 10 '}
2020-11-10 19:55:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/>
{'Business Name': ' A+B Automobile ', 'Address1': ' Kuehnstraße 19', 'Address2': ' 22045 Hamburg (Tonndorf) ', 'Phone Number': '040 / 696 488-0 '}
2020-11-10 19:55:10 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hamburg.de': <GET https://www.hamburg.de/branchenbuch/hamburg/10239785/n20/>
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-10 19:55:10 [scrapy.extensions.feedexport] INFO: Stored json feed (20 items) in: output.json
2020-11-10 19:55:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 247,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 50773,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.222001,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 11, 10, 17, 55, 10, 908399),
 'item_scraped_count': 20,
 'log_count/DEBUG': 22,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 11, 10, 17, 55, 8, 686398)}
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Spider closed (finished)

Answer 1

This is where the problem is:

item.xpath("//h3[@class='h3rb']/text()").get()

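An XPath that starts with "//" is absolute: it searches the whole document, even when you call it on a nested selector such as item, so every iteration matches the same first element. To query relative to the current selector in Scrapy, use ".//" instead of "//". Try changing your code as follows: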

business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()

Hope it works as you wish.

Solution 1
1 Accepted 2020-11-10 16:45:18
