簡體   English   中英

Scrapy.Request 返回<GET url>不刮任何東西

[英]Scrapy.Request returns <GET url> without scraping anything

我想抓取 sitepoint.com 的提要,這是我的代碼:

import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class paraneters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)

            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
           '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text

這是我得到的回應:-

[{'title': 'How to Build an MVP with React and Firebase', 
'href': '/react-firebase-build-mvp/', 
'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp- 
app.jpg', 
'time': 'September 28, 2021', 
'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]

它只是不抓取網址。 我遵循了這個問題中所說的一切,但仍然無法使其工作。

您必須訪問列表中的詳細信息頁面才能抓取文章。

在這種情況下,您必須先生成 URL,然后在最后一個蜘蛛中生成數據

此外, //*[@id="main-content"]/article/div/div/div[1]/section/text()不會返回任何文本,因為該section下有很多 HTML 元素標簽

一種解決方案是您可以抓取section標簽內的所有 HTML 元素並稍后清理它們以獲取您的文章文本數據

這是完整的工作代碼

import re

import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class paraneters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def clean_text(self, raw_html):
        """
        :param raw_html: this will take raw html code
        :return: text without html tags
        """
        cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        return re.sub(cleaner, '', raw_html)

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            yield scrapy.Request(url, callback=self.parse_article, meta={"title": title,
                                                                         "href": href,
                                                                         "img": img,
                                                                         "time": time})

    def parse_article(self, response):
        title = response.request.meta["title"]
        href = response.request.meta["href"]
        img = response.request.meta["img"]
        time = response.request.meta["time"]
        all_data = {}
        article_html = response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section').get()
        all_data["title"] = title
        all_data["href"] = href
        all_data["img"] = img
        all_data["time"] = time
        all_data["text"] = self.clean_text(article_html)

        yield all_data

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM