
Scrapy.Request returns <GET url> without scraping anything

I wanted to scrape the feed of sitepoint.com. This is my code:

import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)

            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
           '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text

And this is the response I get:

[{'title': 'How to Build an MVP with React and Firebase', 
'href': '/react-firebase-build-mvp/', 
'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp-app.jpg',
'time': 'September 28, 2021', 
'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]

It just does not scrape the URLs. I followed everything said in this question but still could not make it work.

You have to visit the detail page from the listing to scrape the article. Creating a scrapy.Request object by itself does not download anything; the request is only fetched after it is yielded back to the engine, which is why the text field in your output is just the request's repr.

In that case you have to yield a request for each article URL first, and then yield the item from the callback.
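
A minimal sketch of that pattern (assuming Scrapy 1.7+ so that cb_kwargs is available; the full answer below passes the same fields through request.meta instead):

def parse(self, response):
    for article in response.css("article"):
        href = article.css("a.t12xxw3g::attr(href)").get()
        # Schedule the detail page; Scrapy downloads it and calls parse_article.
        yield scrapy.Request(
            response.urljoin(href),
            callback=self.parse_article,
            cb_kwargs={"title": article.css("a.t12xxw3g::text").get()},
        )

def parse_article(self, response, title):
    # Only here do we have the article HTML, so the item is yielded here.
    yield {"title": title, "url": response.url}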

Also, //*[@id="main-content"]/article/div/div/div[1]/section/text() won't return any text, since there are lots of HTML elements nested under the section tag and text() only selects the section's direct text nodes.
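
If all you need is the visible text, one quick alternative (not what the answer below does) is to select every descendant text node with //text() and join the pieces:

text_parts = response.xpath(
    '//*[@id="main-content"]/article/div/div/div[1]/section//text()'
).getall()
text = " ".join(part.strip() for part in text_parts if part.strip())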

One solution is to scrape all of the HTML inside the section tag and clean it afterwards to get the article text.

Here is the full working code:

import re

import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def clean_text(self, raw_html):
        """
        :param raw_html: this will take raw html code
        :return: text without html tags
        """
        cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        return re.sub(cleaner, '', raw_html)

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            yield scrapy.Request(url, callback=self.parse_article, meta={"title": title,
                                                                         "href": href,
                                                                         "img": img,
                                                                         "time": time})

    def parse_article(self, response):
        title = response.request.meta["title"]
        href = response.request.meta["href"]
        img = response.request.meta["img"]
        time = response.request.meta["time"]
        all_data = {}
        article_html = response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section').get()
        all_data["title"] = title
        all_data["href"] = href
        all_data["img"] = img
        all_data["time"] = time
        all_data["text"] = self.clean_text(article_html)

        yield all_data
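
As a side note, if you would rather not maintain the tag-stripping regex yourself, w3lib (installed together with Scrapy) ships helpers that could replace clean_text; this is just an alternative sketch, not part of the code above:

from w3lib.html import remove_tags, replace_escape_chars

def clean_text(self, raw_html):
    # Strip tags first, then replace escaped whitespace such as \n and \t with spaces.
    return replace_escape_chars(remove_tags(raw_html), replace_by=" ")

Assuming the spider is saved as sitepoint.py, you can run it and export the items with: scrapy runspider sitepoint.py -o articles.json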
