Scrapy.Request returns <GET url> without scraping anything
I wanted to scrape the feed of sitepoint.com; this is my code:
import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)
            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
            '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text
And this is the response I get:
[{'title': 'How to Build an MVP with React and Firebase',
  'href': '/react-firebase-build-mvp/',
  'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp-app.jpg',
  'time': 'September 28, 2021',
  'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]
It just does not scrape the URLs. I followed everything said in this question but still could not make it work.
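You have to visit the detail page from the listing to scrape the article. Creating a scrapy.Request does not fetch anything by itself, which is why the text field only shows <GET url>. You have to yield the Request first, and then yield the data from the last callback (parse_article), once the detail page has been downloaded.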
Also, the XPath //*[@id="main-content"]/article/div/div/div[1]/section/text() won't return the article text: text() selects only the text nodes that are direct children of the section tag, and the content sits inside the many HTML elements nested under it.
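To illustrate the difference between direct text nodes and descendant text, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than Scrapy's selectors. The SAMPLE markup is a hypothetical, simplified stand-in for the real page, and the two expressions are only rough equivalents of the XPath queries named in the comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for an article page.
SAMPLE = (
    "<section>"
    "<p>First paragraph.</p>"
    "<h2>A heading</h2>"
    "<p>Second <em>paragraph</em>.</p>"
    "</section>"
)

section = ET.fromstring(SAMPLE)

# Rough equivalent of XPath `section/text()`: only text nodes that are
# direct children of <section> (its own text plus the tail after each child).
direct_text = [t for t in [section.text, *(child.tail for child in section)] if t]

# Rough equivalent of XPath `section//text()`: text from every descendant.
all_text = list(section.itertext())

print(direct_text)        # nothing sits directly under <section>
print("".join(all_text))  # all nested text, concatenated
```

With this sample, the direct-children query finds nothing, while the descendant query collects every piece of text nested under the section.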
One solution is to scrape all of the HTML inside the section tag and clean it afterwards to get the article text.
Here is the full working code:
import re
import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def clean_text(self, raw_html):
        """
        :param raw_html: raw HTML code (or None if the XPath matched nothing)
        :return: text without HTML tags and character entities
        """
        if raw_html is None:
            return ""
        cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        return re.sub(cleaner, '', raw_html)

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            # Yield the Request itself; the listing data travels with it in meta.
            yield scrapy.Request(
                url,
                callback=self.parse_article,
                meta={"title": title, "href": href, "img": img, "time": time},
            )

    def parse_article(self, response):
        # response.meta is a shortcut for response.request.meta
        all_data = {
            "title": response.meta["title"],
            "href": response.meta["href"],
            "img": response.meta["img"],
            "time": response.meta["time"],
        }
        article_html = response.xpath(
            '//*[@id="main-content"]/article/div/div/div[1]/section'
        ).get()
        all_data["text"] = self.clean_text(article_html)
        yield all_data
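Note that the regex in clean_text deletes character entities instead of decoding them (so "Tom &amp; Jerry" loses its ampersand) and keeps the contents of any script tags. If that matters, one possible alternative is to extract the text with the standard library's html.parser. This is only a sketch: TextExtractor, html_to_text and the BLOCK_TAGS set are hypothetical names chosen for this example, not part of Scrapy or of the code above.

```python
from html.parser import HTMLParser

# Assumption: tags after which a word break is wanted in the output.
BLOCK_TAGS = {"p", "div", "section", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            # Keep adjacent blocks from running together.
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(raw_html):
    """Return the visible text of an HTML fragment ('' for None)."""
    parser = TextExtractor()
    parser.feed(raw_html or "")
    parser.close()
    return parser.text()


print(html_to_text(
    "<section><p>Tom &amp; Jerry</p>"
    "<script>var x = 1;</script>"
    "<p>use a &lt;div&gt;</p></section>"
))
```

In the spider, html_to_text(article_html) could stand in for self.clean_text(article_html) if decoded entities are preferred.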