When scraping a website using scrapy, how can I generate requests dynamically?
I am trying to scrape data from a website using scrapy, but the page is generated by a request consisting of a URL plus other data (I found this using the browser's developer tools). I copied the request as cURL and translated it to scrapy using a curl-to-scrapy converter. I can get data from these pages when I create the request manually and then fetch it, so I need to create the request manually. My question is: how can I create the request dynamically in the spider?
import json

import scrapy
# PowerGenerationItem is defined in the project's items module


class PowerGenerationSpider(scrapy.Spider):
    name = "power_generation"

    # make request
    url = 'http://my/website'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Authorization": "authorization_string",
        "Content-Type": "application/json",
        "Origin": "http://my/website",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "mywebsite"
    }
    # need to run this for month 01 to 10; how can I change this programmatically?
    body = '{"month":"01","year":"2020","user_id":"vishvajeet"}'
    request = scrapy.Request(
        url=url,
        method='POST',
        dont_filter=True,
        headers=headers,
        body=body,
    )

    def start_requests(self):
        yield self.request  # scrapy.Request(self.request, self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        solar_energy_datalist = data["resultObject"]
        item = PowerGenerationItem()
        for solar_energy_data in solar_energy_datalist:
            date = solar_energy_data['date']
            power_generation = solar_energy_data['power_generation']
            item['date'] = date
            item['power_generation'] = power_generation
            yield item
How can I generate requests with different parameters inside the request body and pass them to the crawler, so that it crawls one after another? Note: I found other resources on the web that explain how to generate URLs dynamically. This is not about the URL; I want to generate the request, because the URL stays the same.
Edit 1: the cURL command that I converted to a scrapy request is below. I might delete the authentication-related information later.
curl 'http://3.6.0.2/inject-solar-angular/inject_solar_server/graph/Graph/cumulative_month_graph' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0' -H 'Accept: application/json, text/plain, */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Authorization: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6IjkwIiwicm9sZSI6IjQiLCJ0aW1lc3RhbXAiOjE1OTU1NjcxMjEsInN0YXR1cyI6MX0.-RFFUc69PxDsAk_zmn3VI8OqUh-mYkYioFyTSBU17_s' -H 'Content-Type: application/json' -H 'Origin: http://www.injectsolar.com' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Referer: http://www.injectsolar.com/portal/' --data-raw '{"month":"01","year":"2020","user_id":"triose"}'
Edit 2: I am trying to fetch daily power-generation data from Jan 2020 to Jun 2020. For that I need to change the parameter inside the body: with month=01 in the body it gives January 2020 data; if I change it to month=02 it gives me February data. But I want the crawler to do this automatically, moving from one month to the next the way a crawler moves from one page to the next.
A few things in terms of the code provided. I would like you to provide a bit more code showing an attempt at generating the request with different body parameters dynamically. This is something I don't see in your code yet, so there's only so much I can help. But there's enough in this answer to show you how to transfer ANY information you want from one function to another, so you should be able to make an attempt.

As another disclaimer: because I don't have the URL, I can't run this code myself, so it's hard to pin down whether there are problems with it.
In the start_requests function, refer to the headers as self.headers and set cb_kwargs=self.body. The cb_kwargs argument is used to transfer any dictionary of information you want from one function to another. headers is a defined argument in scrapy.Request, so cb_kwargs doesn't apply to that; but the body dictionary you created is user-created information, and the cb_kwargs argument does apply to it. The callback is self.parse, which is the usual callback for requests yielded from the start_requests function.

Note also that the URL here does not come from start_urls; if it did, you would refer to the URL argument in the scrapy.Request as self.start_urls[0]. In this case you've defined the variable url, so in your scrapy.Request you need to refer to it as self.url.
import json

import scrapy
# PowerGenerationItem is defined in the project's items module


class PowerGenerationSpider(scrapy.Spider):
    name = "power_generation"

    # make request
    url = 'http://my/website'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Authorization": "authorization_string",
        "Content-Type": "application/json",
        "Origin": "http://my/website",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "mywebsite"
    }
    # need to run this for month 01 to 10; how can I change this programmatically?
    body = {"month": "01", "year": "2020", "user_id": "vishvajeet"}

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            dont_filter=True,
            headers=self.headers,
            cb_kwargs=self.body,
            callback=self.parse
        )

    def parse(self, response, **cb_kwargs):
        # the cb_kwargs dictionary (month, year, user_id) arrives here
        # as keyword arguments, so the signature must accept them
        data = json.loads(response.text)
        solar_energy_datalist = data["resultObject"]
        item = PowerGenerationItem()
        for solar_energy_data in solar_energy_datalist:
            date = solar_energy_data['date']
            power_generation = solar_energy_data['power_generation']
            item['date'] = date
            item['power_generation'] = power_generation
            yield item
Remember that the start_requests function is used to generate the requests the spider starts with, which is what would otherwise happen automatically for start_urls.
The self.VARIABLE pattern comes from knowing about classes, and understanding what it means is essential to following this example. The beauty of self.VARIABLE is that if you define a variable in a class but NOT inside a function of that class, it can be used in ANY function within the class; you just have to remember to refer to it as self.VARIABLE in each function. This type of variable is called a class variable.
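A minimal, scrapy-free sketch of a class variable (the names here are made up for illustration):

```python
class PowerPlant:
    # class variable: defined in the class body, outside any function
    unit = "kWh"

    def label(self, value):
        # every method can reach it through self.<name>
        return f"{value} {self.unit}"
```

So PowerPlant().label(12) returns "12 kWh"; url, headers and body in the spider above work the same way.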
Consider ItemLoaders for extracting data if you need to modify the responses scrapy gives you, or want to apply a small function to the extracted data to clean it up. ItemLoaders give you a lot more freedom to change the outputs of the scrapy responses.
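For example, here is a small cleanup function of the kind you would plug into an ItemLoader; the function and the raw value format are made up, but with ItemLoaders it would be attached via an input processor such as MapCompose:

```python
def clean_power_value(raw):
    """Strip whitespace and a trailing unit, e.g. ' 12.5 kWh ' -> 12.5."""
    return float(raw.strip().rstrip("kWh").strip())

# With ItemLoaders this would be wired up roughly as:
#
#     from itemloaders.processors import MapCompose, TakeFirst
#
#     class PowerGenerationLoader(ItemLoader):
#         power_generation_in = MapCompose(clean_power_value)
#         power_generation_out = TakeFirst()
```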
See the scrapy.Request documentation; if you read it carefully it explains cb_kwargs, and you will also get a better understanding of the other arguments you can use.
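To illustrate what cb_kwargs does: scrapy passes each key of the dictionary to the callback as a keyword argument, roughly like this (the string below stands in for a real Response object):

```python
def parse(response, month=None, year=None, user_id=None):
    # scrapy unpacks cb_kwargs into these keyword arguments
    return f"month={month}, year={year}, user={user_id}"

cb_kwargs = {"month": "01", "year": "2020", "user_id": "vishvajeet"}
# This mirrors what scrapy does internally when it fires the callback:
result = parse("fake-response", **cb_kwargs)
```

This is why the callback's signature must accept the same names as the dictionary's keys (or a catch-all **kwargs).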