
When scraping a website using scrapy, how to generate requests dynamically?

I am trying to scrape data from a website using scrapy, but the page is generated by a request consisting of a URL plus other data (I found this using the browser's developer tools). I copied the request as cURL and translated it to scrapy using a curl-to-scrapy converter. I can get data from these pages when I create the request manually and then fetch it, so I need to create the request manually. My question is: how can I create the request dynamically in the spider?

class PowerGenerationSpider(scrapy.Spider):
    name = "power_generation"
    # make request
    url = 'http://my/website'

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Authorization": "authorization_string",
        "Content-Type": "application/json",
        "Origin": "http://my/website",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "mywebsite"
    }
    # need to run this for months 01 to 10 -- how can I change this programmatically?
    body = '{"month":"01","year":"2020","user_id":"vishvajeet"}'

    request = scrapy.Request(
        url=url,
        method='POST',
        dont_filter=True,
        headers=headers,
        body=body,
    )

    def start_requests(self):
        yield self.request  # scrapy.Request(self.request, self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        solar_energy_datalist = data["resultObject"]
        item = PowerGenerationItem()
        for solar_energy_data in solar_energy_datalist:
            date = solar_energy_data['date']
            power_generation = solar_energy_data['power_generation']

            item['date'] = date
            item['power_generation'] = power_generation
            yield item

How can I generate requests with different parameters inside the request body and pass them to the crawler, so that it goes on to crawl the next request? Note: I found other resources on the web that explain how to generate URLs dynamically. This is not about the URL; I want to generate the request itself, because the URL stays the same.

edit 1: The cURL command I converted to a scrapy request is below. I might delete the authentication-related information later.

curl 'http://3.6.0.2/inject-solar-angular/inject_solar_server/graph/Graph/cumulative_month_graph' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0' -H 'Accept: application/json, text/plain, */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Authorization: eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpZCI6IjkwIiwicm9sZSI6IjQiLCJ0aW1lc3RhbXAiOjE1OTU1NjcxMjEsInN0YXR1cyI6MX0.-RFFUc69PxDsAk_zmn3VI8OqUh-mYkYioFyTSBU17_s' -H 'Content-Type: application/json' -H 'Origin: http://www.injectsolar.com' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Referer: http://www.injectsolar.com/portal/' --data-raw '{"month":"01","year":"2020","user_id":"triose"}'

edit 2:

I am trying to fetch daily power-generation data from Jan-20 to Jun-20. For that I need to change a parameter inside the body: if the body dictionary has month=01 it gives the Jan-20 data, and if I change it to month=02 it gives the February data. But I want the crawler to do this automatically, the way a crawler moves from one page to the next; it should give me the data that way.

A few things in terms of the code provided. I would like you to provide a bit more code showing an attempt at generating the request with different body parameters dynamically. This is something I don't see in your code yet, so there's only so much I can help with. But there's enough here to show you how to transfer ANY information you want from one function to another, so you should be able to make an attempt.

As another disclaimer, because I don't have the URL, I can't run this code myself, so it's hard to pinpoint any problems with it.

Corrections

  1. Put the scrapy Request inside the start_requests function.
  2. Referring to variables defined outside the function requires you to add self.VARIABLE, in this case self.headers.
  3. To pass your parameters on to the callback, use the cb_kwargs argument, so cb_kwargs=self.body. The cb_kwargs argument transfers any dictionary of information you want from one function to another. headers is a defined argument of scrapy.Request, so it doesn't apply to that, but the body dictionary you created is user-created information and cb_kwargs applies to it. Note that to actually send the payload you still need the body argument, with the dictionary JSON-encoded, e.g. body=json.dumps(self.body).
  4. You need to define a callback for the response to be processed. Here we define the callback as self.parse, which is the default callback for requests.
  5. If the url were the same as the URL in start_urls, you could refer to the URL argument in scrapy.Request as self.start_urls[0]. In this case you've defined the variable url, so in your scrapy.Request you need to refer to it as self.url.

Code Example

import json

import scrapy


class PowerGenerationSpider(scrapy.Spider):
    name = "power_generation"
    # make request
    url = 'http://my/website'

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Authorization": "authorization_string",
        "Content-Type": "application/json",
        "Origin": "http://my/website",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "mywebsite"
    }
    body = {"month": "01", "year": "2020", "user_id": "vishvajeet"}

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            dont_filter=True,
            headers=self.headers,
            body=json.dumps(self.body),  # the actual POST payload must be a string
            cb_kwargs=self.body,         # forwarded to the callback as keyword arguments
            callback=self.parse,
        )

    def parse(self, response, month, year, user_id):
        # month, year and user_id arrive here via cb_kwargs
        data = json.loads(response.text)
        solar_energy_datalist = data["resultObject"]
        for solar_energy_data in solar_energy_datalist:
            item = PowerGenerationItem()  # a fresh item per record, so yields don't overwrite each other
            item['date'] = solar_energy_data['date']
            item['power_generation'] = solar_energy_data['power_generation']
            yield item
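To cover months 01 to 10 as asked, start_requests can simply loop and yield one request per month. Building the bodies is plain Python; this is only a sketch, with user_id carried over as the placeholder value from the question:

```python
import json


def month_bodies(first=1, last=10, year="2020", user_id="vishvajeet"):
    """Yield (body_dict, body_json) for each month in the range.

    user_id is the placeholder value from the question; substitute your own.
    """
    for month in range(first, last + 1):
        body = {"month": f"{month:02d}", "year": year, "user_id": user_id}
        yield body, json.dumps(body)
```

Inside the spider, start_requests would then loop over month_bodies() and, for each pair, yield scrapy.Request(url=self.url, method='POST', dont_filter=True, headers=self.headers, body=body_json, cb_kwargs=body, callback=self.parse). scrapy schedules all of those requests, which is exactly the "crawl from one page to the next" behaviour you described.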

Additional Information

  1. Remember that the start_requests function is used to generate the initial requests (by default, one for each URL in start_urls).

  2. The self.VARIABLE notation comes from knowing about classes, and understanding what it means is essential to following this example. The beauty of self.VARIABLE is that if you define a variable in a class that is NOT inside a function of the class, it can be used in ANY function within the class, but you have to remember to refer to it as self.VARIABLE in each function. This type of variable is called a class variable.
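As a minimal illustration of class variables, independent of scrapy (the names here are invented for the example):

```python
class Greeter:
    # class variable: defined in the class body, outside any function
    greeting = "hello"

    def greet(self, name):
        # inside a method the class variable must be accessed as self.greeting
        return f"{self.greeting}, {name}"
```

Calling Greeter().greet("scrapy") returns "hello, scrapy". This is the same pattern that makes self.url, self.headers, and self.body available inside start_requests above.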

  3. Consider ItemLoaders for extracting data if you need to modify the responses scrapy gives you, or want to apply a small function to the extracted data to clean it up. ItemLoaders give you a lot more freedom to change the outputs of scrapy responses.

Resources for additional learning

  1. For more information about class variables, please see here.
  2. Read the scrapy docs for scrapy.Request here; if you read carefully, they explain cb_kwargs, and you will also get a better understanding of the arguments you can use.
