Scrapy: crawling through pages with PostBack data when the JavaScript URL doesn't change

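I'm using Scrapy to crawl some directory listings on a site built with ASP.NET.

The pagination links on the pages I want to crawl are coded like this: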

javascript:__doPostBack('MoreInfoListZbgs1$Pager','X')

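where X is an integer between 1 and 180. The problem is that the URL stays the same when I click the next page or any other page. I wrote the code below, but it only extracts the links from the first page.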

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import re
from scrapy.http import FormRequest
import js2xml
import requests
from datetime import datetime


class nnggzySpider(scrapy.Spider):

    name = 'nnggzygov'
    start_urls = [
        'https://www.nnggzy.org.cn/gxnnzbw/showinfo/zbxxmore.aspx?categorynum=001004001'
    ]

    base_url = 'https://www.nnggzy.org.cn'


    custom_settings = {
        'LOG_LEVEL': 'ERROR'
    }

    def parse(self, response):
        _response = response.text
        self.data = {}
        soup = BeautifulSoup(response.body, 'html.parser')
        tags = soup.find_all('a', href=re.compile(r"InfoDetail"))

        # Collect the pagination parameters
        __VIEWSTATE = re.findall(r'id="__VIEWSTATE" value="(.*?)" />', _response)
        A = __VIEWSTATE[0]
        # print(A)
        __EVENTTARGET = 'MoreInfoListZbgs1$Pager'
        B = __EVENTTARGET
        __CSRFTOKEN = re.findall(r'id="__CSRFTOKEN" value="(.*?)" />', _response)
        C = __CSRFTOKEN
        page_num = re.findall(r'title="转到第(.*?)页"', _response)
        max_page = page_num[-1]

        content = {
            '__VIEWSTATE': A,
            '__EVENTTARGET': B,
            '__CSRFTOKEN': C,
            'page_num': max_page
        }
        infoid = re.findall(r'InfoID=(.*?)&CategoryNum', _response)
        print(infoid)
        yield scrapy.Request(url=response.url, callback=self.parse_detail, meta={"data": content})

    def parse_detail(self, response):
        max_page = response.meta['data']['page_num']
        for i in range(2, int(max_page)):
            data = {
                '__CSRFTOKEN': '{}'.format(response.meta['data']['__CSRFTOKEN']),
                '__VIEWSTATE': '{}'.format(response.meta['data']['__VIEWSTATE']),
                '__EVENTTARGET': 'MoreInfoListZbgs1$Pager',
                '__EVENTARGUMENT': '{}'.format(i),
                # '__VIEWSTATEENCRYPTED': '',
                # 'txtKey': ''
            }
            yield scrapy.FormRequest(url=response.url, callback=self.parse, formdata=data, method="POST", dont_filter=True)

Can anyone help me with this?

It looks like pagination on that site is done by sending a POST request with formdata such as:

{
    "__CSRFTOKEN": ...,
    "__VIEWSTATE": ...,
    "__EVENTTARGET": "MoreInfoListZbgs1$Pager",
    "__EVENTARGUMENT": page_number,
    "__VIEWSTATEENCRYPTED": "",
    "txtKey": ""
}
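As a rough illustration of the payload above, the hidden fields can be read from the current page and combined with the pager target and page number. This is only a sketch using the standard library: `HiddenInputCollector` and `build_postback_formdata` are hypothetical names, and it assumes the page contains a single form whose hidden inputs should all be echoed back.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing value attribute is submitted as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_formdata(html, event_target, page_number):
    """Build the POST payload for one pager postback (hypothetical helper)."""
    collector = HiddenInputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    # Same two fields that __doPostBack sets before submitting the form.
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = str(page_number)
    return formdata
```

The resulting dict has string keys and values, so it could be passed as `formdata` to `scrapy.FormRequest`. Scrapy's `FormRequest.from_response` also pre-fills a form's hidden inputs, which may be simpler when the response object is at hand.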

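I know this is a year-old thread, but I'm posting an answer for future visitors arriving from a Google search.

Your form submission doesn't work because there are almost certainly more hidden fields near the bottom of the web page, still inside the form. In my case, this is the submission that works: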

# This is the next page link
# <a id="nextId" href="javascript:__doPostBack('MoreInfoListZbgs1$Pager','')"> Next </a>

# This is how the website evaluates the next link
# <script type="text/javascript">
# //<![CDATA[
# var theForm = document.forms['Form1'];
# if (!theForm) {
#     theForm = document.Form1;
# }
# function __doPostBack(eventTarget, eventArgument) {
#     if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
#         theForm.__EVENTTARGET.value = eventTarget;
#         theForm.__EVENTARGUMENT.value = eventArgument;
#         theForm.submit();
#     }
# }
# //]]>
# </script>

# According to the JS code above, we need to pass in the following arguments:
data = {
    '__EVENTTARGET': 'MoreInfoListZbgs1$Pager', # first argument from javascript:__doPostBack('MoreInfoListZbgs1$Pager','') next link
    '__EVENTARGUMENT': '', # second argument from javascript:__doPostBack('MoreInfoListZbgs1$Pager','') next link, in my case it is empty
    '__VIEWSTATE': response.css('input[name=__VIEWSTATE]::attr("value")').get(),

    # These are the additional hidden input fields you need to pass in
    '__VIEWSTATEGENERATOR': response.css('input[name=__VIEWSTATEGENERATOR]::attr("value")').get(),
    '__EVENTVALIDATION': response.css('input[name=__EVENTVALIDATION]::attr("value")').get(),
}

yield scrapy.FormRequest(url=form_action_url_here, formdata=data, callback=self.parse)
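Since `__EVENTTARGET` and `__EVENTARGUMENT` are just the two arguments of the `__doPostBack(...)` call in the link, they can be read from the `href` instead of being hard-coded. A minimal sketch, assuming the arguments are single-quoted as in the link shown above; `parse_postback_href` is a hypothetical helper name:

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
_POSTBACK_RE = re.compile(
    r"""__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"""
)


def parse_postback_href(href):
    """Return (event_target, event_argument) from a postback link, or None."""
    match = _POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For a link that is not a postback (for example an ordinary detail-page URL), the function returns `None`, so the caller can tell pager links apart from regular ones.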
