
How can I get data from input without a form in Scrapy

I am writing a web scraper in Python with Scrapy, and it should get input data from this website.

When I select a state on the left side of the site, it sends a POST request. The POST payload (with the state "Alaska" selected) looks like this:

{
    "CurrentPage": "1",
    "SearchType": "org",
    "GroupExemption": "",
    "AffiliateOrgName": "",
    "RelatedOrgName": "",
    "RelatedOrgEin": "",
    "RelationType": "",
    "RelatedOrgs": "",
    "SelectedCityNav[]": "",
    "SelectedCountyNav[]": "",
    "Eins": "",
    "ul": "",
    "PCSSubjectCodes[]": "",
    "PeoplePCSSubjectCodes[]": "",
    "PCSPopulationCodes[]": "",
    "AutoSelectTaxonomyFacet": "",
    "AutoSelectTaxonomyText": "",
    "Keywords": "",
    "State": "Alaska",
    "City": "",
    "PeopleZip": "",
    "PeopleZipRadius": "Zip+Only",
    "PeopleCity": "",
    "PeopleRevenueRangeLow": "$0",
    "PeopleRevenueRangeHigh": "max",
    "PeopleAssetsRangeLow": "$0",
    "PeopleAssetsRangeHigh": "max",
    "Sort": ""
}
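Since the browser submits this payload as `application/x-www-form-urlencoded`, the dict has to be flattened into `key=value` pairs joined by `&` before it goes on the wire. A minimal sketch with the standard library, using only a subset of the fields above:

```python
from urllib.parse import urlencode

# Subset of the captured search payload; field names come from the
# request observed in the browser's network tab.
payload = {
    "CurrentPage": "1",
    "SearchType": "org",
    "State": "Alaska",
    "Sort": "",
}

# urlencode() produces the form-urlencoded request body.
body = urlencode(payload)
print(body)  # CurrentPage=1&SearchType=org&State=Alaska&Sort=
```

This is exactly the transformation the spiders below apply to the full payload before attaching it as the request body.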

The problem is that there is no form, only inputs, and I don't know how to handle that. I am sending the POST request with scrapy.http.Request. When I run my spider, it gives me the site's HTML with no state selected. My spider:

import urllib
import scrapy
from scrapy.http import Request
from scrapy.utils.response import open_in_browser


class NonprofitSpider(scrapy.Spider):
    name = 'nonprofit'
    start_urls = ['https://www.guidestar.org/search']  # without this, parse() is never called

    def parse(self, response):

        url = 'https://www.guidestar.org/search' # or maybe 'https://www.guidestar.org/search/SubmitSearch'?

        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        data = {
            "CurrentPage": "1",
            "SearchType": "org",
            "GroupExemption": "",
            "AffiliateOrgName": "",
            "RelatedOrgName": "",
            "RelatedOrgEin": "",
            "RelationType": "",
            "RelatedOrgs": "",
            "SelectedCityNav[]": "",
            "SelectedCountyNav[]": "",
            "Eins": "",
            "ul": "",
            "PCSSubjectCodes[]": "",
            "PeoplePCSSubjectCodes[]": "",
            "PCSPopulationCodes[]": "",
            "AutoSelectTaxonomyFacet": "",
            "AutoSelectTaxonomyText": "",
            "Keywords": "",
            "State": "Alaska",
            "City": "",
            "PeopleZip": "",
            "PeopleZipRadius": "Zip+Only",
            "PeopleCity": "",
            "PeopleRevenueRangeLow": "$0",
            "PeopleRevenueRangeHigh": "max",
            "PeopleAssetsRangeLow": "$0",
            "PeopleAssetsRangeHigh": "max",
            "Sort": ""
        }

        return Request(
            url=url,
            method='POST',
            headers=headers,
            body=urllib.parse.urlencode(data),
            callback=self.start
        )

    def start(self, response):

        print('response', response)

        open_in_browser(response)

After recreating the request against the right endpoint, you can parse the data from the JSON response:

import urllib
import scrapy
from scrapy.http import Request


class NonprofitSpider(scrapy.Spider):
    name = 'nonprofit'
    start_urls = ['https://www.guidestar.org/search']

    def parse(self, response):
        url = 'https://www.guidestar.org/search/SubmitSearch'

        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        data = {
            "CurrentPage": "1",
            "SearchType": "org",
            "GroupExemption": "",
            "AffiliateOrgName": "",
            "RelatedOrgName": "",
            "RelatedOrgEin": "",
            "RelationType": "",
            "RelatedOrgs": "",
            "SelectedCityNav[]": "",
            "SelectedCountyNav[]": "",
            "Eins": "",
            "ul": "",
            "PCSSubjectCodes[]": "",
            "PeoplePCSSubjectCodes[]": "",
            "PCSPopulationCodes[]": "",
            "AutoSelectTaxonomyFacet": "",
            "AutoSelectTaxonomyText": "",
            "Keywords": "",
            "State": "Alaska",
            "City": "",
            "PeopleZip": "",
            "PeopleZipRadius": "Zip+Only",
            "PeopleCity": "",
            "PeopleRevenueRangeLow": "$0",
            "PeopleRevenueRangeHigh": "max",
            "PeopleAssetsRangeLow": "$0",
            "PeopleAssetsRangeHigh": "max",
            "Sort": ""
        }

        return Request(
            url=url,
            method='POST',
            headers=headers,
            body=urllib.parse.urlencode(data),
            callback=self.start
        )

    def start(self, response):
        print('response', response)
        json_data = response.json()
        for element in json_data['Hits']:
            OrgName = element['OrgName']
            Ein = element['Ein']
            # ... and so on for the remaining fields
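For reference, the shape of the loop above can be exercised offline against a hand-made sample. The field names `Hits`, `OrgName`, and `Ein` come from the answer's code; the sample values here are made up, and the real response carries many more fields per hit:

```python
import json

# Hypothetical sample mimicking the structure of the SubmitSearch response.
sample = json.loads('''
{
  "Hits": [
    {"OrgName": "Example Nonprofit", "Ein": "12-3456789"}
  ]
}
''')

# Same access pattern as in the spider's callback.
for element in sample["Hits"]:
    print(element["OrgName"], element["Ein"])
```

In a real spider, `response.json()` gives you this dict directly, and each iteration would typically `yield` a dict (or an Item) instead of printing.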
