How can I get data from input without a form in Scrapy
I'm writing a web scraper in Python Scrapy that should get input data from this website. When I select a state on the left side of the site, it sends a POST request. The POST request (with the state "Alaska" selected):
{
"CurrentPage": "1",
"SearchType": "org",
"GroupExemption": "",
"AffiliateOrgName": "",
"RelatedOrgName": "",
"RelatedOrgEin": "",
"RelationType": "",
"RelatedOrgs": "",
"SelectedCityNav[]": "",
"SelectedCountyNav[]": "",
"Eins": "",
"ul": "",
"PCSSubjectCodes[]": "",
"PeoplePCSSubjectCodes[]": "",
"PCSPopulationCodes[]": "",
"AutoSelectTaxonomyFacet": "",
"AutoSelectTaxonomyText": "",
"Keywords": "",
"State": "Alaska",
"City": "",
"PeopleZip": "",
"PeopleZipRadius": "Zip+Only",
"PeopleCity": "",
"PeopleRevenueRangeLow": "$0",
"PeopleRevenueRangeHigh": "max",
"PeopleAssetsRangeLow": "$0",
"PeopleAssetsRangeHigh": "max",
"Sort": ""
}
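Since the endpoint expects a standard form-encoded body, a payload like the one above can be serialized with the standard library's urllib.parse.urlencode. A minimal sketch using only a subset of the fields:

```python
from urllib.parse import urlencode

# Serialize a subset of the payload above into the form-encoded body the
# endpoint expects. Characters such as "$" and "[]" are percent-encoded
# automatically by urlencode.
data = {
    "CurrentPage": "1",
    "SearchType": "org",
    "State": "Alaska",
    "SelectedCityNav[]": "",
    "PeopleRevenueRangeLow": "$0",
}
body = urlencode(data)
print(body)
# CurrentPage=1&SearchType=org&State=Alaska&SelectedCityNav%5B%5D=&PeopleRevenueRangeLow=%240
```

This is the same body-building step the spider below performs before attaching the 'application/x-www-form-urlencoded' Content-Type header.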
The problem is that there is no form, only inputs, and I don't know how to handle that. I'm using scrapy.http.Request to send the POST request. When I run my spider, it gives me the site's HTML with no state selected. My spider:
import urllib
import scrapy
from scrapy.http import Request
from scrapy.utils.response import open_in_browser


class NonprofitSpider(scrapy.Spider):
    name = 'nonprofit'

    def parse(self, response):
        url = 'https://www.guidestar.org/search'  # or maybe 'https://www.guidestar.org/search/SubmitSearch'?
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        data = {
            "CurrentPage": "1",
            "SearchType": "org",
            "GroupExemption": "",
            "AffiliateOrgName": "",
            "RelatedOrgName": "",
            "RelatedOrgEin": "",
            "RelationType": "",
            "RelatedOrgs": "",
            "SelectedCityNav[]": "",
            "SelectedCountyNav[]": "",
            "Eins": "",
            "ul": "",
            "PCSSubjectCodes[]": "",
            "PeoplePCSSubjectCodes[]": "",
            "PCSPopulationCodes[]": "",
            "AutoSelectTaxonomyFacet": "",
            "AutoSelectTaxonomyText": "",
            "Keywords": "",
            "State": "Alaska",
            "City": "",
            "PeopleZip": "",
            "PeopleZipRadius": "Zip+Only",
            "PeopleCity": "",
            "PeopleRevenueRangeLow": "$0",
            "PeopleRevenueRangeHigh": "max",
            "PeopleAssetsRangeLow": "$0",
            "PeopleAssetsRangeHigh": "max",
            "Sort": ""
        }
        return Request(
            url=url,
            method='POST',
            headers=headers,
            body=urllib.parse.urlencode(data),
            callback=self.start
        )

    def start(self, response):
        print('response', response)
        open_in_browser(response)
After recreating the request, you can parse the data from the JSON response:
import urllib.parse

import scrapy
from scrapy.http import Request


class NonprofitSpider(scrapy.Spider):
    name = 'nonprofit'
    start_urls = ['https://www.guidestar.org/search']

    def parse(self, response):
        url = 'https://www.guidestar.org/search/SubmitSearch'
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }
        data = {
            "CurrentPage": "1",
            "SearchType": "org",
            "GroupExemption": "",
            "AffiliateOrgName": "",
            "RelatedOrgName": "",
            "RelatedOrgEin": "",
            "RelationType": "",
            "RelatedOrgs": "",
            "SelectedCityNav[]": "",
            "SelectedCountyNav[]": "",
            "Eins": "",
            "ul": "",
            "PCSSubjectCodes[]": "",
            "PeoplePCSSubjectCodes[]": "",
            "PCSPopulationCodes[]": "",
            "AutoSelectTaxonomyFacet": "",
            "AutoSelectTaxonomyText": "",
            "Keywords": "",
            "State": "Alaska",
            "City": "",
            "PeopleZip": "",
            "PeopleZipRadius": "Zip+Only",
            "PeopleCity": "",
            "PeopleRevenueRangeLow": "$0",
            "PeopleRevenueRangeHigh": "max",
            "PeopleAssetsRangeLow": "$0",
            "PeopleAssetsRangeHigh": "max",
            "Sort": ""
        }
        return Request(
            url=url,
            method='POST',
            headers=headers,
            body=urllib.parse.urlencode(data),
            callback=self.start
        )

    def start(self, response):
        json_data = response.json()
        for element in json_data['Hits']:
            OrgName = element['OrgName']
            Ein = element['Ein']
            # ... and so on for the other fields in each hit
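The loop over 'Hits' can be checked offline against a hand-made sample. The payload below is hypothetical (only the 'Hits', 'OrgName', and 'Ein' keys come from the answer above; the values are made up, and the real response carries many more fields per hit):

```python
import json

# Hypothetical sample mimicking the shape consumed above: a top-level
# "Hits" list whose entries carry "OrgName" and "Ein".
sample = json.loads('{"Hits": [{"OrgName": "Example Org", "Ein": "12-3456789"}]}')

def extract(payload):
    # Yield one dict per hit, keeping only the fields the spider reads.
    for element in payload["Hits"]:
        yield {"OrgName": element["OrgName"], "Ein": element["Ein"]}

items = list(extract(sample))
print(items)
```

Yielding dicts like this (instead of bare local variables) is also what lets Scrapy export the results via its feed exports.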