How to disable SSL verification on Python Scrapy?
I have been writing data-scraping scripts in PHP for the past 3 years.
Here is a simple PHP script:
$url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY';
$fields = array(
    'p_entity_name' => urlencode('AAA'),
    'p_name_type' => urlencode('A'),
    'p_search_type' => urlencode('BEGINS')
);

// url-ify the data for the POST
$fields_string = '';
foreach ($fields as $key => $value) {
    $fields_string .= $key . '=' . $value . '&';
}
$fields_string = rtrim($fields_string, '&');

// open connection
$ch = curl_init();

// set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_POST, count($fields));
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);

// execute post
$result = curl_exec($ch);
print curl_error($ch) . '<br>';
print curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br>';
print $result;
It works only when CURLOPT_SSL_VERIFYPEER is set to false. If we enable CURLOPT_SSL_VERIFYPEER, or use http instead of https, it returns an empty response.
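For comparison, the payload encoding that the PHP loop builds by hand has a one-line equivalent in the Python standard library (a minimal sketch; the field values are taken from the PHP script above):

```python
from urllib.parse import urlencode

# Same form fields as in the PHP script
fields = {
    "p_entity_name": "AAA",
    "p_name_type": "A",
    "p_search_type": "BEGINS",
}

# urlencode() produces the same key=value&... string the PHP
# foreach loop concatenates, with no trailing '&' to trim
body = urlencode(fields)
print(body)  # p_entity_name=AAA&p_name_type=A&p_search_type=BEGINS
```

Outside Scrapy, the usual Python analogue of `CURLOPT_SSL_VERIFYPEER = 0` is passing `verify=False` to a `requests` call that sends this body.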
However, I have to do the same project in Python Scrapy. Here is the equivalent code in Scrapy:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http.request import Request
import urllib

from appext20.items import Appext20Item

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    DOWNLOAD_HANDLERS = {
        'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError',
    }

    def start_requests(self):
        payload = {"p_entity_name": 'AMEB', "p_name_type": 'A', 'p_search_type': 'BEGINS'}
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

    def parse_data(self, response):
        print('here is response')
        print(response)
It returns an empty response. It needs SSL verification to be disabled.
Please excuse my lack of knowledge of Python Scrapy; I have searched a lot about this but haven't found any solution.
I would suggest looking at this page: http://doc.scrapy.org/en/1.0/topics/settings.html It looks like you can change how the modules behave and adjust the settings of the various handlers.
I also believe this is a duplicate question: Disable SSL certificate verification in Scrapy
HTH
Thanks,
//P
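Following that settings page, one common shape for this fix (a hedged sketch, not tested against this site; the module path `myproject.contextfactory` and the class name are hypothetical) is to point `DOWNLOADER_CLIENTCONTEXTFACTORY` at a context factory that skips certificate verification:

```python
# myproject/contextfactory.py -- hypothetical module, requires pyOpenSSL
from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class IgnoreCertContextFactory(ScrapyClientContextFactory):
    """Context factory that accepts any server certificate."""

    def getContext(self, hostname=None, port=None):
        ctx = ScrapyClientContextFactory.getContext(self)
        # VERIFY_NONE plus an always-true callback disables peer
        # verification -- the Twisted/pyOpenSSL analogue of
        # CURLOPT_SSL_VERIFYPEER = 0 in the PHP script
        ctx.set_verify(SSL.VERIFY_NONE, lambda conn, cert, errno, depth, ok: True)
        return ctx

# settings.py
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.contextfactory.IgnoreCertContextFactory'
```

This is a configuration sketch for the Scrapy version of that era (the one with `scrapy.contrib` imports); check the settings page above for the exact factory interface in your version.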
This code worked for me:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest
import urllib

from appext20.items import Appext20Item
from scrapy.selector import HtmlXPathSelector

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    payload = {"p_entity_name": 'AME', "p_name_type": 'A', 'p_search_type': 'BEGINS'}

    def start_requests(self):
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        return [FormRequest(url,
                            formdata=self.payload,
                            callback=self.parse_data)]

    def parse_data(self, response):
        print('here is response')
        questions = HtmlXPathSelector(response).xpath("//td[@headers='c1']")
        # print questions
        all_links = []
        for tr in questions:
            temp_dict = {}
            temp_dict['link'] = tr.xpath('a/@href').extract()
            temp_dict['title'] = tr.xpath('a/text()').extract()
            all_links.extend([temp_dict])
        print(all_links)
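The extraction loop in parse_data can be exercised offline without Scrapy. A minimal sketch using only the standard library (the HTML snippet is invented for illustration, not real CORPSEARCH output):

```python
import xml.etree.ElementTree as ET

# Invented sample of the kind of markup the spider walks:
# one <td headers="c1"> cell per result row, each wrapping a link.
html = """<table>
  <tr><td headers="c1"><a href="CORPSEARCH.ENTITY_INFORMATION?p_nameid=1">AMEB CORP</a></td></tr>
  <tr><td headers="c1"><a href="CORPSEARCH.ENTITY_INFORMATION?p_nameid=2">AMEB LLC</a></td></tr>
</table>"""

root = ET.fromstring(html)
all_links = []
# Same XPath idea as the spider: every c1 cell, then its link and text
for td in root.findall(".//td[@headers='c1']"):
    a = td.find("a")
    all_links.append({"link": a.get("href"), "title": a.text})

print(all_links)
```

The same `//td[@headers='c1']` selector drives both versions; the spider just runs it through `HtmlXPathSelector` over the live response instead.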