繁体   English   中英

使用Python的请求库缺少标头信息

[英]Missing header information using Python's requests library

我正在使用Python的(3.5.2)请求库(2.12.4)将查询发布到Primer-BLAST网站。 以下是我为此任务编写的脚本:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data)
    for key, value in poster.headers.items():
        print(key, ':', value)

我需要从响应的标头信息中检索NCBI-RCGI-RetryURL字段。 但是,仅当我在Google Chrome中使用HTTP跟踪扩展名时,才能看到此字段。 以下是使用Google Chrome浏览器进行的POST和响应的完整跟踪:

POST https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi
Origin: https://www.ncbi.nlm.nih.gov
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
Cookie: sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA=

HTTP/1.1 200 OK
Date: Tue, 29 Aug 2017 13:38:27 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: origin-when-cross-origin
Content-Security-Policy: upgrade-insecure-requests
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate
Expires: 0
NCBI-PHID: 0C421A7A9A56E5310000000000000001.m_2
NCBI-RCGI-RetryURL: https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi?ctg_time=1504013907&job_key=aWO2H68Wor6FhLSBueGQs8P6gYHu6Zqc7w
NCBI-SID: 0751457F9A561D01_0000SID
Pragma: no-cache
Access-Control-Allow-Methods: POST, GET, PUT, OPTIONS, PATCH, DELETE
Access-Control-Allow-Origin: https://www.ncbi.nlm.nih.gov
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin,X-Accept-Charset,X-Accept,Content-Type,X-Requested-With,NCBI-SID,NCBI-PHID
Content-Type: text/html
Set-Cookie: ncbi_sid=0751457F9A561D01_0000SID; domain=.nih.gov; path=/; expires=Wed, 29 Aug 2018 13:38:27 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
Keep-Alive: timeout=1, max=9
Connection: Keep-Alive
Transfer-Encoding: chunked

这是我从脚本中获得的所有标头信息:

Date : Tue, 29 Aug 2017 14:41:08 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2516
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

NCBI-RCGI-RetryURL字段很重要,因为它包含我需要在其上执行GET请求才能检索结果的URL。

编辑:

根据Maurice Meyer的建议更新了脚本:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Extra headers
headers = {
    'Origin' : 'https://www.ncbi.nlm.nih.gov',
    'Upgrade-Insecure-Requests' : '1',
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer' : 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset',
    'Accept-Encoding' : 'gzip, deflate, br',
    'Accept-Language' : 'en-US,en;q=0.8',
    'Cookie' : 'sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA='
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data, headers=headers)
    for key, value in poster.headers.items():
        print(key, ':', value)

更新输出,仍然没有区别:

Date : Tue, 29 Aug 2017 15:05:27 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2517
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

两者之间的请求数据完全不同。

特别是请求正文数据。 因此,实际上并没有使用Python的请求库丢失标头信息,而是在POST请求到服务器中丢失了信息。

您不能简单地复制并粘贴标题

'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',

或者只是发布数据INPUT_SEQUENCEORGANISM这样的-也是在任何情况下,你必须对生物体的数据显然是错误的-粗略地看一眼显示这将是小家鼠(taxid:10090)小家鼠

所以-您需要查看整个请求-标头和正文,然后设计一个包含服务器所需数据的请求。 查看它,您丢失了服务器需要响应的负载和数据负载。

------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="INPUT_SEQUENCE"

TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_LEFT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_RIGHT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MIN"

70
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MAX"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_NUM_RETURN"

10
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MIN_TM"

57.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_OPT_TM"

60.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_TM"

63.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_DIFF_TM"

3
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_ON_SPLICE_SITE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_5END"

7
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_3END"

4
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MIN_INTRON_SIZE"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MAX_INTRON_SIZE"

1000000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCH_SPECIFIC_PRIMER"

on
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCHMODE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_SPECIFICITY_DATABASE"

refseq_mrna
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOM_DB"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOMSEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="ORGANISM"

Mus musculus (taxid:10090)
------WebKitFormBoundaryJVAJqDi2cI4BTfmc

etc...

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM