
Missing header information using Python's requests library

I am using Python's (3.5.2) requests library (2.12.4) to post a query to the Primer-BLAST website. Below is the script I've written for this task:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data)
    for key, value in poster.headers.items():
        print(key, ':', value)

I need to retrieve the NCBI-RCGI-RetryURL field from the response's header information. However, I can only see this field when I use the HTTP trace extension in Google Chrome. Below is the full trace of the POST and response using Google Chrome:

POST https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi
Origin: https://www.ncbi.nlm.nih.gov
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
Cookie: sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA=

HTTP/1.1 200 OK
Date: Tue, 29 Aug 2017 13:38:27 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: origin-when-cross-origin
Content-Security-Policy: upgrade-insecure-requests
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate
Expires: 0
NCBI-PHID: 0C421A7A9A56E5310000000000000001.m_2
NCBI-RCGI-RetryURL: https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi?ctg_time=1504013907&job_key=aWO2H68Wor6FhLSBueGQs8P6gYHu6Zqc7w
NCBI-SID: 0751457F9A561D01_0000SID
Pragma: no-cache
Access-Control-Allow-Methods: POST, GET, PUT, OPTIONS, PATCH, DELETE
Access-Control-Allow-Origin: https://www.ncbi.nlm.nih.gov
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin,X-Accept-Charset,X-Accept,Content-Type,X-Requested-With,NCBI-SID,NCBI-PHID
Content-Type: text/html
Set-Cookie: ncbi_sid=0751457F9A561D01_0000SID; domain=.nih.gov; path=/; expires=Wed, 29 Aug 2018 13:38:27 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
Keep-Alive: timeout=1, max=9
Connection: Keep-Alive
Transfer-Encoding: chunked

And here is all the header information I get from my script:

Date : Tue, 29 Aug 2017 14:41:08 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2516
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

The NCBI-RCGI-RetryURL field is important because it contains the URL I need to execute a GET request on in order to retrieve the results.

EDIT:

Updated script as per Maurice Meyer's suggestion:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Extra headers
headers = {
    'Origin' : 'https://www.ncbi.nlm.nih.gov',
    'Upgrade-Insecure-Requests' : '1',
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer' : 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset',
    'Accept-Encoding' : 'gzip, deflate, br',
    'Accept-Language' : 'en-US,en;q=0.8',
    'Cookie' : 'sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA='
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data, headers=headers)
    for key, value in poster.headers.items():
        print(key, ':', value)

Updated output, still no difference:

Date : Tue, 29 Aug 2017 15:05:27 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2517
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

The request data in the two cases is completely different.

Specifically, the request body differs. So the problem isn't really header information going missing in Python's requests library; the POST request you send to the server is missing information.

You can't simply copy and paste the header

'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',

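The boundary string in that header has to match the delimiter lines in the request body, and a dict passed as data= is sent URL-encoded rather than as multipart. Nor can you post just INPUT_SEQUENCE and ORGANISM like that. In any case, the value you have for ORGANISM is wrong: the browser's request sends Mus musculus (taxid:10090), not Mus musculus.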
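To see why the copied header cannot work, both requests can be prepared locally without sending anything. This is a minimal sketch, assuming only that requests is installed; the variable names are illustrative and not part of the original script.

```python
import requests

url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'
fields = {'ORGANISM': 'Mus musculus (taxid:10090)'}

# Case 1: the header copied from the browser, combined with data=<dict>.
# requests keeps the header as given but still URL-encodes the body,
# so the header announces a boundary that never appears in the body.
copied = {
    'Content-Type': 'multipart/form-data; '
                    'boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9'
}
mismatched = requests.Request('POST', url, data=fields, headers=copied).prepare()
print(mismatched.headers['Content-Type'])
print(mismatched.body)  # URL-encoded text, no boundary lines in it

# Case 2: files= with (None, value) tuples. requests builds a real
# multipart body and writes a matching boundary into Content-Type itself.
multipart_fields = {name: (None, value) for name, value in fields.items()}
multipart = requests.Request('POST', url, files=multipart_fields).prepare()
print(multipart.headers['Content-Type'])
print(multipart.body.decode())
```

In the second case no Content-Type header is supplied by hand; leaving it out is what lets requests generate a boundary that agrees with the body.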

So you need to look at the whole request, headers and body, and then craft a request that includes the data the server requires. Compared with the browser, your script omits a large number of form fields that the server will need in order to respond. The body the browser sends starts like this:

------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="INPUT_SEQUENCE"

TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_LEFT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_RIGHT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MIN"

70
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MAX"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_NUM_RETURN"

10
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MIN_TM"

57.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_OPT_TM"

60.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_TM"

63.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_DIFF_TM"

3
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_ON_SPLICE_SITE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_5END"

7
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_3END"

4
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MIN_INTRON_SIZE"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MAX_INTRON_SIZE"

1000000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCH_SPECIFIC_PRIMER"

on
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCHMODE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_SPECIFICITY_DATABASE"

refseq_mrna
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOM_DB"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOMSEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="ORGANISM"

Mus musculus (taxid:10090)
------WebKitFormBoundaryJVAJqDi2cI4BTfmc

etc...
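The fields captured above could be sent from Python roughly as follows. This is a sketch under stated assumptions, not a verified client: the field names and values are taken from the captured body, which is cut off at "etc...", so the list is probably incomplete; build_form and submit are hypothetical helper names; and whether the server then returns NCBI-RCGI-RetryURL has not been confirmed here.

```python
import requests

URL = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'
RETRY_HEADER = 'NCBI-RCGI-RetryURL'


def build_form(sequence, organism):
    """Return the form in the shape that requests' files= argument expects."""
    # Text fields and values as they appear in the captured browser body.
    # The capture is truncated, so further fields are likely required.
    text_fields = {
        'INPUT_SEQUENCE': sequence,
        'PRIMER5_START': '',
        'PRIMER5_END': '',
        'PRIMER3_START': '',
        'PRIMER3_END': '',
        'PRIMER_LEFT_INPUT': '',
        'PRIMER_RIGHT_INPUT': '',
        'PRIMER_PRODUCT_MIN': '70',
        'PRIMER_PRODUCT_MAX': '1000',
        'PRIMER_NUM_RETURN': '10',
        'PRIMER_MIN_TM': '57.0',
        'PRIMER_OPT_TM': '60.0',
        'PRIMER_MAX_TM': '63.0',
        'PRIMER_MAX_DIFF_TM': '3',
        'PRIMER_ON_SPLICE_SITE': '0',
        'SPLICE_SITE_OVERLAP_5END': '7',
        'SPLICE_SITE_OVERLAP_3END': '4',
        'MIN_INTRON_SIZE': '1000',
        'MAX_INTRON_SIZE': '1000000',
        'SEARCH_SPECIFIC_PRIMER': 'on',
        'SEARCHMODE': '0',
        'PRIMER_SPECIFICITY_DATABASE': 'refseq_mrna',
        'CUSTOM_DB': '',
        'ORGANISM': organism,
    }
    # A (None, value) tuple makes requests emit a plain form-data part
    # without a filename.
    form = {name: (None, value) for name, value in text_fields.items()}
    # The two file inputs were submitted empty by the browser.
    form['SEQFILE'] = ('', b'', 'application/octet-stream')
    form['CUSTOMSEQFILE'] = ('', b'', 'application/octet-stream')
    return form


def submit(session, sequence, organism):
    """POST the form; return the retry URL, or None if the header is absent."""
    response = session.post(URL, files=build_form(sequence, organism))
    response.raise_for_status()
    # response.headers is case-insensitive; .get() avoids a KeyError.
    return response.headers.get(RETRY_HEADER)
```

Calling submit(requests.Session(), sequence, 'Mus musculus (taxid:10090)') would return the retry URL if the server sets it. A None result suggests the form is still incomplete, in which case the remaining fields from the browser's request body need to be added to text_fields.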
