
How to download a file (perl cgi backend) using python requests

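I have been working on a Python script to download a CSV file from a web server. My usual approach is to right-click the web page, go to "Inspect Element" (Chrome), switch to the Network view, and then click the link to watch the traffic. I was hoping to see something like "https://domain.com/file_i_need.csv", but what I got instead was the location of a Perl script. Since I don't fully understand how that works, I just copied the curl command (right-click the relevant network request and choose "Copy as cURL"). So initially I simply passed the curl command to os.system(). Once I had that working, I tried modifying the script to use pycurl. Now I would like to change it to use the requests library (mainly to keep things elegant/concise). I have seen the answer to this question, but I wonder whether a different approach is needed, since the backend is slightly different from what that answer expects. I have seen urllib.urlretrieve() recommended as an alternative, but I guess it wouldn't work here.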

Question: How do I download a file from a web server when the URL that generates the file is a Perl script?

i.e. https://domain.com/file_maker.pl?param1=12345

The curl command: curl "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1" -H "Accept-Encoding: gzip,deflate,sdch" -H "Host: release.domain.com" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" -H "Referer: https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a" -H "Cookie: releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep" -H "Connection: keep-alive"

Sorry for the wall of text.

If you want to stream the data from the server:

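# UNTESTED
import csv
import requests

# Connect to the web server; stream=True defers downloading the body.
response = requests.get("https://domain.com/file_maker.pl?param1=12345", stream=True)
response.raise_for_status()

# Read the data as CSV. csv.reader needs text lines, so decode each one
# (UTF-8 is an assumption about what the server sends).
data = csv.reader(line.decode("utf-8") for line in response.iter_lines())

# Use the data
for row in data:
    print(row)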

Or, if you want to download the file from the server and store it locally:

# UNTESTED
import requests

# Connect to the web server.
response = requests.get("https://domain.com/file_maker.pl?param1=12345")
response.raise_for_status()

# Store the data. response.content is bytes, so open the file in binary mode.
with open('outfile', 'wb') as outfile:
    outfile.write(response.content)
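If the generated CSV is large, the two approaches above can be combined: stream the response and write it to disk in chunks. The following is only a sketch under assumptions. The function names `save_chunks` and `download`, the 8192-byte chunk size, and the 30-second timeout are illustrative choices, not anything the question or the server prescribes.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as outfile:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                outfile.write(chunk)
                written += len(chunk)
    return written


def download(url, path, headers=None, chunk_size=8192):
    """Stream url to path without holding the whole body in memory."""
    # Third-party library; imported here so save_chunks can be used without it.
    import requests

    with requests.get(url, headers=headers, stream=True, timeout=30) as response:
        response.raise_for_status()
        return save_chunks(response.iter_content(chunk_size=chunk_size), path)
```

A call such as `download("https://domain.com/file_maker.pl?param1=12345", "outfile.csv")` would then save the script's output under a local name of your choosing. The fact that the URL ends in `.pl` rather than `.csv` makes no difference to the client.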

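In your specific case, the CGI script requires some specific header or cookie in order to return the correct data. I don't know which header or cookie it requires, so just send them all: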

import requests

# Note the "?" separating the script path from its query string.
url = "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1"
headers = {
  "Accept-Encoding" : "gzip,deflate,sdch",
  "Accept-Language" : "en-US,en;q=0.8",
  "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36",
  "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Referer" : "https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a",
  "Cookie" : "releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep"
}

response = requests.get(url, headers=headers)
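As a possible refinement, the query string and the cookies can be passed to requests as dictionaries rather than pasted into the URL and a raw "Cookie" header, which makes it easier to drop entries one at a time and find out which ones the script actually needs. This is a sketch, not a tested recipe: `parse_cookie_header` and `fetch_release_csv` are hypothetical helper names, and requests may serialize the cookies slightly differently from the browser's original header.

```python
def parse_cookie_header(header):
    """Split a raw 'Cookie:' header value into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only; values may contain "=" themselves.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def fetch_release_csv(base_url, params, cookie_header, headers=None):
    """Request the CSV, letting requests build the query string and cookies."""
    import requests  # third-party library, imported lazily

    response = requests.get(
        base_url,
        params=params,  # encoded and appended after "?" by requests
        cookies=parse_cookie_header(cookie_header),
        headers=headers,
        timeout=30,  # assumed value
    )
    response.raise_for_status()
    return response.text
```

For the request in the question, `base_url` would be "https://release.domain.com/release_cr_new.pl" and `params` would be `{"releaseid": 26851, "v": 2, "m": "a", "dump_csv": 1}`, with `cookie_header` set to the long string copied from the browser.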

