
Download a file with 'Content-Type': 'text/html' and 'Content-Encoding': 'gzip'

I am trying to download a zipped file from the URL https://taxiforsure.testrail.net/index.php?/reports/get_html/274. The response headers for that URL are:

>>> import requests
>>> r = requests.get(url)  # .headers as a dict and .content below are requests attributes
>>> r.headers
{'Date': 'Tue, 10 Mar 2020 05:40:42 GMT',
 'Content-Type': 'text/html; charset=UTF-8',
 'Content-Length': '2412',
 'Connection': 'keep-alive',
 'Server': 'Apache',
 'Set-Cookie': 'tr_rememberme=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0, notificationbar=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/index',
 'Expires': 'Sat, 01 Jan 2000 00:00:00 GMT',
 'Last-Modified': 'Tue, 10 Mar 2020 05:40:42 GMT',
 'Cache-Control': 'no-store, no-cache, must-revalidate, max-age=0, post-check=0, pre-check=0',
 'Pragma': 'no-cache',
 'Vary': 'Accept-Encoding',
 'Content-Encoding': 'gzip'}

The response body is:

>>> r.content
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n<head>\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n\t<link href="https://fonts.googleapis.com/css?family=Barlow:400,400i,500,500i,600,600i,700,700i" rel="stylesheet">\n\t<title>Login - TestRail</title>\n\n\n\t\t\t<link type="text/css" rel="stylesheet" href="https://static.testrail.io/6.2.1.1003/css/auth-modern-combined.css" media="all" />\n\t\n\n<link rel="shortcut icon" href="https://static.testrail.io/6.2.1.1003/images/favicon.ico"/>\n\n\n<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/jquery.js"></script>\n</head>\n<body>\n\n    <script type="text/javascript">\n\t\t\t\t$(document).ready(function(){\n\t\t\t\t\t$(\'#name\').focus();\n\t\t\t\t});\n\t\t\t</script>\n<div id="form" class="loginpage-form">\n    <div class="logo loginpage-logo" >\n        <a href="http://www.gurock.com/testrail/" target="_blank" class="logo-loginpage"></a>\n    </div>\n    <div id="form-inner">\n        <h1 class="loginpage-installationname">TestRail QA</h1><style>\n    input:-webkit-autofill {\n        -webkit-box-shadow: 0 0 0px 1000px white inset;\n    }\n</style>\n<div id="content">\n    <h1 class="loginpage-login-text">Log into Your Account</h1>\n    <br/>\n                                                            <noscript>\n        <div class="loginpage-message-title-hint">\n            <div class="hint-alert"><img src="https://static.testrail.io/6.2.1.1003/images/theme-modern/layout/warning-icon.svg" align="left" height="18" width="16"/>\n                <span class="hint-on-top">Warning!</span></div>\n            <div class="error-text"> Javascript is disabled in your web browser. 
Please enable Javascript, as Javascript is required to use TestRail.</div>\n        </div>\n    </noscript>\n        \n    <form action="index.php?/auth/login/L3JlcG9ydHMvZ2V0X2h0bWwvMjc0LWQ3N2FlMzQyOGYzOTY2YTNkNWU0MTMxNTkxNmRlMTE3MjFlYTI3OGZmZmUwMDBhNzY1MTBjNzk0NmZjYWQ0NDU:" method="post" >\n    \n    \n                        <div style="min-height:24px;"></div>\n            \n    <div class="form-group"  style=\'padding-bottom:10px\';>\n        <div class=\'login-inputx\'>\n            <input id=\'name\' class="login-input " type=\'text\'\n                   name="name" id="name">\n\n                            <label for=\'name\' class="login-label">Email</label>\n                    </div>\n    </div>\n\n    \n    <div class="form-group" style=\'padding-bottom:10px; margin-top: -9px;\'\' >\n        <div class=\'login-inputx\'>\n            <input id=\'password\' class="login-input "\n                   type=\'password\' name="password" id="password" autocomplete=off>\n            <label for=\'password\' class="login-label">Password</label>\n        </div>\n    </div>\n    <div class=\'display-flex\' style=" margin-bottom:40px;">\n        <div style="float:left;">\n                    </div>\n                    <a href="index.php?/auth/forgot_password"\n               class="loginpage-forgotpassword" style="margin-bottom:10px;">\n                Forgot your password?            
</a>\n            </div>\n\n            <label class="loginpage-container">\n            Keep me logged in            <input type="checkbox" checked="checked" id="rememberme" name="rememberme"\n                   value="1" checked="checked"/>\n            <span class="loginpage-checkmark"></span>\n        </label>\n    \n        <button id=\'button_primary\' class="loginpage-button-sso-disable loginpage-button-sso-disable-hover  loginpage-button-sso-disable-active">\n        <span class="single-sign-on"> Log In</span>\n    </button>\n\n    </form>\n    </div>\n\t</div>\n<br/>\n<span class="loginpage-version">v6.2.1.1003</span>\n</div>\n\n\n\t\t\t<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/extensions-combined.js"></script>\n\t\t<script type="text/javascript" src="https://static.testrail.io/6.2.1.1003/js/application-combined.js"></script>\n\t\n<script type="text/javascript">\n$(document).ready(function()\n{\n\t\tApp.Translations.add(\n\t\t"timespans_hour_short",\n\t\t"h"\t);\n\t\tApp.Translations.add(\n\t\t"timespans_minute_short",\n\t\t"m"\t);\n\t\tApp.Translations.add(\n\t\t"timespans_second_short",\n\t\t"s"\t);\n\t});\n</script>\n\n\n</body>\n</html>\n    <script type="text/javascript">\n        var browser = function() {\n            // Return cached result if avalible, else get result then cache it.\n            if (browser.prototype._cachedResult)\n                return browser.prototype._cachedResult;\n        \n            // Opera 8.0+\n            var isOpera = (!!window.opr && !!opr.addons) || !!window.opera || navigator.userAgent.indexOf(\' OPR/\') >= 0;\n        \n            // Firefox 1.0+\n            var isFirefox = typeof InstallTrigger !== \'undefined\';\n        \n            // Safari 3.0+ "[object HTMLElementConstructor]"\n            var isSafari = /constructor/i.test(window.HTMLElement) || (function (p) { return p.toString() === "[object SafariRemoteNotification]"; })(!window[\'safari\'] || 
safari.pushNotification);\n        \n            // Internet Explorer 6-11\n            var isIE = /*@cc_on!@*/false || !!document.documentMode;\n        \n            // Edge 20+\n            var isEdge = !isIE && !!window.StyleMedia;\n        \n            // Chrome 1+\n            var isChrome = !!window.chrome && !!window.chrome.webstore;\n        \n            // Blink engine detection\n            var isBlink = (isChrome || isOpera) && !!window.CSS;\n        \n            return browser.prototype._cachedResult =\n                isOpera ? \'Opera\' :\n                isFirefox ? \'Firefox\' :\n                isSafari ? \'Safari\' :\n                isChrome ? \'Chrome\' :\n                isIE ? \'IE\' :\n                isEdge ? \'Edge\' :\n                isBlink ? \'Blink\' :\n                "Don\'t know";\n        };\n        \n        $(\'input[type=password]\').val(\'\');\n        if(browser() == \'Edge\'){\n            $("#password").removeAttr("autocomplete");\n        }\n        if(browser() != \'IE\' && browser() != \'Edge\'){\n            $("#password").attr("autocomplete","new-password");\n        }\n\n        $(\'.login-input\').on(\'blur change\',function () {\n\n            var $label, $this, $value;\n            $this = $(this);\n            $label = $this.siblings("login-label");\n            $value = $this.val();\n            $label.removeClass("label-active");\n            if ($value !== "") {\n                return $this.addClass("input-notempty");\n            } else {\n                return $this.removeClass("input-notempty");\n            }\n        });\n    </script>\n'

I tried to download the file using

import urllib.request

url = 'http://example.com/'
response = urllib.request.urlopen(url)
data = response.read()

and also

import wget 
wget.download(url)

Both return the same content as r.content above.

When I paste the link above into a browser, it downloads a zip file.

Since the page also requires authentication, I tried the requests_html module with a session, as in the code below:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://taxiforsure.testrail.net/index.php?/reports/get_html/274")

But the content is still the same as r.content above.

The data is behind an authentication scheme of some sort. Since your browser is authenticated (I assume), it works fine there, but urllib and wget are not authenticated, so what they get back is a page requesting authentication (note the <title>Login - TestRail</title> in the content you posted).
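A quick way to confirm which of the two you received is to inspect the bytes themselves rather than trust the URL. The sketch below is only an illustration: the helper names are made up for this answer, and it assumes the report really is a zip archive when the request is authenticated. Note that the Content-Encoding: gzip header only says the HTML was compressed in transit; it has nothing to do with the zip file you expect.

```python
import io
import zipfile


def is_zip_payload(body: bytes) -> bool:
    """Return True if the bytes look like a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(body))


def describe_response(content_type: str, body: bytes) -> str:
    """Classify a response as 'zip', 'html', or 'other' (hypothetical helper)."""
    if is_zip_payload(body):
        return "zip"
    if content_type.lower().startswith("text/html"):
        return "html"
    return "other"
```

With the response from the question, describe_response(r.headers['Content-Type'], r.content) would report 'html', which matches the login page shown above.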

You need to look at the documentation for this testrail thing and find out whether there is an official way to get programmatic access to instances (e.g. an official API with API keys, that sort of thing). If there is, use that.
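If the documentation does describe an HTTP API that accepts an email and API key over HTTP basic authentication, building the request with urllib could look roughly like this. Treat every specific here as an assumption to check against the docs: the index.php?/api/v2/ prefix, the JSON content type, and the build_api_request helper itself are not taken from the question, and the endpoint is left as a parameter because I don't know which one returns the report.

```python
import base64
import urllib.request


def build_api_request(base_url, endpoint, user, api_key):
    """Build (but do not send) a basic-auth request for an assumed API layout."""
    # Assumed URL layout; verify the real prefix in the official documentation.
    url = f"{base_url.rstrip('/')}/index.php?/api/v2/{endpoint}"
    token = base64.b64encode(f"{user}:{api_key}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    # Assumption: the API expects JSON requests.
    request.add_header("Content-Type", "application/json")
    return request
```

The returned object can be passed to urllib.request.urlopen() once the endpoint and credentials are known to be right.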

If there is not, you'll probably need to emulate the browser from Python, either by hand (via urllib, cookiejars and such) or with a web scraping system like scrapy. I assume access to authentication-protected resources is a common problem in that space.
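Doing it by hand with urllib and a cookie jar might be sketched as follows. The field names name, password and rememberme come from the login form in the content you posted; everything else is an assumption, including the helper names, and that the form needs no hidden or CSRF fields. The login URL has to be read from the form's action attribute, and I have not run this against a real instance.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener():
    """Build an opener that stores and resends cookies, like a browser session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login_form(email, password):
    """URL-encode the login fields seen in the form from the question."""
    fields = {"name": email, "password": password, "rememberme": "1"}
    return urllib.parse.urlencode(fields).encode("ascii")


def fetch_after_login(opener, login_url, report_url, email, password):
    """POST the login form, then request the report with the same cookies."""
    with opener.open(login_url, data=encode_login_form(email, password)) as resp:
        resp.read()  # drain the login response; any session cookies are now in the jar
    with opener.open(report_url) as resp:
        return resp.read()
```

If the bytes returned by fetch_after_login are still the login page, the form most likely expects extra fields that the browser sends and this sketch does not.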


 