
python3 - download pdf file from url

My Python 3 code:

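import sys

import requests

# URL of the PDF, passed as the first command-line argument
url = sys.argv[1]
r = requests.get(url, stream=True)
chunk_size = 20000
# Stream the response body to disk in chunks
with open('metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)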

It saves content to metadata.pdf, but that content is not the real PDF; it is this HTML page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html>
<!-- $HTMLid:   index.html /main/6 11-Jun-2004.13:54:09 $ -->
<head>
<title>Allied Waste</title>

<script language="JavaScript">
<!--
if (top != self) {
        top.location = self.location;
    }
function doRedirect() {
  document.login.submit();
} 

function init () {
    var initChar = /^\?/;
    var list = top.location.search.replace(initChar,"");
    var parms = list.split('&');
    for ( ct=0; ct < parms.length; ct++ ) {
        vals = parms[ct].split('=');
        switch ( vals[0] ) {
            case "unitCode":
                document.login.unitCode.value = unescape(vals[1]);
                if ( document.login.unitCode.value == 'undefined' || document.login.unitCode.value == '' )
                    document.login.unitCode.value = "ALW";
                break;
      default:
        document.login.unitCode.value = "ALW";
                break;
        }
    }
    document.login.submit();
}
//-->
</script>
</head>
<body onload="init()">
  <form name="login" action="inetSrv" method="post">
    <input type="hidden" name="type" value="SignonService"/>
    <input type="hidden" name="action" value="SignonPrompt"/>
    <input type="hidden" name="client" value="701122300"/>
    <input type="hidden" name="unitCode" value=""/>
  </form>
</body>
</html>

Any help? How can I save the real content of the file instead of this HTML? It should be a real PDF, but when I download it I get this HTML page.

Update:

The server's response when I use a Python session:

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n\n                                                                                                              \n<head><title></title>\n                     \n<LINK REL="StyleSheet" HREF="styles/mainStyle.css">\n</head>\n\n<body>\n<div style="float: left; border: 1px solid black; background-color: #FFFFFF; padding: 5px">\n\t<div class="TitleFont">Operation failed</div>\n\t<div class="TitleFont">Reason</div>\n\t<div>\n\t<div class="custom-message-box">\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >A server error has occurred.</div>\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >Error reference id: DLY-00716</div>\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >Time: Wed Jul 15 05:33:12 CDT 2020</div>\n\t</div>\n\t</div>\n\t<div style="width: 600px">\n\t\t<p class="form-style-text">\n\t\tIf contacting customer support, please quote the above error reference id. You may be able to press the browser Back button to return to the previous screen. Otherwise you may need to login again. We apologize for the inconvenience.\n\t\t</p>\n\t</div>\n</div>\n\n</body>\n</html>\n\n'

It looks like the page is a redirect to a login page. If that is an option for you, downloading the file manually may be simpler.

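Otherwise, you will have to handle the login process to retrieve the authentication cookie that it will (probably) give you, and then send that cookie along with the GET request so that the expected PDF becomes available.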
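
A minimal sketch of that approach, assuming the site uses an ordinary cookie-based form login. A requests.Session keeps any cookies the server sets at login and sends them with later requests. The login URL and the form field names in the usage comment are hypothetical placeholders, not taken from the real site; the actual values have to be read from the login form the browser submits. The check for the "%PDF-" signature is an extra safeguard, so that an HTML login or error page is not silently saved as a PDF.

```python
import requests

PDF_MAGIC = b"%PDF-"  # a PDF file begins with this signature


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Return True if the data starts with the PDF signature."""
    return first_bytes.startswith(PDF_MAGIC)


def login(session: requests.Session, login_url: str, form_fields: dict):
    """POST the login form; the session stores any cookies the server sets."""
    response = session.post(login_url, data=form_fields)
    response.raise_for_status()
    return response


def download_pdf(session: requests.Session, pdf_url: str, dest_path: str,
                 chunk_size: int = 20000) -> None:
    """Stream pdf_url to dest_path, refusing to save a non-PDF response."""
    response = session.get(pdf_url, stream=True)
    response.raise_for_status()
    chunks = response.iter_content(chunk_size)
    # Assumption: the first chunk is at least as long as the signature.
    first = next(chunks, b"")
    if not looks_like_pdf(first):
        raise ValueError("Server did not return a PDF (possibly a login or error page)")
    with open(dest_path, "wb") as fd:
        fd.write(first)
        for chunk in chunks:
            fd.write(chunk)


# Hypothetical usage (URL and field names are placeholders):
#
#   with requests.Session() as session:
#       login(session, "https://example.com/login",
#             {"username": "me", "password": "secret"})
#       download_pdf(session, pdf_url, "metadata.pdf")
```

If the login succeeds but download_pdf still raises ValueError, the server probably expects something more than the cookie (for example a hidden form field or an extra redirect step), and the browser's network inspector is the place to see what the real login flow sends.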


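Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you reproduce one, please cite this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.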

 