Reading a web page loses some information, whether using Python or Java
When I try to read a certain web page, the page source looks like this:
<p/><table border="1" align="center" cellpadding="10"><tbody><tr><td><a href="/cgi-bin/query/C?c101:./temp/~c1011jI5AQ" title="Displays without navigation or highlighting">Printer Friendly</a>[<a href="/home/billdwnloadhelp.html">Help</a>]</td>
However, when I use Python's urllib2, urllib, or requests to read this page, the result is always the following:
<p/><a href="/[<a href="%s">Help</a>]</td>`/C?query:c101" Printer Friendly</a><p/>
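So why can't I read all of the information, and why is the very important part C?c101:./temp/~c1011jI5AQ lost?
I tried reading the page with Java and the situation is the same. I also tried different operating systems, such as Mac, Linux, and Windows, and the result is the same as well. How can I solve this problem?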
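I'm not sure I understand correctly: the second sample is what you get from Python or Java. And the first? Was it obtained by using a browser to view the page "source"? In that case, three situations are possible:
As a test, you can download the page with curl and make some comparisons. It is well suited to the task because, among its many options, it lets you change the user-agent identification presented to the server, so you can pretend to be IE, Firefox, or whatever you like: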
curl(1)                          Curl Manual                          curl(1)

NAME
       curl - transfer a URL

SYNOPSIS
       curl [options] [URL...]

DESCRIPTION
       curl is a tool to transfer data from or to a server, using one of the
       supported protocols (HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT,
       TELNET, LDAP or FILE). The command is designed to work without user
       interaction.

       curl offers a busload of useful tricks like proxy support, user
       authentication, FTP upload, HTTP post, SSL connections, cookies, file
       transfer resume and more. As you will see below, the number of
       features will make your head spin!

[...]

       -A/--user-agent
              (HTTP) Specify the User-Agent string to send to the HTTP
              server. Some badly done CGIs fail if this field isn't set to
              "Mozilla/4.0". To encode blanks in the string, surround the
              string with single quote marks. This can also be set with the
              -H/--header option of course.

              If this option is set more than once, the last one will be the
              one that's used.
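
The same test can be sketched from Python itself. The following is only an illustration, not a confirmed fix: it assumes Python 3's urllib.request (the question used the Python 2 urllib2 module), and the names BROWSER_UA, build_request, fetch and diff_pages, as well as the User-Agent value, are made up for this example. It sends a browser-like User-Agent and diffs the result against the source saved from the browser.

```python
import difflib
import urllib.request

# Assumed example value; substitute the string your own browser sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Download the page body as text, sending the chosen User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def diff_pages(browser_source, script_source):
    """Return unified-diff lines between two versions of the page source."""
    return list(
        difflib.unified_diff(
            browser_source.splitlines(),
            script_source.splitlines(),
            fromfile="browser",
            tofile="script",
            lineterm="",
        )
    )
```

To use it, save the source shown by the browser to a file, call fetch() on the same URL, and pass both strings to diff_pages(). If the differences disappear once the User-Agent matches the browser's, the server is probably varying its output by client; if they remain, something else (cookies, or the session-specific temp path in the link) is more likely involved.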