Reading a web page loses some information, whether using Python or Java
When I try to read a certain web page, the page source looks like this:
<p/><table border="1" align="center" cellpadding="10"><tbody><tr><td><a href="/cgi-bin/query/C?c101:./temp/~c1011jI5AQ" title="Displays without navigation or highlighting">Printer Friendly</a>[<a href="/home/billdwnloadhelp.html">Help</a>]</td>
However, when I read this page with Python's urllib2, urllib, or requests, the result is the same every time:
<p/><a href="/[<a href="%s">Help</a>]</td>`/C?query:c101" Printer Friendly</a><p/>
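So why can't I read all of the information, and why is the very important part, C?c101:./temp/~c1011jI5AQ, lost?
I tried reading the page with Java and the result was the same. I also tried different operating systems (Mac, Linux, and Windows), again with the same result. How can I solve this problem?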
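I'm not sure I understand correctly: the second example is what you get from Python or Java. And the first one? Was it obtained by using "View Source" in a browser? If so, there are three possible cases:
As a test, you can download the page with curl and compare the results. curl is well suited to this task because, among its many options, it lets you change the user-agent identification sent to the server, so you can pretend to be IE, Firefox, or anything else you like: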
curl(1)                           Curl Manual                          curl(1)

NAME
       curl - transfer a URL

SYNOPSIS
       curl [options] [URL...]

DESCRIPTION
       curl is a tool to transfer data from or to a server, using one of the
       supported protocols (HTTP, HTTPS, FTP, FTPS, SCP, SFTP, TFTP, DICT,
       TELNET, LDAP or FILE). The command is designed to work without user
       interaction.

       curl offers a busload of useful tricks like proxy support, user
       authentication, FTP upload, HTTP post, SSL connections, cookies, file
       transfer resume and more. As you will see below, the number of
       features will make your head spin!

[...]

       -A/--user-agent
              (HTTP) Specify the User-Agent string to send to the HTTP
              server. Some badly done CGIs fail if this field isn't set to
              "Mozilla/4.0". To encode blanks in the string, surround the
              string with single quote marks. This can also be set with the
              -H/--header option of course.

              If this option is set more than once, the last one will be the
              one that's used.
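The same comparison can be sketched in Python, which is what the question uses. The following is a minimal sketch under stated assumptions, not a confirmed fix: it uses the standard library's urllib.request (Python 3), the helper names build_request, fetch, diff_pages, and compare are hypothetical, and the browser User-Agent string is only an illustrative value. It fetches the same URL twice, once with the default identifier and once with a browser-like one, and prints the difference between the two responses.

```python
import difflib
import urllib.request

# Illustrative browser-like identifier (an assumption, not taken from the question).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=None):
    """Build a Request, overriding the User-Agent header only if one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch(url, user_agent=None, timeout=10):
    """Download the page body as text; undecodable bytes are replaced."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def diff_pages(default_html, browser_html):
    """Return unified-diff lines between the two responses (empty if identical)."""
    return list(
        difflib.unified_diff(
            default_html.splitlines(),
            browser_html.splitlines(),
            fromfile="default-agent",
            tofile="browser-agent",
            lineterm="",
        )
    )


def compare(url):
    """Fetch url with both identifiers and print any differences."""
    lines = diff_pages(fetch(url), fetch(url, BROWSER_UA))
    print("\n".join(lines) if lines else "Responses are identical.")


# Usage (requires network access; substitute the real page address):
#     compare("http://example.com/")
```

If the two responses are identical, the User-Agent is probably not what makes the server return different content, and other request details such as cookies or the exact URL are worth comparing next.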