
Reading a web page loses some information, whether using Python or Java

When I try to read a certain web page, the page source looks like the following:

<p/><table border="1" align="center" cellpadding="10"><tbody><tr><td><a href="/cgi-bin/query/C?c101:./temp/~c1011jI5AQ" title="Displays without navigation or highlighting">Printer Friendly</a>[<a href="/home/billdwnloadhelp.html">Help</a>]</td>

But when I use Python's urllib2, urllib, or requests to read this web page, the result is always the following:

<p/><a href="/[<a href="%s">Help</a>]</td>`/C?query:c101" Printer Friendly</a><p/>

So why can't I read all the information? The very important part, C?c101:./temp/~c1011jI5AQ, is lost.

I tried reading the page with Java, and the situation is the same. I also tried different operating systems (Mac, Linux, and Windows), with the same result. How can I solve this problem?

I'm not sure I understand correctly: the second example is what you get with either Python or Java. And the first one? Is it obtained by looking at the "source code" in a browser? In that case, there are three possible scenarios:

  • First (and less likely), your browser's "view source code" displays source that has been modified, altered, or generated by JavaScript
  • Second, the server generates different content based on the "client signature" (formally, user-agent identification)
  • Third, the server provides different content based on the cookies stored in your browser
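The second and third scenarios can also be probed from Python itself. The following is only a minimal sketch, assuming Python 3's `urllib.request`; the URL, the User-Agent string, and the cookie value are placeholders made up for illustration, not values taken from the question.

```python
import urllib.request


def build_browser_like_request(url, user_agent, cookie=None):
    """Build a request carrying browser-like headers (nothing is sent yet)."""
    headers = {"User-Agent": user_agent}
    if cookie is not None:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


def fetch(request, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical values, for illustration only.
req = build_browser_like_request(
    "http://example.com/cgi-bin/query",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
    cookie="session=PLACEHOLDER",
)
# body = fetch(req)  # uncomment to actually perform the download
```

If the body returned with these headers matches what the browser shows, the server is probably varying its output on the user agent or on cookies; if it still differs, the JavaScript scenario becomes more plausible.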

As a test, you might use curl to download the page and do some comparisons. It is well suited to that task since, among its many options, it lets you change the user-agent identification sent to the server, and so pretend to be IE, Firefox, or whatever you like:

curl(1)                           Curl Manual                          curl(1)

NAME
       curl - transfer a URL

SYNOPSIS
       curl [options] [URL...]

DESCRIPTION
       curl  is  a tool to transfer data from or to a server, using one of the
       supported protocols (HTTP, HTTPS, FTP, FTPS,  SCP,  SFTP,  TFTP,  DICT,
       TELNET,  LDAP  or  FILE).  The command is designed to work without user
       interaction.

       curl offers a busload of useful tricks like proxy support, user authen‐
       tication,  FTP upload, HTTP post, SSL connections, cookies, file trans‐
       fer resume and more. As you will see below, the number of features will
       make your head spin!

[...]

      -A/--user-agent <agent string>
              (HTTP) Specify the User-Agent string to send to the HTTP server.
              Some   badly   done  CGIs  fail  if  this  field  isn't  set  to
              "Mozilla/4.0". To encode blanks  in  the  string,  surround  the
              string  with  single  quote marks. This can also be set with the
              -H/--header option of course.

              If this option is set more than once, the last one will  be  the
              one that's used.
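Once two copies of the page are available (say, one saved from the browser's "view source" and one downloaded with curl or a script), the comparison itself can be done with Python's standard `difflib`. This is a rough sketch; the `diff_pages` helper is a name chosen here, and the two sample strings are shortened stand-ins rather than the real page contents.

```python
import difflib


def diff_pages(browser_html, script_html):
    """Return unified-diff lines between two copies of a page's source."""
    return list(
        difflib.unified_diff(
            browser_html.splitlines(),
            script_html.splitlines(),
            fromfile="browser",
            tofile="script",
            lineterm="",
        )
    )


# Illustrative stand-ins for the two downloaded copies.
browser_copy = (
    '<p/>\n<a href="/cgi-bin/query/C?c101:./temp/~c1011jI5AQ">Printer Friendly</a>'
)
script_copy = '<p/>\n<a href="/">Printer Friendly</a>'

for line in diff_pages(browser_copy, script_copy):
    print(line)
```

Lines prefixed with `-` appear only in the browser copy and lines prefixed with `+` only in the script copy, which makes a missing link target easy to spot.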

