How to combine HTTP headers and content reading in a Java program?

I have a program that is supposed to fetch and parse the content of an HTML page.

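import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {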
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();        
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, something at the HTTP level seems to be blocking access to the content. To investigate, I wrote the following program to print the response headers of the site.

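import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class Get_Header {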
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    Map responseMap = connection.getHeaderFields();
    for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();) {
      String key = (String) iterator.next();
      System.out.println(key + " = ");

      List values = (List) responseMap.get(key);
      for (int i = 0; i < values.size(); i++) {
        Object o = values.get(i);
        System.out.println(o + ", ");
      }
    }
  }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 
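
The `null =` entry in the output above is the HTTP status line: `URLConnection.getHeaderFields()` stores it under a `null` key. As a minimal sketch (the class name `HeaderDump` and the "Status" label are made up for illustration), the header map can be printed with generics and with the status line pulled out first. The sample map in `main` is hand-built from a few of the values shown above, so no network access is needed to run it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderDump {
    // Formats a header map as "Name: v1, v2" lines.
    // The entry stored under the null key (the status line) is emitted first.
    static List<String> format(Map<String, List<String>> headers) {
        List<String> lines = new ArrayList<>();
        List<String> status = headers.get(null);
        if (status != null) {
            lines.add("Status: " + String.join(", ", status));
        }
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            if (entry.getKey() == null) {
                continue; // already handled above
            }
            lines.add(entry.getKey() + ": " + String.join(", ", entry.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hand-built sample; with a real connection, pass connection.getHeaderFields().
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put("Server", Arrays.asList("cloudflare-nginx"));
        sample.put(null, Arrays.asList("HTTP/1.1 403 Forbidden"));
        sample.put("Content-Type", Arrays.asList("text/html; charset=UTF-8"));
        format(sample).forEach(System.out::println);
    }
}
```

Here the `403 Forbidden` status is the important part: the server refused the request, which is why the parsing program never sees the page content.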

I can get the headers now, but how should I combine the two pieces of code into one complete program?

Many thanks in advance.

You can use the Response class to fetch the page you need, use it to display the headers, and then convert it to a Document to extract the text you need:

Connection.Response response = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
            .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
            .method(Connection.Method.GET)
            .followRedirects(false)
            .execute();

Document doc = response.parse();
Elements cells = doc.select("td.i");
Iterator<Element> iterator = cells.iterator();

while (iterator.hasNext()) {
    Element cell = iterator.next();
    String university = cell.select("a").text();
    String country = cell.nextElementSibling().select("img").attr("alt");
    System.out.printf("country : %s, university : %s %n", country, university);
}
System.out.println(response.headers());
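
The same idea (one request, then read both headers and body from that single response) can also be sketched with the JDK alone. The example below is an illustration under stated assumptions, not the answer's method: it swaps Jsoup for `HttpURLConnection`, and it talks to a throwaway local server that stands in for the real site and returns 403 unless the User-Agent starts with "Mozilla". The names `SingleConnectionFetch`, `Result`, `startDemoServer`, and `demo`, as well as the served HTML, are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class SingleConnectionFetch {
    // Status code, Content-Type header, and body of one response.
    static final class Result {
        final int status;
        final String contentType;
        final String body;
        Result(int status, String contentType, String body) {
            this.status = status;
            this.contentType = contentType;
            this.body = body;
        }
    }

    // Sends one GET with the given User-Agent and reads status, header,
    // and body from that same connection.
    static Result fetch(String address, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            int status = conn.getResponseCode();
            String contentType = conn.getHeaderField("Content-Type");
            // For 4xx/5xx responses the body is on the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            return new Result(status, contentType, in == null ? "" : readAll(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Hypothetical stand-in for the real site: 403 unless the UA starts with "Mozilla".
    static HttpServer startDemoServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean ok = ua != null && ua.startsWith("Mozilla");
                String html = ok ? "<td class=\"i\"><a>Demo University</a></td>" : "Forbidden";
                byte[] body = html.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/html; charset=UTF-8");
                exchange.sendResponseHeaders(ok ? 200 : 403, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches the demo page twice: once with a non-browser UA, once with a Mozilla UA.
    static Result[] demo() {
        HttpServer server = startDemoServer();
        try {
            String address = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            return new Result[] {
                fetch(address, "Java-demo"),
                fetch(address, "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
            };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) {
        for (Result r : demo()) {
            System.out.println(r.status + " | " + r.contentType + " | " + r.body);
        }
    }
}
```

Because the header and the body come from the same connection object, the User-Agent that was set is the one the server actually sees.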

The "User-Agent" property is set on the URLConnection object (spoof), not on the URL itself, so it is lost when you convert the URL back to a String: Jsoup.connect(String) opens its own, separate connection, and the header you set is never sent.
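
A small sketch of why the header goes missing, using only the JDK: a request property belongs to the `URLConnection` object it was set on, and a second connection opened from the same `URL` does not inherit it. The class name `UserAgentScope` and the method `compare` are made up for this illustration. Nothing is sent over the network, because neither connection is ever connected.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class UserAgentScope {
    static final String UA = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)";

    // Opens two connections from the same URL, sets the User-Agent only on the
    // first, and returns the User-Agent request property seen on each.
    static String[] compare(String address) {
        try {
            URL url = new URL(address);
            URLConnection first = url.openConnection();
            first.setRequestProperty("User-Agent", UA);
            // A fresh connection object; it knows nothing about the property set on 'first'.
            URLConnection second = url.openConnection();
            return new String[] {
                first.getRequestProperty("User-Agent"),
                second.getRequestProperty("User-Agent")
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] seen = compare("http://www.4icu.org/reviews/index2.htm");
        System.out.println("first connection:  " + seen[0]);
        System.out.println("second connection: " + seen[1]);
    }
}
```

Jsoup.connect(url.toString()) is in the same position as the second connection here: it only receives the address, so any header has to be set on the Jsoup connection itself.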

Setting the user agent on the Jsoup connection seems to work:

public static void main(String[] args) throws Exception {
    System.out.println("Started");

    String url = "http://www.4icu.org/reviews/index2.htm";
    Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

    Elements cells = doc.select("td.i");

    Iterator<Element> iterator = cells.iterator();

    while (iterator.hasNext()) {
        Element cell = iterator.next();
        String university = cell.select("a").text();
        String country = cell.nextElementSibling().select("img").attr("alt");

        System.out.printf("country : %s, university : %s %n", country, university);
    }
}
