Java - Best way to download the HTML source code of a web page

Question

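I'm writing a small crawler. What is the best way to download the HTML source of a web page? I'm currently using the short piece of code below, but sometimes the result is only half of the page source, and I don't know what's wrong. Some people have suggested that I use Jsoup, but the get.html() function in Jsoup also returns only half of the page source if the page is too long. Since I'm writing a crawler, it's important that the method supports Unicode (UTF-8), and efficiency matters too. I'd like to know the best modern approach, so I'm asking here, as I'm new to Java. Thanks.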

Code:

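import java.net.URL;
import java.util.Scanner;

// NL was not defined in the original snippet; it is assumed to be the platform line separator.
private static final String NL = System.getProperty("line.separator");

public static String downloadPage(String url)
    {
        try
        {
            URL pageURL = new URL(url);
            StringBuilder text = new StringBuilder();
            // Decode the response as UTF-8 while reading it line by line.
            Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
            try {
                while (scanner.hasNextLine()){
                    // Chain append() calls instead of building a temporary String with +.
                    text.append(scanner.nextLine()).append(NL);
                }
            }
            finally{
                scanner.close();
            }
            return text.toString();
        }
        catch(Exception ex)
        {
            // Any failure (bad URL, network error) is reported to the caller as null.
            return null;
        }
    }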

Answer 1

I use commons-io:

    String html = IOUtils.toString(url.openStream(), "utf-8");
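
Where adding commons-io is not an option, the same idea (read the whole stream, then decode it) can be sketched with the standard library alone. The class name StreamText and the method name readAll are made up for this illustration, and the 8192-character buffer is an arbitrary choice. Unlike Scanner, which treats an IOException from the underlying stream as end of input (the error is only visible through ioException()), this version lets read errors reach the caller, which may help when diagnosing truncated pages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of commons-io or the original post.
final class StreamText {
    private StreamText() {}

    /** Reads the stream to its end and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192]; // arbitrary buffer size
        // try-with-resources closes the reader and the wrapped stream.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }
}
```

It would be called as StreamText.readAll(url.openStream(), StandardCharsets.UTF_8), where url is a java.net.URL as in the one-liner above.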

Answer 2

Personally, I'm very pleased with the Apache HTTP library http://hc.apache.org/httpcomponents-client-ga/ . If you are writing a web crawler, as I am, you will probably appreciate the control it gives you over things like cookies and client sharing.
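
To illustrate what "cookies and client sharing" can look like in practice, here is a minimal sketch. Note that it uses the JDK's built-in java.net.http.HttpClient (Java 11 and later) rather than Apache HttpComponents, purely to stay dependency-free; the Apache library offers comparable controls through its own API. The class name PageFetcher, the timeouts, and the choice to reject any non-200 status are assumptions made for this example.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical crawler helper built on the JDK client, not on Apache HttpComponents.
final class PageFetcher {
    // One client shared by all requests, so connections and cookies are reused.
    private final HttpClient client;

    PageFetcher() {
        this.client = HttpClient.newBuilder()
                // In-memory cookie store; only cookies from the original server are accepted.
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10)) // assumed timeout
                .build();
    }

    /** Downloads the page body decoded as UTF-8; non-200 responses raise an IOException. */
    String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // assumed per-request timeout
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A crawler would typically create one PageFetcher and call fetch(url) for every page, so that cookies set by a site carry over to later requests.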

Answer 1: 5, 2011-05-02 19:17:00

Answer 2: 2, accepted, 2011-05-02 20:34:34
