Java - Best way to download a webpage's source html

I'm writing a small crawler. What is the best way to download a web page's HTML source? I'm currently using the small piece of code below, but sometimes the result is only half of the page's source, and I don't know what the problem is. Some people suggested that I use Jsoup, but Jsoup's get().html() also returns only half of the page's source if the page is too long. Since I'm writing a crawler, it's very important that the method supports Unicode (UTF-8), and efficiency is also very important. I'm new to Java, so I wanted to ask what the best modern way to do this is. Thanks.

Code:

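import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

class PageDownloader
{
    // Line separator appended after each line read from the page.
    private static final String NL = System.getProperty("line.separator");

    public static String downloadPage(String url)
    {
        try
        {
            URL pageURL = new URL(url);
            StringBuilder text = new StringBuilder();
            // Decode the response stream as UTF-8.
            Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
            try {
                while (scanner.hasNextLine()){
                    text.append(scanner.nextLine()).append(NL);
                }
                // Scanner does not propagate read errors; it just stops.
                // Rethrow so a truncated page is not returned silently.
                IOException readError = scanner.ioException();
                if (readError != null) throw readError;
            }
            finally{
                scanner.close();
            }
            return text.toString();
        }
        catch(Exception ex)
        {
            // Any failure (bad URL, network error) is reported as null.
            return null;
        }
    }
}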

I use commons-io:

String html = IOUtils.toString(url.openStream(), "utf-8");
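For readers who would rather not add a dependency, the idea behind that one-liner can be approximated with the JDK alone. The following is a minimal sketch, not the commons-io implementation: the class name StreamText and the method readAll are hypothetical names chosen for illustration. It reads the whole stream through a Reader, so multi-byte UTF-8 characters that straddle a buffer boundary are still decoded correctly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper (not part of commons-io): reads a stream fully into a String.
final class StreamText {
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192];
        // The reader keeps decoder state between reads, so characters split
        // across two reads are reassembled instead of being corrupted.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }
}
```

It could be called as StreamText.readAll(url.openStream(), StandardCharsets.UTF_8). Unlike the Scanner loop in the question, a read error here surfaces as an IOException rather than ending the loop early.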

Personally, I'm very pleased with the Apache HTTP library http://hc.apache.org/httpcomponents-client-ga/ . If you're writing a web crawler, as I am, you may greatly appreciate the control it gives you over things like cookies, client sharing and the like.
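To illustrate what "cookies and client sharing" can look like in practice, here is a rough sketch that uses the JDK's built-in java.net.http.HttpClient (Java 11 or later) rather than the Apache library recommended above, so it does not show Apache's API. The class name CrawlerClient, the fetch method, the User-Agent string, and the timeout values are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical wrapper: one shared client for the whole crawler.
final class CrawlerClient {
    // Sharing a single client lets requests reuse connections and one cookie store.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // assumed value
                .header("User-Agent", "example-crawler/0.1") // placeholder identifier
                .GET()
                .build();
        // Decode the body as UTF-8 when the response does not declare a charset.
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A crawler would call CrawlerClient.fetch(url) for each page; because every call goes through the same CLIENT, cookies set by one response are sent with later requests to the same site.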
