
Reading HTML Page and creating a text file using JSOUP

I am trying to read the IMDB list of top 50 movies. The code runs without errors, but it stops reading at number 43 out of a list of 50.

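import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FetchData {

    public static void main(String[] args) {
        try {
            // Fetch the list page; no maxBodySize is set here, so jsoup's default limit applies
            Document doc = Jsoup.connect("https://www.imdb.com/list/ls053181721/").userAgent("Mozilla/17.0").get();
            // Select one content block per list entry
            Elements temp = doc.select("div.lister-item-content");

            int i = 0;
            File file = new File("C:\\Demo Java\\IMDBList.txt");
            FileWriter writer = new FileWriter(file);
            for (Element movieList : temp) {
                i++;
                // Use the text of the first link in each block as the entry name
                String title = movieList.getElementsByTag("a").first().text();
                System.out.println(i + " " + title);
                writer.write(i + ". " + title + "\n");
            }
            writer.close();
        } catch (IOException ex) {
            Logger.getLogger(FetchData.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}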

The HTML document you are loading through Jsoup does not load entirely, because it exceeds the default maximum body size of 1MB. The response is cut off at that limit, so the entries near the end of the list never reach the parser. You need to increase the maximum allowed body size of the request in order to load the complete document.

Document doc = Jsoup.connect("https://www.imdb.com/list/ls053181721/")
                    .userAgent("Mozilla/17.0")
                    .maxBodySize(0)
                    .get();

Note: Setting maxBodySize(0) removes the limit and allows a body of unlimited size.
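To see why a body size cap makes the last entries disappear, here is a small simulation that uses only the Java standard library. It is not jsoup's implementation: the class TruncationDemo, its methods, the fake page, and the padding size are all hypothetical and chosen only so that the page exceeds 1MB. The truncate method borrows the convention described above, where a limit of 0 means unlimited.

```java
import java.nio.charset.StandardCharsets;

public class TruncationDemo {

    // Builds a fake list page (hypothetical data): `count` entries, each padded with filler.
    static String buildPage(int count, int paddingPerItem) {
        StringBuilder filler = new StringBuilder();
        for (int k = 0; k < paddingPerItem; k++) {
            filler.append('x');
        }
        StringBuilder page = new StringBuilder("<html><body>");
        for (int i = 1; i <= count; i++) {
            page.append("<div class=\"lister-item-content\"><a>Movie ").append(i).append("</a>")
                .append(filler)
                .append("</div>");
        }
        return page.append("</body></html>").toString();
    }

    // Cuts the body at maxBytes; 0 means "no limit", as with maxBodySize(0).
    static String truncate(String body, int maxBytes) {
        if (maxBytes == 0) {
            return body;
        }
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return body;
        }
        return new String(bytes, 0, maxBytes, StandardCharsets.UTF_8);
    }

    // Counts entries whose closing tag survived the cut.
    static int countCompleteItems(String body) {
        String marker = "</div>";
        int count = 0;
        int from = body.indexOf(marker);
        while (from != -1) {
            count++;
            from = body.indexOf(marker, from + marker.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String page = buildPage(50, 25_000); // roughly 1.25MB in total
        System.out.println("1MB cap:  " + countCompleteItems(truncate(page, 1024 * 1024)));
        System.out.println("no limit: " + countCompleteItems(truncate(page, 0)));
    }
}
```

With the 1MB cap, fewer than 50 entries survive, while the unlimited run sees all 50. This matches the symptom in the question, where the loop simply ends early without throwing an error.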

Please refer to: https://jsoup.org/apidocs/org/jsoup/Connection.html#maxBodySize-int-
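Once the whole document loads, the file-writing part of the question can be made a little safer. The following is only a sketch using the standard library: the class NumberedListWriter, its methods, the temporary file, and the sample titles are hypothetical stand-ins for the titles that the jsoup loop would collect. It uses try-with-resources so the writer is closed even if a write fails.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumberedListWriter {

    // Turns ["A", "B"] into ["1. A", "2. B"], the format used in the question.
    static List<String> numberLines(List<String> titles) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < titles.size(); i++) {
            lines.add((i + 1) + ". " + titles.get(i));
        }
        return lines;
    }

    // Writes one numbered title per line; the writer is closed automatically.
    static void writeNumbered(Path target, List<String> titles) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : numberLines(titles)) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder titles and a temp file, used instead of the real list and path.
        Path target = Files.createTempFile("IMDBList", ".txt");
        writeNumbered(target, Arrays.asList("First Title", "Second Title"));
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

Collecting the titles into a list first also makes it easy to check that all 50 entries were read before anything is written to disk.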
