
Jsoup not parsing entire html body?

Is there some kind of limit on how much jsoup will parse? I have been dealing with memory issues, which I have open as a separate question on this site, but I have started to realize that I am not even getting all the data I need.

I am using jsoup to parse an HTML page. It's a test page that contains nothing but millions of numbers separated by whitespace. When I parse it with jsoup, I get some of the text, but not all of it.

For example, if I have a String text that contains the HTML from .parse(), it only holds about half of the numbers on the web page. If I go to the web page, grab the last number, and call .contains() on that text, the check fails. But if I call .contains() with a number that's halfway through the HTML, it passes. What does this mean?

Even weirder, if I parse the HTML and write it to a text file, the file is empty except for the first few words on the page. The test page basically says "test page" followed by millions of numbers, and my text file only says "test page" with no numbers. Yet the numbers are definitely in the parsed text, because I can call .contains() on it to check which numbers are there.

    html = (Jsoup.connect(url.toString()).get().html());
    Document doc = Jsoup.parse(html);
    text = (doc.body().text());
    PrintWriter out = new PrintWriter("filename2.txt");

That is the last relevant test code.

Edit: I wasted so many hours on this, and the answer was as simple as the one in "not able to parse complete html of a url using Jsoup". Basically, jsoup has a 1 MB limit on the size of the body it downloads, so you have to remove that constraint.

I solved the HTML parsing issue, but not the fact that the text won't print to a file :/
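One possible cause of the file problem, assuming the posted snippet is complete: the PrintWriter is created but nothing shown writes to it or closes it, and PrintWriter buffers its output until it is flushed or closed. The sketch below is only an illustration of that idea. The class name WriteTextToFile, the helper writeText, and the sample string are made up for the example and are not from the original code.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class WriteTextToFile {
    // Hypothetical helper: writes the whole string to the file.
    // try-with-resources closes the writer, which also flushes its buffer.
    static void writeText(Path file, String text) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            out.print(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text returned by doc.body().text()
        String text = "test page 1 2 3 4 5";
        Path file = Paths.get("filename2.txt");
        writeText(file, text);

        // Read the file back to confirm that everything was written
        String written = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println(written.equals(text));
    }
}
```

If the real code already calls print() and close(), the cause is somewhere else, and it may be worth checking how the file is being viewed, since some editors struggle with a single line that is several megabytes long.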

Jsoup restricts both the maximum size of the retrieved document and the time it may take to fetch it. Your document seems to be larger than the default size limit, so you must specify other limits:

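    html = Jsoup.connect(url.toString())  // the limits below are illustrative values (10 MB, 60 s)
            .maxBodySize(10 * 1024 * 1024).timeout(60 * 1000).get().html();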

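Setting maxBodySize or timeout to 0 removes the respective limit altogether. Note that this can be dangerous: without a timeout, a server that never finishes responding can stall your application forever, and without a size limit a very large page can exhaust memory.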
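To see why a size cap produces exactly the symptom from the question (a number from the middle of the page is found, the last one is not), here is a small simulation that uses only the standard library. It is not jsoup's actual implementation. The class BodySizeCapDemo, the methods readCapped and numbers, and the 1 MB cap are assumptions made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class BodySizeCapDemo {
    // Reads at most maxBytes from the stream; maxBytes == 0 means "no limit".
    static String readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (maxBytes > 0 && buf.size() + n > maxBytes) {
                // Keep only the bytes that still fit under the cap, then stop
                buf.write(chunk, 0, maxBytes - buf.size());
                break;
            }
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
    }

    // Builds a fake page: "test page" followed by 1..count separated by spaces.
    static String numbers(int count) {
        StringBuilder sb = new StringBuilder("test page");
        for (int i = 1; i <= count; i++) {
            sb.append(' ').append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = numbers(300000).getBytes(StandardCharsets.US_ASCII); // roughly 2 MB

        String capped = readCapped(new ByteArrayInputStream(body), 1024 * 1024);
        String full = readCapped(new ByteArrayInputStream(body), 0);

        System.out.println(capped.contains(" 100000 ")); // true: lies within the first 1 MB
        System.out.println(capped.contains(" 300000"));  // false: cut off by the cap
        System.out.println(full.contains(" 300000"));    // true: no cap applied
    }
}
```

The capped read returns without any error, which is why the truncation is easy to miss until you look for content near the end of the page.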

Thanks for the answer, it really helped with my task. I added the following lines of code.

