How to parse a file containing HTML using jsoup?

Question

I have files containing HTML, and I am trying to parse each file and then tokenise the text of the body. I do this with:

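// Needs java.io.File, org.jsoup.Jsoup and org.jsoup.nodes.Document; parse(File, ...) throws IOException.
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());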

The code above works, but text that sits outside any HTML tag is also printed as part of the body. I need a way to stop this untagged text from being read. Help, this is a time-sensitive question!

Answer 1

You can select and remove unwanted elements in your document.

 doc.select("body > :matchText").remove();

The statement above removes all text nodes that are direct children of the body element. The :matchText selector is fairly new, so make sure to use a reasonably recent version of jsoup (1.11.3 definitely works, but 1.10.2 does not).

More information on the selector syntax is available at https://jsoup.org/cookbook/extracting-data/selector-syntax
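
If the :matchText selector is not available in the jsoup version at hand, the same effect can probably be had by walking the body's direct child text nodes and removing them one by one. The following is only a minimal sketch under that assumption: the class name BodyTextExtractor and the helper taggedBodyText are hypothetical names made up for illustration, and the sample HTML string is invented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

class BodyTextExtractor {

    // Hypothetical helper: returns the body text after dropping text
    // that is a direct child of <body> (i.e. not wrapped in any tag).
    static String taggedBodyText(String html) {
        Document doc = Jsoup.parse(html);
        // textNodes() returns a separate list of the direct child text nodes,
        // so removing each node from the DOM while iterating is safe.
        for (TextNode node : doc.body().textNodes()) {
            node.remove();
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Invented sample input: two untagged text runs around tagged content.
        String html = "<html><body>loose text<p>Hello</p>more loose<div>World</div></body></html>";
        System.out.println(taggedBodyText(html));
    }
}
```

When reading from a file instead of a string, the Document can be obtained with Jsoup.parse(File, String, String) and the same loop applied to it. Note that only direct children of the body are removed; text nested inside other elements is kept.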

solution1
0 2018-09-27 18:35:45