How to parse a file containing HTML using Jsoup?

I have files containing HTML. I am trying to parse each file and then tokenise the text of the body. I do this with:

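// Jsoup.parse needs a java.io.File here; a plain String would be parsed as HTML markup
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());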

The above code works, but the problem is that text sitting outside of any HTML tag is also printed as part of the body. I need a way to stop this untagged text from being read. Please help, this is a time-sensitive question!

You can select and remove unwanted elements in your document.

 doc.select("body > :matchText").remove();

The above statement removes all text nodes that are direct children of the body element. The :matchText selector is rather new, so please make sure to use a reasonably recent version of jsoup (1.11.3 definitely works, but 1.10.2 does not).

More information on the selector syntax is available at https://jsoup.org/cookbook/extracting-data/selector-syntax
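
If the selector is not available in your jsoup version, the same idea (dropping text nodes that are direct children of the body) can probably be done with the DOM API instead. The following is only a minimal sketch: the class name StripLooseBodyText, the helper bodyTextWithoutLooseText and the sample HTML are made up for illustration, and it parses a String rather than a file to stay self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
public class StripLooseBodyText {

    // Returns the body text after removing text nodes that sit directly
    // under <body>, i.e. text that is not wrapped in any other tag.
    static String bodyTextWithoutLooseText(String html) {
        Document doc = Jsoup.parse(html);

        // Element.textNodes() returns only the direct child text nodes.
        // Copy the list first so removal cannot interfere with iteration.
        List<TextNode> loose = new ArrayList<>(doc.body().textNodes());
        for (TextNode node : loose) {
            node.remove();
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Made-up sample input with untagged text between the elements.
        String html = "<html><body>stray text<p>Hello</p>more stray<div>World</div></body></html>";
        System.out.println(bodyTextWithoutLooseText(html));
    }
}
```

Text nested inside elements such as p or div is kept, because only the direct children of the body are touched. To use this with a file, replace Jsoup.parse(html) with Jsoup.parse(new File(...), "UTF-8", "").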

