I have files containing HTML, and I am trying to parse each file and then tokenise the text of its body. I do this with:
docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());
The code above works fine, but the problem is that text sitting outside of any HTML tag is also printed as part of the body. I need a way to stop this untagged text from being read. Please help, this is a time-sensitive question!
You can select and remove unwanted elements in your document.
doc.select("body > :matchText").remove();
The statement above removes all text nodes that are direct children of the body element. The :matchText selector is rather new, so please make sure to use a reasonably recent version of jsoup (1.11.3 definitely works, but 1.10.2 does not).
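If the :matchText selector is not available in your jsoup version, a similar effect can probably be achieved by walking the direct child text nodes of the body yourself. The following is only a minimal sketch, not a definitive implementation: the class name LooseTextRemover, the method name taggedBodyText, and the sample HTML are made up for illustration. It relies on Element.textNodes(), which returns the text nodes that are direct children of an element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class, named here for illustration only.
public class LooseTextRemover {

    // Parses the HTML, removes text nodes that are direct children of <body>,
    // and returns the text of what remains.
    public static String taggedBodyText(String html) {
        Document doc = Jsoup.parse(html);
        // textNodes() returns a separate list of the direct child text nodes,
        // so removing each node from the document while iterating is safe.
        for (TextNode loose : doc.body().textNodes()) {
            loose.remove();
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Made-up sample input: two loose text runs around tagged content.
        String html = "<html><body>loose text<p>Hello</p>more loose text<div>World</div></body></html>";
        System.out.println(taggedBodyText(html));
    }
}
```

Like the selector version, this only touches text directly under the body; text nested inside other elements is kept.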
More information on the selector syntax is available at https://jsoup.org/cookbook/extracting-data/selector-syntax