How to parse a file containing HTML using jsoup?

I have files containing HTML. I am trying to parse each file and then tokenise the text of the body. I do this with:

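// Needs java.io.File, org.jsoup.Jsoup and org.jsoup.nodes.Document.
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", ""); // empty base URI
System.out.println(docs.body().text());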

The code above works, but text that sits outside the HTML tags, with no enclosing tag of its own, is also printed as part of the body. I need a way to stop that text from being read. This is time-sensitive, so any help is appreciated.

You can select and remove unwanted elements in your document.

 doc.select("body > :matchText").remove();

The statement above removes all text nodes that are direct children of the body element. The :matchText selector is fairly new, so make sure you use a reasonably recent version of jsoup (1.11.3 definitely works, 1.10.2 does not).

More information on the selector syntax is available at https://jsoup.org/cookbook/extracting-data/selector-syntax
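
If the jsoup version in use does not support :matchText, the same idea (dropping text nodes that are direct children of the body) can probably be expressed with Element.textNodes() instead of a selector. The following is only a minimal sketch under a few assumptions: jsoup is on the classpath, the class name BodyTextCleaner and the method bodyTextWithoutLooseText are hypothetical names made up for illustration, and the input is passed as a string rather than read from a file to keep the example self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical class name, used only for this illustration.
public class BodyTextCleaner {

    // Hypothetical helper: returns the body text after removing the text nodes
    // that are direct children of <body>. Text nested inside child elements
    // such as <p> is kept.
    public static String bodyTextWithoutLooseText(String html) {
        Document doc = Jsoup.parse(html);
        Element body = doc.body();
        // textNodes() returns its own list, so removing nodes from the DOM
        // while iterating over it is safe.
        for (TextNode node : body.textNodes()) {
            node.remove();
        }
        return body.text();
    }

    public static void main(String[] args) {
        // Sample input with text before <html>, loose text inside <body>,
        // and text after </html>. The parser places all of it in the body.
        String html = "stray text before <html><body><p>Hello</p> loose text</body></html> stray text after";
        System.out.println(bodyTextWithoutLooseText(html));
    }
}
```

To work on a file as in the question, Jsoup.parse(html) could be swapped for Jsoup.parse(new File("myFile"), "UTF-8", ""). Like the selector-based answer, this also drops legitimate text that sits directly inside the body without a wrapping tag, so it only fits documents where real content is always wrapped in elements.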

