How do I parse a file containing HTML with JSoup?

Question

I have files containing HTML. I am trying to parse such a file and then tokenise the text of the body. I do this with:

// Jsoup.parse(File, charsetName, baseUri) reads and parses the file from disk
Document doc = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(doc.body().text());

The code above works, but text that sits outside the HTML tags, without any enclosing tag, is also printed as part of the body. I need a way to keep this text outside the HTML tags from being read. This is time-sensitive, so any help is appreciated.

Answer 1

You can select and remove unwanted elements in your document.

 doc.select("body > :matchText").remove();

The statement above removes all text nodes that are direct children of the body element. The :matchText selector is fairly new, so make sure to use a reasonably recent version of JSoup (1.11.3 definitely works; 1.10.2 does not).

More information on the selector syntax: https://jsoup.org/cookbook/extracting-data/selector-syntax
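
If upgrading is not an option, a similar result can probably be reached without :matchText by walking the body's direct text nodes and removing them. The following is only a minimal sketch: the class name StrayTextRemover and the method bodyTextWithoutStrayText are hypothetical names made up for illustration, and it assumes the stray text ends up as direct child text nodes of the body after parsing, which is how the HTML parser normally handles text that appears outside any tag.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of JSoup.
public class StrayTextRemover {

    // Parses the HTML, drops text nodes that are direct children of <body>,
    // and returns the remaining body text.
    static String bodyTextWithoutStrayText(String html) {
        Document doc = Jsoup.parse(html);
        // Copy into a fresh list so that removing nodes cannot interfere with iteration.
        List<TextNode> strayNodes = new ArrayList<>(doc.body().textNodes());
        for (TextNode node : strayNodes) {
            node.remove();
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample input: text placed after the closing tags, outside any element.
        String html = "<html><body><p>Hello</p></body></html>stray text";
        System.out.println(bodyTextWithoutStrayText(html));
    }
}
```

Text nested inside elements such as p or div is left alone, because only the direct children of the body are touched. When reading from disk, the same loop can be applied to the Document returned by Jsoup.parse(new File("myFile"), "UTF-8", "").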

1 answer. Answer 1 posted 2018-09-27 18:35:45.