Jsoup: filtering out only some tags when converting HTML to text

Question

Can any jsoup expert suggest how to filter HTML down to text? I've tried calling text() on the Document, but that strips all tags/elements. My aim is to strip only certain specified tags.

For example, I have HTML like this:

<div>hello<p>world</div>,<table><tr><td>xxx</td></tr>

and I want to get this result:

<div>hello<p>world</div>,xxx

in which only the table-related tags have been removed.

Answer 1

I can't test this right now, but I think you want to write a recursive function that walks the tree and prints each node based on a condition. The following is an example of what it might look like; I expect you will have to modify it to suit your needs more precisely.

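import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(page_text);
recursive_print(doc.head());
recursive_print(doc.body());

...

// Tag names whose elements should not be printed
private static Set<String> ignore = new HashSet<String>(){{
  add("table");
  ...
}};
public static void recursive_print(Element el){
   // tagName() returns the tag (e.g. "table"); className() would return the class attribute
   if(!ignore.contains(el.tagName()))
     System.out.println(el.html());  // html() is the inner HTML, so child markup is included
   for(Element child : el.children())
     recursive_print(child);
}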
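
A different way to get the result asked for in the question is to unwrap the unwanted elements instead of printing during a recursive walk. This is only a sketch, not the answerer's method: the class name UnwrapTags and the helper unwrapTags are made up for illustration, and it assumes jsoup is on the classpath. Element unwrapping removes a tag but leaves its children and text in place.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UnwrapTags {
    // Hypothetical helper: removes the tags matched by the CSS selector
    // while keeping their child nodes and text in the document.
    static String unwrapTags(String html, String selector) {
        Document doc = Jsoup.parseBodyFragment(html);
        // Turn off pretty-printing so no line breaks or indentation are added
        doc.outputSettings().prettyPrint(false);
        doc.select(selector).unwrap();
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div>hello<p>world</div>,<table><tr><td>xxx</td></tr>";
        // tbody is listed too because the HTML parser inserts it inside table
        System.out.println(unwrapTags(html, "table, tbody, tr, td"));
    }
}
```

Note that jsoup normalizes the markup when it serializes it, so the unclosed p from the input comes back with an explicit closing tag.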

Answer 2

You can use a Whitelist to achieve this. For example:

Whitelist whiteList = new Whitelist();
whiteList.addTags("div", "p", "td");
String cleaned = Jsoup.clean(html, whiteList);  // html is assumed to hold the input HTML string

This means that all other tags will be removed, while the text inside them is kept.
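
For completeness, here is a minimal sketch of the allow-list approach as a full program. It assumes a recent jsoup release (1.14 or later), where the class is named Safelist; on older releases the same calls should work with Whitelist. The class name KeepListedTags and the helper keepTags are hypothetical, and the allowed tags are chosen to match the result in the question rather than the list above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class KeepListedTags {
    // Hypothetical helper: keeps only the listed tags and drops every other
    // tag, while the text inside dropped tags is preserved.
    static String keepTags(String html, String... tags) {
        Safelist safelist = new Safelist().addTags(tags);
        // Turn off pretty-printing so no line breaks or indentation are added
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", safelist, settings);
    }

    public static void main(String[] args) {
        String html = "<div>hello<p>world</div>,<table><tr><td>xxx</td></tr>";
        // Only div and p are allowed, so table, tr and td are stripped
        System.out.println(keepTags(html, "div", "p"));
    }
}
```

Because no attributes are added to the allow-list, any attributes on the kept tags are removed as well; addAttributes would be needed to keep them.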

Answer 1: 0, accepted, 2013-07-07 19:15:54

Answer 2: 0, 2018-05-17 19:25:21