
Jsoup: filter out only some tags when converting HTML to text

Can anyone experienced with jsoup suggest how to convert HTML to text while stripping only certain tags? I've tried calling text() on the Document, but that removes every tag and element. My goal is to remove only some specified tags.

For example, given HTML like:

<div>hello<p>world</div>,<table><tr><td>xxx</td></tr>

I'd like to get this result:

<div>hello<p>world</div>,xxx 

where only the table-related tags (table, tr, td) have been removed.

I can't test this right now, but I think you want a recursive function that walks the tree and prints each node based on a condition. The following is an example of what it might look like; I expect you will have to modify it to fit your needs more precisely.

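import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(page_text);
recursive_print(doc.head());
recursive_print(doc.body());

...

// Tag names whose elements should be skipped
private static Set<String> ignore = new HashSet<String>(){{
  add("table");
  ...
}};

// Walks the tree depth-first and prints every element whose tag is not ignored.
// Note: html() returns the inner HTML including descendants, so nested
// content is printed again when the recursion reaches the children.
public static void recursive_print(Element el){
   if(!ignore.contains(el.tagName()))
     System.out.println(el.html());
   for(Element child : el.children())
     recursive_print(child);
}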
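
As a rough illustration of the recursive idea above, here is a minimal, self-contained sketch that does not use jsoup. The Node class, the IGNORE set, and the render, filter, and demo methods are all hypothetical names made up for this example; Node merely stands in for a parsed element, and the tree is built by hand rather than parsed. Instead of printing each element, the sketch rebuilds the markup and leaves out only the tags of ignored elements while still descending into their children, so their text survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TagFilterSketch {
    /** Hypothetical stand-in for a parsed node; tag == null marks a text node. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        private Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        static Node text(String text) {
            return new Node(null, text);
        }

        static Node element(String tag, Node... kids) {
            Node n = new Node(tag, null);
            n.children.addAll(Arrays.asList(kids));
            return n;
        }
    }

    /** Tags whose markup is dropped while their content is kept (assumed list). */
    static final Set<String> IGNORE = new HashSet<>(Arrays.asList("table", "tr", "td"));

    /** Appends the node to out, omitting the open/close tags of ignored elements. */
    static void render(Node node, StringBuilder out) {
        if (node.tag == null) {
            out.append(node.text);   // text node: emitted as-is, no escaping
            return;
        }
        boolean keep = !IGNORE.contains(node.tag);
        if (keep) out.append('<').append(node.tag).append('>');
        for (Node child : node.children) {
            render(child, out);      // always recurse so nested text survives
        }
        if (keep) out.append("</").append(node.tag).append('>');
    }

    /** Renders a sequence of sibling nodes into one string. */
    static String filter(Node... roots) {
        StringBuilder out = new StringBuilder();
        for (Node root : roots) render(root, out);
        return out.toString();
    }

    /** Builds a hand-made tree for the fragment from the question and filters it. */
    static String demo() {
        return filter(
            Node.element("div", Node.text("hello"), Node.element("p", Node.text("world"))),
            Node.text(","),
            Node.element("table", Node.element("tr", Node.element("td", Node.text("xxx")))));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For the fragment from the question this prints <div>hello<p>world</p></div>,xxx (the p element is closed explicitly because the tree is serialized rather than copied verbatim). With a real jsoup document, selecting the unwanted elements and calling unwrap() on them should have a similar effect, though that is not tested here.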

You can use a Whitelist to achieve this. For example:

Whitelist whiteList = new Whitelist();
whiteList.addTags("div", "p", "td");
// html is assumed to hold the input markup
String cleaned = Jsoup.clean(html, whiteList);

This means that every tag not in the list is removed, while its text content is kept. (In newer jsoup releases this class is named Safelist.)
