Parsing HTML with jsoup and removing a tag block

Question

I want to remove a tag together with everything inside it. An example input may be:

Input:

<body>
  start
  <div>
    delete from below
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

The output will be:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Basically, I have to remove the entire block for the first occurrence of <div class="XYZ">.

Thanks,

Answer 1

You had better iterate over all the elements found, so you can be sure that

a.) all elements are removed, and
b.) nothing is done if there is no element.

Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}
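
For reference, the loop above could be wrapped into a small self-contained program along these lines. This is only a sketch: it assumes jsoup is on the classpath, and the class name RemoveAllXyz, the method removeAll, and the inline HTML string are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class RemoveAllXyz {
    // Parses the HTML, removes every <div class="XYZ"> block, and returns the body markup.
    static String removeAll(String html) {
        Document doc = Jsoup.parse(html);
        // select() returns an empty list when nothing matches, so the loop simply does nothing.
        for (Element element : doc.select("div.XYZ")) {
            element.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Illustrative input only; a nested div.XYZ disappears together with its parent.
        String html = "<div>keep<div class=\"XYZ\">drop<div class=\"XYZ\">nested</div></div></div>";
        System.out.println(removeAll(html));
    }
}
```

Note that this removes every matching block, not only the first one.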

Edit:

(An addition to my comment)

Don't use exception handling when a simple null / range check is enough here:

doc.select("div.XYZ").first().remove();

Instead:

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
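
Since the question asks for the first occurrence only, the check above might be put together as follows. Again a sketch under assumptions: jsoup on the classpath, with RemoveFirstXyz and removeFirst as hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RemoveFirstXyz {
    // Removes only the first <div class="XYZ"> block (including its children), if there is one.
    static String removeFirst(String html) {
        Document doc = Jsoup.parse(html);
        Elements divs = doc.select("div.XYZ");
        if (!divs.isEmpty()) {
            // Safe: the list holds at least one element, so first() is not null.
            divs.first().remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Illustrative input: two sibling blocks, only the first is removed.
        String html = "<div class=\"XYZ\">one</div><div class=\"XYZ\">two</div>";
        System.out.println(removeFirst(html));
    }
}
```

With an input that contains no matching div, the method returns the parsed body unchanged and no exception handling is needed.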

Answer 2

Try this code:

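String data = null;
BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"));
StringBuilder builder = new StringBuilder();
// Read the file line by line; line breaks are dropped, so the regex below sees one long line.
while ((data = br.readLine()) != null) {
    builder.append(data);
}
br.close();
System.out.println(builder);
// Non-greedy match: it ends at the first </div>, so a nested <div> cuts the match short.
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);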

I read the input XML from a file into a StringBuilder object line by line, and then replaced the entire <div class="XYZ"> tag with an empty string.
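
To see how this regular expression behaves on nested markup like the question's input, here is a small plain-Java sketch. It uses the same pattern as above; the class name RegexNestedDivDemo, the method stripWithRegex, and the shortened one-line input are illustrative assumptions.

```java
class RegexNestedDivDemo {
    // Applies the same non-greedy pattern as the answer above to an HTML string.
    static String stripWithRegex(String html) {
        return html.replaceAll("<div class=\"XYZ\".+?</div>", "");
    }

    public static void main(String[] args) {
        // A condensed version of the question's structure: an inner <div> and a nested div.XYZ.
        String html = "<div>a<div class=\"XYZ\">first<div>waste</div>"
                + "<div class=\"XYZ\">second</div>waste</div>b</div>";
        System.out.println(stripWithRegex(html));
    }
}
```

On this input the first match ends at the inner </div>, so the trailing "waste" text and an unbalanced </div> stay behind. The regex approach therefore only works when the block to remove contains no nested div.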

Answer 3

This may help you.

 String selectTags="div,li,p,ul,ol,span,table,tr,td,address,em";
 /*selecting some specific tags */
 Elements webContentElements = parsedDoc.select(selectTags); 
 String removeTags = "img,a,form"; 
 /*Removing some tags from selected elements*/
 webContentElements.select(removeTags).remove();

Answer 4

I asked this question yesterday and, thanks to ollo's answer, it was solved. There is an extension of the above problem. I did not know whether I should start a new post or chain it to this one, so in this confusion I am chaining it here. Admins, please pardon me if I should have made a separate post for this.

In the above problem, I have to remove a tag block with a matching component.

The real scenario is: it should remove the tag block with the matching component and also remove the <br /> elements surrounding it.

Referring to the above example:

<body>
  start
  <div>
    delete from below
    <br />
    <br />
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    <br />
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

should also give the same output:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Because it has <br /> above and below the HTML tag block to remove.

Just to reiterate, I am using the solution given by ollo to match and remove the tag block.

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Thanks, Shekhar
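
One possible way to approach this extension, offered as a sketch rather than a confirmed answer from the thread: before removing the matched block, walk its siblings in both directions, removing <br> elements and skipping whitespace-only text until something else is reached. It assumes jsoup on the classpath; RemoveBlockWithBreaks, removeAdjacentBreaks, and removeFirstBlock are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class RemoveBlockWithBreaks {
    // Removes <br> siblings next to 'start' in one direction, skipping whitespace-only
    // text nodes, and stops at the first sibling that is neither of those.
    static void removeAdjacentBreaks(Node start, boolean forward) {
        Node current = forward ? start.nextSibling() : start.previousSibling();
        while (current != null) {
            // Fetch the following sibling first, because remove() detaches 'current'.
            Node following = forward ? current.nextSibling() : current.previousSibling();
            boolean isBreak = current instanceof Element && "br".equals(current.nodeName());
            boolean isBlankText = current instanceof TextNode && ((TextNode) current).isBlank();
            if (isBreak) {
                current.remove();
            } else if (!isBlankText) {
                break; // real content reached: leave it and everything beyond it alone
            }
            current = following;
        }
    }

    // Removes the first <div class="XYZ"> block together with the <br> elements around it.
    static Document removeFirstBlock(String html) {
        Document doc = Jsoup.parse(html);
        Element block = doc.select("div.XYZ").first(); // null when nothing matches
        if (block != null) {
            removeAdjacentBreaks(block, false);
            removeAdjacentBreaks(block, true);
            block.remove();
        }
        return doc;
    }

    public static void main(String[] args) {
        // Illustrative input modelled on the example above.
        String html = "<div>delete from below<br /><br /><div class=\"XYZ\">x</div>"
                + "<br />delete till above</div>";
        System.out.println(removeFirstBlock(html).body().html());
    }
}
```

Only <br> elements directly adjacent to the block are touched; a <br> separated from it by real text or another element is left in place.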

4 answers

Answer 1: score 15, accepted, 2013-04-03 19:18:12

Answer 2: score 1, 2013-04-03 19:12:04

Answer 3: score 1, 2017-10-19 10:34:14

Answer 4: score 0, 2013-04-05 18:25:39
