Parse HTML with jsoup and remove the tag block

I want to remove a tag together with everything inside it. An example input might be:

Input:

<body>
  start
  <div>
    delete from below
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

The output will be:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Basically, I have to remove the entire block for the first occurrence of <div class="XYZ">.

Thanks,

You'd better iterate over all the elements found, so you can be sure that

  • a.) all elements are removed, and
  • b.) nothing is done if there is no element.

Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Edit:

(An addition to my comment)

Don't use exception handling when a simple null / range check is enough here:

doc.select("div.XYZ").first().remove();

Instead:

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
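
As a rough sketch of the "first occurrence only" case from the question, the check above can be wrapped in a small helper. This assumes jsoup is on the classpath; the class name RemoveFirstMatch, the helper removeFirst, and the compacted sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

final class RemoveFirstMatch {

    /**
     * Removes the first element matching cssQuery, including all of its children.
     * Returns false (and changes nothing) when there is no match.
     */
    static boolean removeFirst(Document doc, String cssQuery) {
        Elements matches = doc.select(cssQuery);
        if (matches.isEmpty()) {
            return false;
        }
        // Matches come back in document order, so first() is the outermost
        // div.XYZ here; the nested one disappears together with it.
        matches.first().remove();
        return true;
    }

    public static void main(String[] args) {
        String html = "<body>start<div>delete from below"
                + "<div class=\"XYZ\">first<div>waste</div>"
                + "<div class=\"XYZ\">second</div>waste</div>"
                + "delete till above</div>"
                + "<div>this will also remain</div>end</body>";
        Document doc = Jsoup.parse(html);
        removeFirst(doc, "div.XYZ");
        System.out.println(doc.body().html());
    }
}
```

Returning a boolean instead of throwing keeps the "nothing found" case explicit without any exception handling.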

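Try this code:

// Read the whole file into one string. readLine() drops the line breaks,
// so the regex below can match across what were separate lines.
StringBuilder builder = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"))) {
    String data;
    while ((data = br.readLine()) != null) {
        builder.append(data);
    }
}
System.out.println(builder);
// Caution: the non-greedy ".+?" stops at the first "</div>", so a block
// that contains nested <div> elements is only partly removed.
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);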

I have read the input XML from a file and stored it in a StringBuilder object by reading it line by line, and then replaced the entire tag with an empty string.

This may help you.
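
One thing worth checking before relying on the regex approach is how it behaves on the nested input from the question. The sketch below uses only the Java standard library; the class name RegexNestedDivDemo and the compacted sample string are made up for illustration.

```java
final class RegexNestedDivDemo {

    /** Applies the same pattern as above to an HTML string. */
    static String stripWithRegex(String html) {
        return html.replaceAll("<div class=\"XYZ\".+?</div>", "");
    }

    public static void main(String[] args) {
        String html = "<div>keep"
                + "<div class=\"XYZ\">first<div>waste</div>"
                + "<div class=\"XYZ\">second</div>waste</div>"
                + "keep</div>";
        // The lazy ".+?" ends each match at the nearest "</div>", so the
        // trailing "waste</div>" of the outer block is left behind.
        System.out.println(stripWithRegex(html));
        // prints: <div>keepwaste</div>keep</div>
    }
}
```

The leftover text and the unbalanced closing tag show why a parser-based removal is the safer choice when the block can contain other div elements.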

 String selectTags="div,li,p,ul,ol,span,table,tr,td,address,em";
 /*selecting some specific tags */
 Elements webContentElements = parsedDoc.select(selectTags); 
 String removeTags = "img,a,form"; 
 /*Removing some tags from selected elements*/
 webContentElements.select(removeTags).remove();

I asked this question yesterday and, thanks to ollo's answer, it was solved. There is an extension of the above problem. I did not know whether I should start a new post or chain onto this one, so in this confusion I am chaining it here. Admins, please pardon me if I should have made a separate post for this.

In the above problem, I have to remove a tag block with a matching component.

The real scenario is: it should remove the tag block with the matching component and also remove the <br /> tags surrounding it.

Referring to the above example:

<body>
  start
  <div>
    delete from below
    <br />
    <br />
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    <br />
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

should also give the same output:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>

Because it has <br /> above and below the HTML tag block to be removed.

Just to reiterate, I am using the solution given by ollo to match and remove the tag block.

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}

Thanks, Shekhar
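
One possible way to handle the surrounding <br /> tags, sketched under some assumptions: jsoup is on the classpath, only the first match is removed, and "surrounding" means <br> elements separated from the block by nothing but whitespace. The class and method names below are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

final class RemoveBlockWithBreaks {

    /** Removes the first match of cssQuery plus the <br> elements directly around it. */
    static void removeFirstWithSurroundingBreaks(Document doc, String cssQuery) {
        Element target = doc.select(cssQuery).first();
        if (target == null) {
            return; // nothing matched, nothing to do
        }
        removeAdjacentBreaks(target, true);
        removeAdjacentBreaks(target, false);
        target.remove();
    }

    /** Walks away from target in one direction, removing <br> until real content is hit. */
    private static void removeAdjacentBreaks(Element target, boolean before) {
        Node sibling = before ? target.previousSibling() : target.nextSibling();
        while (sibling != null) {
            // Look up the following node first, because remove() detaches sibling.
            Node next = before ? sibling.previousSibling() : sibling.nextSibling();
            if (sibling instanceof Element && "br".equals(((Element) sibling).tagName())) {
                sibling.remove();
            } else if (!(sibling instanceof TextNode && ((TextNode) sibling).isBlank())) {
                break; // non-blank text or another element: stop here
            }
            sibling = next;
        }
    }

    public static void main(String[] args) {
        String html = "<div>delete from below<br /> <br /> "
                + "<div class=\"XYZ\">first<div>waste</div></div>"
                + " <br />delete till above</div>";
        Document doc = Jsoup.parse(html);
        removeFirstWithSurroundingBreaks(doc, "div.XYZ");
        System.out.println(doc.body().html());
    }
}
```

Walking the node siblings rather than the element siblings means a <br /> that is separated from the block by actual text is left alone.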
