
Can't remove all occurrences of an HTML tag in Java using pattern matching

I have a very long HTML string that contains multiple blocks like this:

             <dl id="divmap"> .... </dl>.

I want to remove all content between these tags.

I wrote this code in Java:

                String triphtml = htmlString;
                System.out.println("triphtml is "+triphtml);

                System.out.println("test1 ");
                final Pattern pattern = Pattern.compile("(<dl id=\""+selectedArray[i]+"\">)(.+?)(</dl>)",
                        Pattern.DOTALL);
                final Matcher matcher = pattern.matcher(triphtml);
                // matcher.find();
                System.out.println("pattern of test1 is : "
                        + pattern); // Prints
                System.out.println("MATCHER of test1 is : "
                        + matcher); // Prints
                System.out.println("MATCH COUNT of test1 a: "
                        + matcher.groupCount()); // Prints
                System.out.println("MATCH COUNT of test1  a: "
                        + matcher.find()); // Prints
                while (matcher.find()) {
                    // System.out.println("MATCH GP 3: "+matcher.group(3).substring(1,10));

                    for (int z = 0; z <= matcher.groupCount(); z++) {
                        String extstr = matcher.group(z);
                        System.out.println("matcher group of "+z+" test1  is " + extstr);
                        System.out.println("ext a of test1  is " + extstr);
                        triphtml = triphtml.replaceAll(extstr, "");
                        System.out.println("Group found of test1 is :\n" + extstr);
                    }

                }

But this code removes only some of the dl elements, while others remain in triphtml. I don't know why this is happening. Here triphtml is an HTML string that contains multiple dl elements. Please help me remove the content between all occurrences of

    <dl id="divmap">.

Thanks in advance.

I suggest NOT using regex for HTML. Use a library designed for traversing XML/HTML instead.

For example, JSoup.

Using regex, you can do the following:

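String orgString = "<dl id=\"divmap\"> .... </dl>";

// Remove the HTML tags, leaving only the text between them
orgString = orgString.replaceAll("<[^>]*>", "");

// Remove the remaining text; replace() treats its argument as literal text,
// so regex metacharacters in the content are not misinterpreted
orgString = orgString.replace(orgString.replaceAll("<[^>]*>", ""), "");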
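To make the effect of the tag pattern concrete, here is a small self-contained sketch. The class and method names (`StripTagsDemo`, `stripTags`) are made up for illustration and are not part of the original answer. It shows that `<[^>]*>` on its own strips only the tags and leaves the text between them in place, which is why a second step is needed if the content must go too.

```java
class StripTagsDemo {

    // Hypothetical helper: removes anything that looks like an HTML tag,
    // but keeps the text that sits between the tags.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<dl id=\"divmap\"> Content </dl>";
        // The tags are gone, the text " Content " is still there
        System.out.println("[" + stripTags(html) + "]");
    }
}
```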

However, it is better to use an HTML parser.

Edit:

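import java.util.regex.Matcher;
import java.util.regex.Pattern;

String htmlString = "<dl id=\"divmap\"> Content </dl>";
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(htmlString);
// Find each tag and remove it; replace() treats the tag as literal text, not as a regex
while (m.find()) {
    htmlString = htmlString.replace(m.group(), "");
}
System.out.println("Ans" + htmlString);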
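If you do stay with regex for the exact case in the question (dropping every `<dl id="...">` block together with its content), a minimal sketch could look like the following. Two things in the question's code are worth noting: the `matcher.find()` call inside the `println` consumes the first match before the `while` loop starts, and `String.replaceAll` interprets its first argument as a regex, so matched text containing metacharacters may not be replaced as expected. The sketch sidesteps both by letting the matcher do the replacement itself. The names `RemoveDlBlocks` and `removeDlBlocks` are hypothetical, and the code assumes the `dl` elements are not nested and that the opening tag is written exactly as `<dl id="...">`.

```java
import java.util.regex.Pattern;

class RemoveDlBlocks {

    // Hypothetical helper: removes every <dl id="ID"> ... </dl> block.
    // Assumption: dl elements are not nested and carry no other attributes.
    static String removeDlBlocks(String html, String id) {
        Pattern pattern = Pattern.compile(
                "<dl id=\"" + Pattern.quote(id) + "\">.*?</dl>",
                Pattern.DOTALL); // DOTALL lets '.' match line breaks inside a block
        // Matcher.replaceAll replaces all matches in a single pass
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><dl id=\"divmap\"> one </dl><p>b</p>"
                + "<dl id=\"divmap\">\n two \n</dl><p>c</p>";
        System.out.println(removeDlBlocks(html, "divmap"));
    }
}
```

`Pattern.quote` keeps the id from being read as regex syntax, and the reluctant `.*?` stops at the first `</dl>` so that separate blocks are not merged into one match.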

Try using JSoup.

It uses selectors and a syntax like jQuery, and it is very easy to use.

You can try this:

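import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String triphtml = htmlString;
// Parse the HTML and select the elements with id "divmap"
Document doc = Jsoup.parse(htmlString);
Elements divmaps = doc.select("#divmap");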

Then you can remove (or alter) the elements in the DOM:

divmaps.remove();
triphtml = doc.html();
