Java for finding unclosed HTML tags

Question

How can I find a tag in an HTML string that has no closing tag, and close it?

HTML string with a tag that has no closing tag:

<html> 
    <head> </head> 
    <body> 
        <p style="margin-top: 0"> dasa </p> 
        <input size="1" type="text" value="a"> 
    </body> 
</html>

to

<html> 
    <head> </head> 
    <body> 
        <p style="margin-top: 0"> dasa </p> 
        <input size="1" type="text" value="a"> </input>
    </body> 
</html>

Thanks!

Answer 1

I have two options for you (I like the second one the most).

1. http://home.ccil.org/~cowan/XML/tagsoup

 instead of parsing well-formed or valid XML, 
 parses HTML as it is found in the wild: 
 poor, nasty and brutish, though quite often far from short.
 TagSoup is designed for
 people who have to process this stuff using 
 some semblance of a rational application   
 design. By providing a SAX interface, 
 it allows standard XML tools to be applied to even the
 worst HTML. TagSoup also includes a command-line processor that reads
 HTML files and can generate either clean HTML or well-formed XML 
 that is a close approximation to XHTML.

This is the tool we are using. I mentioned another tool, but I'm not using it.

2. http://htmlcleaner.sourceforge.net/download.php

Just download the jar file and unzip it, then run the jar file as shown below.

Go to that location and run:
java -jar htmlcleaner-2.8.jar src=http://google.com
It will correct the missing tags and print the output.

E.g., I have an HTML file with the following contents:

<table>
<tr>
<td>Wrong Table

It gives output like the following:

C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:RequiredParentMissing(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at table
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tbody
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at tr
Apr 24, 2014 12:23:10 PM org.htmlcleaner.audit.HtmlModificationListenerLogger fireHtmlError
INFO: fireHtmlError:UnclosedTag(true) at td
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body><table>
<tbody><tr>
<td>Wrong Table</td></tr></tbody></table></body></html>

I also tested your HTML. The output is:

C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>java -jar htmlcleaner-2.8.jar src=http://localhost/fun/test.html
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head />
<body>

        <p style="margin-top: 0"> dasa </p>
        <input size="1" type="text" value="a" />

</body></html>
C:\Users\Lasitha Benaragama\Downloads\htmlcleaner-2.8>

Thanks.

Answer 2

You could keep a stack of the tags. As you come across an open tag, push it onto the stack. When you come to a closing tag, pop the top of the stack and make sure it matches the closing tag you are at. If it doesn't, that is a missing tag.
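
A minimal sketch of the stack idea, using only the JDK. The class name UnclosedTagFinder, the method findUnclosedTags and the regular expression are my own illustrations, not part of any library. It assumes there is no '>' inside attribute values and no comments or script content, and it treats every open tag that is not written as <tag ... /> as needing a closing tag. Real HTML has void elements such as input and br that legitimately have no closing tag, so for real-world pages a proper parser is probably the safer choice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnclosedTagFinder {

    // Matches <name ...>, </name> and <name ... />.
    // Simplification (assumption): no '>' inside attribute values,
    // no comments, no script content.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    /** Returns the names of tags that were opened but never closed. */
    public static List<String> findUnclosedTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                if (!open.contains(name)) {
                    continue; // stray closing tag, nothing to match it against
                }
                // Pop until the matching open tag is found; every tag
                // popped on the way was left unclosed.
                String top;
                while (!(top = open.pop()).equals(name)) {
                    unclosed.add(top);
                }
            } else if (!selfClosing) {
                open.push(name);
            }
        }
        // Whatever is still on the stack at the end was never closed.
        while (!open.isEmpty()) {
            unclosed.add(open.pop());
        }
        return unclosed;
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body>"
                + "<p style=\"margin-top: 0\"> dasa </p>"
                + "<input size=\"1\" type=\"text\" value=\"a\">"
                + "</body></html>";
        System.out.println(findUnclosedTags(html)); // prints [input]
    }
}
```

Once you know which tag is unclosed and where its parent closes, you can insert the closing tag at that position. Stray closing tags with no matching open tag are simply skipped here.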

Answer 3

The code below works perfectly for me:

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import org.ccil.cowan.tagsoup.Parser;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import org.xml.sax.SAXException;

public class EmailUtil {

    // Parses possibly malformed HTML with TagSoup and returns it as well-formed XML.
    public static String getValidHtml(String html) throws SAXException, DocumentException, IOException {
        // Use TagSoup as the SAX parser so unclosed tags are repaired while parsing.
        SAXReader reader = new SAXReader(Parser.class.getName());
        // Read from a character stream so the platform default charset is not involved.
        Document doc = reader.read(new StringReader(html));
        StringWriter out = new StringWriter();
        XMLWriter writer = new XMLWriter(out);
        writer.write(doc);
        writer.flush();
        return out.toString();
    }
}

3 answers

Answer 1: 3, accepted, 2014-04-24 05:39:02
Answer 2: 0, 2014-04-24 05:31:39
Answer 3: 0, 2019-08-29 22:55:40
