How to change HTML tag content in Java?

How can I change the HTML content of a tag in Java? For example:

Before:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**text**</div>text</div>
    </body>
</html>

After:

<html>
    <head>
    </head>
    <body>
        <div>text<div>**new text**</div>text</div>
    </body>
</html>

I tried JTidy, but it doesn't support getTextContent. Is there any other solution?


Thanks. I want to parse HTML that is not well-formed. I tried TagSoup, but when I have this code:

<body>
sometext <div>text</div>
</body>

and I want to change "sometext" to "someAnotherText". When I use {bodyNode}.getTextContent(), it gives me "sometext text". When I use setTextContent("someAnotherText" + {bodyNode}.getTextContent()) and serialize the structure, the result is <body>someAnotherText sometext text</body>, without the <div> tags. This is a problem for me.
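The behaviour described above is how DOM's setTextContent is specified: it replaces all children of the node with a single text node, which is why the <div> disappears. A minimal sketch of editing only the direct text children instead, using the JDK's built-in DOM parser on a well-formed fragment. The class and method names are hypothetical, and malformed input is assumed to have already been cleaned up by a tolerant parser such as TagSoup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
final class FirstTextNodeEdit {

    // Replaces oldText with newText in the root element's direct text children only,
    // leaving child elements (and their text) untouched.
    static String replaceDirectText(String xml, String oldText, String newText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node root = doc.getDocumentElement();
            for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.TEXT_NODE) {
                    child.setNodeValue(child.getNodeValue().replace(oldText, newText));
                }
            }
            // Serialize the modified tree back to a string.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize the fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceDirectText(
                "<body>sometext <div>text</div></body>", "sometext", "someAnotherText"));
    }
}
```

The point of the sketch is that the text node is modified through setNodeValue, so its sibling <div> element is never removed from the tree.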

Unless you are absolutely sure that the HTML will be valid and well-formed, I'd strongly recommend using an HTML parser such as TagSoup, Jericho, NekoHTML, HTML Parser, etc., the first two being especially good at parsing any kind of crap :)

For example, with HTML Parser (because the implementation is very easy), you can use a visitor by providing your own NodeVisitor:

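import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

public class MyNodeVisitor extends NodeVisitor {
    // Called for each text node encountered while visiting the parsed document.
    @Override
    public void visitStringNode(Text string) {
        if (string.getText().equals("**text**")) {
            string.setText("**new text**");
        }
    }
}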

Then create a Parser, parse the HTML string, and visit the returned node list:

Parser parser = new Parser(htmlString);
NodeList nl = parser.parse(null);
nl.visitAllNodesWith(new MyNodeVisitor());
System.out.println(nl.toHtml());

This is just one way to implement this, and it is pretty straightforward.

Provided that your HTML is well-formed XML (if it is not, you can use JTidy to tidy it), you can parse it with a DOM or SAX parser. DOM is probably easier if your document is not huge.

Something like this will do the trick if your text is the only child of a node with id="id":

Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
Element e = d.getElementById("id");
Node text = e.getFirstChild();
text.setNodeValue(process(text.getNodeValue()));

You can save d to a file afterwards.
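One caveat with the DOM approach: in the JDK's DOM implementation, getElementById only matches attributes that a DTD or schema declares as type ID, so for a plain id="..." attribute it usually returns null. A minimal sketch that looks the element up with XPath instead and then serializes the document with a Transformer. The class name, method name, and sample markup are assumptions made for illustration, and the text is assumed to be the element's only child, as in the answer above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
final class DomIdEdit {

    // Sets the text of the element whose id attribute equals the given id
    // and returns the serialized document.
    // Assumes the id contains no single quote and the element's first child is its text.
    static String setTextById(String xml, String id, String newText) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // XPath lookup works without a DTD, unlike Document.getElementById.
            Element e = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@id='" + id + "']", d, XPathConstants.NODE);
            if (e == null || e.getFirstChild() == null) {
                throw new IllegalArgumentException("No element with text found for id " + id);
            }
            e.getFirstChild().setNodeValue(newText);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(d), new StreamResult(out));
            return out.toString();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or serialize the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(setTextById(
                "<html><body><p id=\"content\">Hello World</p></body></html>",
                "content", "Hi"));
    }
}
```

To write to a file rather than a string, the same transform call can be given new StreamResult(new java.io.File(...)) as its target.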

There are a bunch of open-source Java HTML parsers listed here.

I'm not sure which is most commonly used, but this one (just called HTML Parser) will probably do what you want. It has functions to modify your tree and write it back out.

In general, you have an HTML document that you want to extract data from, and you generally know its structure.

There are several parser libraries, but the one I'd recommend is Jsoup. You can use its DOM methods to navigate your document and update values. In your case, you need to read your file and use the element's setter methods, such as text(String).

Sample XHTML file:

<?xml version="1.0" encoding="UTF-8"?>
<!--
To change this license header, choose License Headers in Project Properties.
To change this template file, choose Tools | Templates
and open the template in the editor.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>Example</title>
    </head>
    <body>
        <p id="content">Hello World</p>

    </body>
</html>

Java code:

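// Requires: import java.io.File; import org.jsoup.Jsoup;
File input = new File("D:\\Projects\\Odata Project\\Odata\\src\\web\\html\\inscription_template.xhtml");
// Passing null as the charset lets Jsoup work out the encoding itself.
org.jsoup.nodes.Document doc = Jsoup.parse(input, null);
org.jsoup.nodes.Element content = doc.getElementById("content");
// text(String) replaces the element's text and returns the element.
System.out.println(content.text("Hi How are you ?"));
System.out.println(content.text());
System.out.println(doc);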

Output after execution:

<p id="content">Hi How are you ?</p>
Hi How are you ?
<!--?xml version="1.0" encoding="UTF-8"?-->
<!--
To change this license header, choose License Headers in Project Properties.
To change this template file, choose Tools | Templates
and open the template in the editor.
--><!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <title>Example</title> 
 </head> 
 <body> 
  <p id="content">Hi How are you ?</p>   
 </body>
</html>

