
How to split a single HTML file into multiple HTML files using Java

I want to split a single HTML file into multiple HTML files using Java. The file contains several chapters of a textbook, and I want each chapter in its own HTML file. The start of each chapter can be identified by an h2 tag with an id. Below is a sample of the HTML file that I want to split.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 7 December 2008), see www.w3.org"/>
<title>Sample HTML</title>
<link rel="stylesheet" href="0.css" type="text/css"/>
<link rel="stylesheet" href="1.css" type="text/css"/>
<link rel="stylesheet" href="sample.css" type="text/css"/>
<meta name="generator" content="sample content"/>
</head>
<body>
<div class="c2"><br/> <br/> <br/> <br/></div>
<h2 id="pg00007">Chapter 7</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0008"><!-- H2 anchor --></a></p>
<div class="c2"><br/> <br/> <br/> <br/></div>
<h2 id="pg00008">Chapter 8</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0009"><!-- H2 anchor --></a></p>
<div class="c2"><br/> <br/> <br/> <br/></div>
<h2 id="pg00009">Chapter 9</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0010"><!-- H2 anchor --></a></p>
<div class="c2"><br/> <br/> <br/> <br/></div>
<h2 id="pg00010">Chapter 10</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0011"><!-- H2 anchor --></a></p>
</body>
</html>

I'm not entirely sure this will work, but I guess you could take a parser such as jsoup (http://jsoup.org/) and use it as follows:

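import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements chapters = doc.select("h2"); // one element per chapter heading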

You then have to extract the content that belongs to each element and persist it as a new HTML file (including body, etc.).
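The persistence step described above could be sketched roughly as follows. This is only an illustration that uses plain JDK classes: the class `ChapterWriter` and its methods `toFileName`, `wrapChapter` and `writeChapter` are hypothetical names that do not appear in the question or the answers, and the sketch assumes the head markup and the chapter markup have already been extracted as strings (for example with jsoup's `outerHtml()`).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of jsoup or of the original answers.
public class ChapterWriter {

    // Turns a heading such as "Chapter 7" into a file-system-friendly name.
    public static String toFileName(String title) {
        String safe = title.trim().replaceAll("[^A-Za-z0-9._-]+", "_");
        return (safe.isEmpty() ? "chapter" : safe) + ".html";
    }

    // Wraps a chapter fragment in a complete document, reusing the original <head> markup.
    public static String wrapChapter(String headHtml, String chapterHtml) {
        return "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">\n"
                + headHtml + "\n<body>\n" + chapterHtml + "\n</body>\n</html>\n";
    }

    // Writes one chapter as UTF-8 into outputDir and returns the path of the new file.
    public static Path writeChapter(Path outputDir, String title, String headHtml,
            String chapterHtml) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(toFileName(title));
        byte[] bytes = wrapChapter(headHtml, chapterHtml).getBytes(StandardCharsets.UTF_8);
        Files.write(target, bytes);
        return target;
    }
}
```

Deriving the file name through a whitelist of characters is a deliberate choice here: heading text can contain slashes, colons or other characters that are not valid in file names on every platform.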

I finally got it working. Here is the solution that splits the HTML the way I described in the question:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class App {

    public static void JsoupReader() {
        File input = new File("src/resources/sample_book.htm.html");
        try {
            Document doc = Jsoup.parse(input, "UTF-8");
            Element head = doc.select("head").first();
            Element firstH2 = doc.select("h2").first();
            if (firstH2 == null) {
                return; // no chapter headings, nothing to split
            }
            // The heading is used only for the file name; text() keeps markup out of it.
            String h2Text = firstH2.text();
            List<Element> elementsBetween = new ArrayList<Element>();

            // Walk the siblings that follow the first h2; each further h2 closes the current chapter.
            for (Element sibling = firstH2.nextElementSibling(); sibling != null;
                    sibling = sibling.nextElementSibling()) {
                if (!"h2".equals(sibling.tagName())) {
                    elementsBetween.add(sibling);
                } else {
                    processElementsBetween(h2Text, head, elementsBetween);
                    elementsBetween.clear();
                    h2Text = sibling.text();
                }
            }

            // Write the last chapter, which is not followed by another h2.
            if (!elementsBetween.isEmpty()) {
                processElementsBetween(h2Text, head, elementsBetween);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Writes one chapter to src/resources/<heading text>.html, reusing the original head.
    private static void processElementsBetween(String h2Text, Element head,
            List<Element> elementsBetween) throws IOException {
        File newHtmlFile = new File("src/resources/" + h2Text + ".html");
        StringBuilder htmlString = new StringBuilder();
        htmlString.append("<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">");
        htmlString.append(head);
        htmlString.append("<body>"
                + "<div class=\"c2\">"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "</div>");
        for (Element element : elementsBetween) {
            htmlString.append(element.outerHtml());
        }
        htmlString.append("</body></html>");
        FileUtils.writeStringToFile(newHtmlFile, htmlString.toString(), "UTF-8");
    }
}

Thanks to uniknow for the help and to realskeptic for the criticism.
