How to extract body contents from an HTML file with more than one html tag using jsoup

I need to parse an HTML file that contains more than one html tag with jsoup.

I split the document into its html elements, and I am able to extract some tags, such as title:

Document doc = Jsoup.parse(file, "UTF-8");
Elements el = doc.getElementsByTag("html");
for (Element e : el) {
   writer = new PrintWriter(output);
   writer.println(e.select("title"));
   writer.println(e.select("body"));
   writer.close();
}

Output:

<title>titletext</title>

but it seems to ignore the existence of the body tag in every element.

Using Document.body() just returns the contents of all the body tags merged together.

Since I can't get a Document from each Element to call body() on, how can I extract the body tag from each Element separately?

Assuming you have a file like this:

<!DOCTYPE html>
<html>
<head>
<title>Page Title 1</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 1.</p>
</body>
</html> 

<html>
<head>
<title>Page Title 2</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 2.</p>
</body>
</html> 

<html>
<head>
<title>Page Title 3</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 3.</p>
</body>
</html> 

you can split the file after each closing </html> tag and parse each part separately. Example:

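import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        // Read the whole file as UTF-8 (the encoding used in the question).
        String multihtml = new String(
                Files.readAllBytes(Paths.get("C:\\users\\manna\\desktop\\test.html")),
                StandardCharsets.UTF_8);

        // Split right after each closing </html> tag; the lookbehind keeps the tag in its part.
        String[] htmlParts = multihtml.split("(?<=</html>)");

        for (String part : htmlParts) {
            // Skip the whitespace-only remainder after the last </html>.
            if (part.trim().isEmpty()) {
                continue;
            }
            // Each part holds a single html element, so title() and body() refer to that part only.
            Document doc = Jsoup.parse(part);
            System.out.println();
            System.out.println("title : " + doc.title());
            System.out.println();
            System.out.println(doc.body().outerHtml());
            System.out.println("******************************************");
        }
    }
}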
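
The splitting step on its own may be easier to follow without jsoup. The sketch below is only an illustration: the class name HtmlSplitDemo and the helper splitHtmlParts are hypothetical and not part of jsoup. It assumes the closing tag is written as </html> with no inner whitespace; as an addition to the example above, it matches the tag case-insensitively and drops whitespace-only pieces.

```java
import java.util.ArrayList;
import java.util.List;

public class HtmlSplitDemo {

    // Hypothetical helper: cuts the input just after every closing </html> tag.
    // The lookbehind (?<=</html>) matches the empty position after the tag,
    // so the tag itself stays in the preceding part; (?i) ignores case.
    static List<String> splitHtmlParts(String multiHtml) {
        List<String> parts = new ArrayList<>();
        for (String part : multiHtml.split("(?i)(?<=</html>)")) {
            String trimmed = part.trim();
            // Drop whitespace-only pieces, such as the newline after the last tag.
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String input = "<html><body>a</body></html>\n<HTML><body>b</body></HTML>\n";
        for (String part : splitHtmlParts(input)) {
            System.out.println(part);
            System.out.println("-----");
        }
    }
}
```

Each returned string can then be passed to Jsoup.parse(String) as in the example above, so that title() and body() are resolved per part.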
