How to use jsoup to extract body content from an HTML file that contains multiple html tags

Question

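I need to use jsoup to parse an HTML file that contains multiple html tags.

I split the document into its html elements and was able to extract some tags, such as title: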

Document doc = Jsoup.parse(file, "UTF-8");
Elements el = doc.getElementsByTag("html");
for (Element e : el) {
   writer = new PrintWriter(output);
   writer.println(e.select("title"));
   writer.println(e.select("body"));
   writer.close();
}

Output

<title>titletext</title>

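But it seems to ignore the body tag inside each element.

Using Document.body() just dumps the contents of all the body tags together.

Since I cannot get a Document from each Element in order to call body(), how can I extract the body tag from each Element separately?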

Answer 1

Suppose you have a file like this:

<!DOCTYPE html>
<html>
<head>
<title>Page Title 1</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 1.</p>
</body>
</html> 

<html>
<head>
<title>Page Title 2</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 2.</p>
</body>
</html> 

<html>
<head>
<title>Page Title 3</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph on page 3.</p>
</body>
</html>

You can split the file at the end of each html section (</html>) and then parse each part separately. Example:

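import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        // Read the whole file as one string (UTF-8 assumed, as in the question)
        String multihtml = new String(Files.readAllBytes(Paths.get("C:\\users\\manna\\desktop\\test.html")), StandardCharsets.UTF_8);
        // Lookbehind split: cut right after each </html>, keeping the tag in its part
        String[] htmlParts = multihtml.split("(?<=</html>)");

        for (String part : htmlParts) {
            if (part.trim().isEmpty()) {
                continue; // skip whitespace left over after the last </html>
            }
            // Each part holds a single document, so title() and body() refer to it alone
            Document doc = Jsoup.parse(part);
            System.out.println();
            System.out.println("title : " + doc.title());
            System.out.println();
            System.out.println(doc.body().outerHtml());
            System.out.println("******************************************");
        }
    }
}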
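
The splitting step can also be tried on its own, without jsoup. The following is only a minimal sketch: the class name HtmlSplitter and the helper splitDocuments are hypothetical names introduced for illustration, and the (?i) flag is an added assumption so that an upper-case </HTML> is also treated as a boundary. It shows that the lookbehind keeps each closing tag attached to its own part and that whitespace-only leftovers are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class HtmlSplitter {

    // Zero-width lookbehind: matches the position right after "</html>",
    // so the closing tag stays attached to the part it ends.
    // (?i) makes the match case-insensitive (an assumption, not in the original answer).
    private static final Pattern AFTER_CLOSING_HTML = Pattern.compile("(?i)(?<=</html>)");

    // Hypothetical helper: returns one trimmed string per concatenated document.
    static List<String> splitDocuments(String multiHtml) {
        List<String> parts = new ArrayList<>();
        for (String part : AFTER_CLOSING_HTML.split(multiHtml)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) { // skip whitespace left after the last </html>
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String sample = "<html><body><p>page 1</p></body></html>\n"
                + "<HTML><body><p>page 2</p></body></HTML>\n";
        for (String part : splitDocuments(sample)) {
            System.out.println(part);
            System.out.println("-----");
        }
    }
}
```

Each string returned by splitDocuments could then be handed to Jsoup.parse, as in the answer above.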

Solution 1
0 2016-11-28 14:23:28
