How do I parse HTML inside a JavaScript variable with Jsoup in Java?

Question

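I am using Jsoup to parse HTML files and extract all visible text from the elements. The problem is that some pieces of HTML sit inside JavaScript variables, and these are ignored. What is the best way to get at those pieces?

Example: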

<!DOCTYPE html>
<html>
<head>
    <script>
        var html = "<span>some text</span>";
    </script>
</head>
<body>
    <p>text</p>
</body>
</html>

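In this example, Jsoup only picks up the text from the p tag, which is what it should do. How can I get the text from the span in var html? The solution has to work across thousands of different pages, so I cannot rely on the JavaScript variables having the same name.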

Answer 1

You can use Jsoup to get the contents of all <script> tags as DataNode objects.

DataNode

A data node, for contents of style, script tags etc., where contents should not show in text().

 Elements scriptTags = doc.getElementsByTag("script");

This gives you every element with the tag <script>.

You can then use the getWholeData() method to extract the contents of each node.

 // Get the data contents of this node.
 String getWholeData()

 for (Element tag : scriptTags) {
     // Each <script> element keeps its source in one or more DataNode children
     for (DataNode node : tag.dataNodes()) {
         System.out.println(node.getWholeData());
     }
 }
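The script source returned by getWholeData() is still JavaScript, so the markup has to be pulled out of its string literals before it can be parsed. The following is a minimal sketch of one way to do that without depending on variable names, using only java.util.regex. The class name ScriptHtmlLiterals and the method extract are hypothetical names chosen for illustration. The approach is a heuristic: it does not handle template literals or string concatenation, and an apostrophe inside a JavaScript comment can confuse it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
final class ScriptHtmlLiterals {
    // A double- or single-quoted JavaScript string literal; backslash escapes are allowed inside.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"((?:\\\\.|[^\"\\\\])*)\"|'((?:\\\\.|[^'\\\\])*)'", Pattern.DOTALL);

    // Rough check for "looks like markup": a '<' followed by a letter.
    private static final Pattern LOOKS_LIKE_TAG = Pattern.compile("<[a-zA-Z]");

    // Returns the contents of every string literal in the script that appears to contain a tag.
    static List<String> extract(String scriptSource) {
        List<String> fragments = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(scriptSource);
        while (m.find()) {
            // Group 1 is set for double-quoted literals, group 2 for single-quoted ones.
            String body = m.group(1) != null ? m.group(1) : m.group(2);
            // Undo the most common escapes (\" \' \/ \\); other escapes are left untouched.
            String unescaped = body.replaceAll("\\\\([\"'/\\\\])", "$1");
            if (LOOKS_LIKE_TAG.matcher(unescaped).find()) {
                fragments.add(unescaped);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\"; var n = 'plain';";
        System.out.println(extract(script)); // prints [<span>some text</span>]
    }
}
```

Each returned fragment could then be handed back to Jsoup, for example with Jsoup.parseBodyFragment(fragment).text(), to obtain the visible text.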

Jsoup API - DataNode

Answer 2

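I am not entirely sure about this answer, but I have seen a similar situation before.

You can probably combine Jsoup with manual parsing to get the text, as described in that answer.

I have only adapted that code to your specific case: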

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Document doc = ...
Element script = doc.select("script").first(); // Get the first <script> element

// Regex for the value assigned to the variable named html
Pattern p = Pattern.compile("(?is)html = \"(.+?)\"");
// Use html() here, NOT text(): text() does not return the script's contents
Matcher m = p.matcher(script.html());

while (m.find()) {
    System.out.println(m.group());  // the whole match, including "html = "
    System.out.println(m.group(1)); // the value only
}
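Since the question rules out relying on a fixed variable name, the pattern above can be loosened to accept any identifier on the left of the assignment. This is a minimal sketch under that assumption. ScriptVarAssignments and find are hypothetical names, and like the original pattern it stops at the first double quote, so values containing escaped quotes are cut short. It can also match attribute assignments that appear inside other strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that generalizes the fixed "html = ..." pattern.
final class ScriptVarAssignments {
    // <identifier> = "<value>", where the value ends at the first following double quote.
    private static final Pattern ASSIGNMENT =
            Pattern.compile("(?s)([A-Za-z_$][\\w$]*)\\s*=\\s*\"(.+?)\"");

    // Maps each assigned name to its double-quoted string value, in source order.
    static Map<String, String> find(String scriptSource) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = ASSIGNMENT.matcher(scriptSource);
        while (m.find()) {
            values.put(m.group(1), m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String script = "var html = \"<span>some text</span>\"; var other = \"<b>x</b>\";";
        // Print each variable name with the string assigned to it.
        find(script).forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Keeping the variable name alongside the value makes it possible to filter afterwards, for instance by keeping only values that contain a tag.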

Hope this helps.


Answer 1
3 2013-07-29 11:42:33

Answer 2
0 2013-11-02 04:16:53
