Extracting XPath-based content from an HTML page

Question

I'm trying to extract content based on a given XPath. When I want to extract just one element, there is no issue. When a list of items matches that XPath, I get a NodeList and can extract the values.

However, a couple of items are related to each other and form a group, and that group repeats.

One way would be to get the NodeList of the parent nodes of all such groups and then apply a SAX-based parsing technique to extract the information. But this would introduce pattern-specific code, and I want to make it generic. For example:

<html><body>
<!--... a lot divs and other tags ... -->
<div class="divclass">
<item>
     <item_name>blah1</item_name>
     <item_qty>1</item_qty>
     <item_price>100</item_price>
</item>
</div>
<div class="divclass">
<item>
     <item_name>blah2</item_name>
     <item_qty>2</item_qty>
     <item_price>200</item_price>
</item>
</div>
<div class="divclass">
<item>
     <item_name>blah3</item_name>
     <item_qty>3</item_qty>
     <item_price>300</item_price>
</item>
</div>
</body></html>

I could easily write code for this XML, but not generic code that could parse any given specification.

I should be able to create a list of attribute-value maps from the above.

Has anyone tried this?

EDIT: List of input XPaths:

1. "html:div[@class='divclass']/item/item_name"
2. "html:div[@class='divclass']/item/item_qty"
3. "html:div[@class='divclass']/item/item_price"

Expected output in simple text:

 item_name:blah1;item_qty:1;item_price:100
 item_name:blah2;item_qty:2;item_price:200
 item_name:blah3;item_qty:3;item_price:300

The key point here is that if I apply each XPath separately, it fetches results vertically, i.e. the first one fetches all item_names and the second fetches all item_qtys. So I'll lose the correlation between these pieces.

Hope this clears up my requirements.

Thanks, Nayn

Answer 1

I am not sure I understood your question, but it sounds like you want to use XPath on HTML documents.

To use XPath, the HTML document being parsed needs to be well-formed. There are several HTML parsers for Java; this article compares 4 of them.

HtmlCleaner seems to provide what you are after. It allows a subset of XPath to be evaluated on "cleaned-up" HTML documents. Apparently it doesn't support the full set of XPath expressions, though; see the documentation.

If you require more complex XPath expressions than HtmlCleaner supports, you may need to use the javax.xml.xpath package with a well-formed XHTML document. JTidy can convert an HTML document to an XHTML one.
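As a rough illustration of the javax.xml.xpath route, here is a minimal sketch. It assumes the HTML has already been tidied into well-formed markup without namespaces; the class name `XPathColumnDemo`, the helper `selectTexts`, and the shortened `SAMPLE` string are made up for this example. It also shows the "vertical" result the question describes: each absolute XPath returns one column of values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathColumnDemo {
    // Hypothetical sample: a shortened, already well-formed version of the question's markup.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class=\"divclass\"><item><item_name>blah1</item_name><item_qty>1</item_qty></item></div>"
        + "<div class=\"divclass\"><item><item_name>blah2</item_name><item_qty>2</item_qty></item></div>"
        + "</body></html>";

    // Evaluates one XPath against the whole document and returns the text of every matching node.
    public static List<String> selectTexts(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Each call yields one column; the link between a name and its quantity is lost.
        System.out.println(selectTexts(SAMPLE, "//div[@class='divclass']/item/item_name"));
        System.out.println(selectTexts(SAMPLE, "//div[@class='divclass']/item/item_qty"));
    }
}
```

If the document were real XHTML in the XHTML namespace, the factory would need to be namespace-aware and the XPath object would need a NamespaceContext for a prefix such as `html:`; that part is left out of this sketch.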

I hope this answers your question.

Answer 2

I think XQuery is a great solution for screen scraping. You can use the Saxon processor to execute your XQueries. Moreover, you can use the Piggy Bank Firefox extension to easily find the XPath expressions for the content you want to extract from a web page, and then use them within your XQueries.

Answer 3

Why not apply XPath in two steps?

First, an XPath (or several) to get the records (the lines in your output):

//div[@class='divclass']/item

Then the XPath(s) to get the fields (the columns), relative to each record:

item_name
item_qty
item_price

Here's working code (in JavaScript, Windows scripting) that gives you the output you want:

var doc = new ActiveXObject("MSXML.DOMDocument");
doc.load("test.xml");

// XPATH #1
var recordXPath = "//div[@class='divclass']/item";
// XPATHS #2, in a dictionary ("field name":"XPath"), each relative to a record node
var fieldXPaths = { item_name : "item_name",
                    item_qty : "item_qty",
                    item_price : "item_price" };

var items = doc.selectNodes(recordXPath);
for (var itemCtr = 0; itemCtr < items.length; itemCtr++) {
    var item = items[itemCtr];
    var fieldEntries = [];

    for (var fieldName in fieldXPaths) {
        var fieldXPath = fieldXPaths[fieldName];
        var fieldNode = item.selectSingleNode(fieldXPath);
        fieldEntries.push(fieldNode.tagName + ":" + fieldNode.text);
    }
    WScript.Echo(fieldEntries.join(";"));
}
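Since the question asks for a generic list of attribute-value maps, the same two-step idea could be sketched in Java with javax.xml.xpath roughly as follows. The class name `GroupedXPathDemo`, the methods `extract` and `format`, and the compact `SAMPLE` string are assumptions made for this illustration, and the input is assumed to be well-formed markup without namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class GroupedXPathDemo {
    // Hypothetical sample: the question's markup, written compactly and well-formed.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class=\"divclass\"><item><item_name>blah1</item_name>"
        + "<item_qty>1</item_qty><item_price>100</item_price></item></div>"
        + "<div class=\"divclass\"><item><item_name>blah2</item_name>"
        + "<item_qty>2</item_qty><item_price>200</item_price></item></div>"
        + "<div class=\"divclass\"><item><item_name>blah3</item_name>"
        + "<item_qty>3</item_qty><item_price>300</item_price></item></div>"
        + "</body></html>";

    // Step 1 selects the record nodes; step 2 evaluates each field XPath relative to a record.
    public static List<Map<String, String>> extract(
            String xml, String recordXPath, Map<String, String> fieldXPaths) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records = (NodeList) xpath.evaluate(recordXPath, doc, XPathConstants.NODESET);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Map<String, String> row = new LinkedHashMap<>(); // keeps field order
            for (Map.Entry<String, String> field : fieldXPaths.entrySet()) {
                // The record node is the context, so the relative path stays inside this group.
                // A field with no match yields an empty string rather than an exception.
                row.put(field.getKey(), xpath.evaluate(field.getValue(), records.item(i)));
            }
            rows.add(row);
        }
        return rows;
    }

    // Renders one row as "name:value;name:value".
    public static String format(Map<String, String> row) {
        return row.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("item_name", "item_name");
        fields.put("item_qty", "item_qty");
        fields.put("item_price", "item_price");
        for (Map<String, String> row : extract(SAMPLE, "//div[@class='divclass']/item", fields)) {
            System.out.println(format(row));
        }
    }
}
```

The record XPath and the field map are plain data, so a different page layout would only need a different pair of inputs, not different code.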

Answer 4

I don't understand what you want to achieve and how it relates to XPath. If you want to map XML to Java objects then JAXB might help, but it is based on XML schemas, not on XPath.

Answer 5

I don't know if this helps, but I use XSLT to go the other way, from data to HTML. It seems to me that you just need to structure the XPath execution a little, and XSLT is good for this.
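One way this XSLT suggestion might look, run through the JDK's built-in javax.xml.transform API, is sketched below. The stylesheet, the class name `XsltRowsDemo`, the method `toRows`, and the shortened `SAMPLE` input are all assumptions for illustration; the outer `xsl:for-each` picks the records and the inner one walks the fields of each record.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltRowsDemo {
    // Hypothetical sample: two groups from the question, compact and well-formed.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class=\"divclass\"><item><item_name>blah1</item_name>"
        + "<item_qty>1</item_qty><item_price>100</item_price></item></div>"
        + "<div class=\"divclass\"><item><item_name>blah2</item_name>"
        + "<item_qty>2</item_qty><item_price>200</item_price></item></div>"
        + "</body></html>";

    // Illustrative XSLT 1.0 stylesheet: one text line per record, "name:value" pairs joined by ";".
    public static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select=\"//div[@class='divclass']/item\">"
        + "<xsl:for-each select='*'>"
        + "<xsl:value-of select='name()'/><xsl:text>:</xsl:text><xsl:value-of select='.'/>"
        + "<xsl:if test='position() != last()'><xsl:text>;</xsl:text></xsl:if>"
        + "</xsl:for-each>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given well-formed markup and returns the text output.
    public static String toRows(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toRows(SAMPLE));
    }
}
```

Here the grouping logic lives in the stylesheet, so supporting another page layout would mean swapping the stylesheet rather than changing the Java code.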

5 solutions

Solution 1
3 2010-08-21 13:30:13

Solution 2
2 accepted 2010-08-25 18:20:46

Solution 3
1 2010-08-25 17:12:59

Solution 4
0 2010-07-29 16:07:44

Solution 5
0 2010-08-25 22:40:36
