
Parsing HTML -> XML and querying with XPath

I want to parse an HTML page to get some data. First, I convert it to an XML document using SgmlReader. Then I load the result into an XmlDocument and navigate it with XPath:

//contains html document
var loadedFile = LoadWebPage();

...

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;

sgmlReader.InputStream = new StringReader(loadedFile);

XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

This code works fine in most cases, except on this site: www.arrow.com (try searching for something like OP295GS). I can get the results table using the following XPath:

var node = doc.SelectSingleNode(".//*[@id='results-table']");

This gives me a node with several child nodes:

[0]         {Element, Name="thead"}  
[1]         {Element, Name="tbody"}  
[2]         {Element, Name="tbody"}  
FirstChild   {Element, Name="thead"}

OK, let's try to get some child nodes using XPath. But this doesn't work:

var childNodes = node.SelectNodes("tbody");
// childNodes.Count = 0

This doesn't work either:

var childNode = node.SelectSingleNode("thead");
// childNode = null

And even this fails:

var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");

What can be wrong with my XPath queries?


I've just tried parsing that HTML page with Html Agility Pack, and my XPath queries work well. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.


I even tried the following trick with Html Agility Pack, but the XPath queries don't work there either:

//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));

XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);

Perhaps the web page contains errors (not all tags are closed, and so on). Even so, I can see the child nodes (through Quick Watch in Visual Studio), but I cannot access them through XPath.


My XPath queries work correctly in Firefox with the FirePath and XPather plugins, but they don't work in .NET XmlDocument :(

I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node declares a namespace (note the xmlns:javaurlencoder):

<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">

The code at the end of this answer is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for or any of its parents has a namespace, you must create an XmlNamespaceManager and pass it along with your call to SelectNodes().

This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into an XmlDocument. Then you won't need to fool with XmlNamespaceManager!
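The stripping idea could look roughly like the following sketch, which is separate from the answer's own code that follows it. It is only an illustration under stated assumptions: the sample markup is a hypothetical stand-in for the converted page, and LINQ to XML (XDocument) is swapped in as one convenient way to drop namespaces, not something the answer prescribes.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical input standing in for the converted page (not the real arrow.com markup).
string xml =
    "<html xmlns='http://www.w3.org/1999/xhtml'>" +
    "<form name='CatSearchForm' xmlns:javaurlencoder='java.net.URLEncoder'>" +
    "<table id='results-table'><thead/><tbody/><tbody/></table>" +
    "</form></html>";

XDocument xdoc = XDocument.Parse(xml);

// Snapshot the elements first, because the loop modifies the tree.
foreach (XElement el in xdoc.Descendants().ToList())
{
    // Remove xmlns and xmlns:* declarations from this element.
    el.Attributes().Where(a => a.IsNamespaceDeclaration).Remove();

    // Rename the element to its local name, which puts it in no namespace.
    // Attributes that carry a prefix are left untouched in this sketch.
    el.Name = el.Name.LocalName;
}

// Load the namespace-free tree into the XmlDocument the application expects.
var doc = new XmlDocument();
using (XmlReader reader = xdoc.CreateReader())
{
    doc.Load(reader);
}

// Plain, unprefixed XPath can now address the children directly.
int tbodyCount = doc.SelectNodes("//*[@id='results-table']/tbody").Count;
Console.WriteLine(tbodyCount);
```

With the asker's setup, XDocument.Load(sgmlReader) would presumably take the place of XDocument.Parse, since the reader is already passed to XmlDocument.Load as an XmlReader; that substitution is an assumption and has not been tried against the real page.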

XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");

Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
    if (n.NodeType != XmlNodeType.Element) continue;

    if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
    {
        namespaces.Add(n.Name, n.NamespaceURI);
    }
}

// Inspect the namespaces dictionary to write the code below

XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI); 
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder"); 

XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
    // Do stuff
}
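To make the namespace explanation concrete, here is a minimal, self-contained sketch of the symptom from the question. It assumes the converted document puts its elements in a default namespace; the XHTML URI and the sample markup are hypothetical stand-ins rather than the real page. The local-name() query at the end is an alternative that the answer above does not mention.

```csharp
using System;
using System.Xml;

// Hypothetical stand-in for the converted page: the root declares a default
// namespace, so every descendant element inherits it.
string xml =
    "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
    "<table id='results-table'><thead/><tbody/><tbody/></table>" +
    "</body></html>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// The wildcard matches elements in any namespace, and the unprefixed id
// attribute is in no namespace, so this query finds the table.
XmlNode table = doc.SelectSingleNode("//*[@id='results-table']");
if (table == null) throw new InvalidOperationException("table not found");

// In XPath 1.0 an unprefixed element name means "no namespace",
// so the namespaced tbody elements are not matched.
int plainCount = table.SelectNodes("tbody").Count;

// Bind a prefix of your choice to the element's namespace URI and use it in the query.
var nsMgr = new XmlNamespaceManager(doc.NameTable);
nsMgr.AddNamespace("x", table.NamespaceURI);
int prefixedCount = table.SelectNodes("x:tbody", nsMgr).Count;

// Alternative: compare local names only, which ignores the namespace.
int localNameCount = table.SelectNodes("*[local-name()='tbody']").Count;

Console.WriteLine($"{plainCount} {prefixedCount} {localNameCount}"); // prints "0 2 2"
```

This would also explain why the children show up in Quick Watch yet stay invisible to the plain "tbody" query: the nodes exist, but their names are namespace-qualified.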

To be honest, when I try to get information from a website I use regex. OK, Kore Nordmann (in his PHP blog) thinks this is not good, but some of the comments say otherwise.

http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html

http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

But it is in PHP, so sorry for this =) Hope it helps anyway.
