
C# - Best Approach to Parsing Webpage?

I've saved an entire webpage's HTML to a string, and now I want to grab the "href" values from the links, preferably with the ability to save them to different strings later. What's the best way to do this?

I've tried saving the string as an .xml doc and parsing it using an XPathDocument navigator, but (surprise surprise) it doesn't navigate a not-really-an-xml-document too well.

Are regular expressions the best way to achieve what I'm trying to accomplish?

I can recommend the HTML Agility Pack. I've used it in a few cases where I needed to parse HTML and it works great. Once you load your HTML into it, you can use XPath expressions to query the document and get your anchor tags (as well as just about anything else in there).

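HtmlDocument yourDoc = new HtmlDocument();
yourDoc.LoadHtml(html); // html: the string holding the page source
// SelectNodes returns null (not an empty collection) when nothing matches the XPath
int someCount = yourDoc.DocumentNode.SelectNodes("your_xpath")?.Count ?? 0;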
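For the href case specifically, a minimal sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced and uses top-level statements (C# 9 or later); `GetHrefs` and the sample markup are hypothetical names made up for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical helper: returns the href value of every <a> element that has one.
List<string> GetHrefs(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var hrefs = new List<string>();
    // SelectNodes returns null when no node matches the XPath expression.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null) return hrefs;

    foreach (HtmlNode anchor in anchors)
        hrefs.Add(anchor.GetAttributeValue("href", ""));
    return hrefs;
}

// Deliberately sloppy markup: unclosed <p>, uppercase tag and attribute names.
string sample = "<p>See <A HREF=\"/docs\">docs</a> and <a href='http://example.com/'>home</a>";
foreach (string href in GetHrefs(sample))
    Console.WriteLine(href);
```

Each value ends up as its own string in the list, so it can be stored or processed separately later.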

Regular expressions are one way to do it, but they can be problematic.

Most HTML pages can't be parsed using standard XML techniques because, as you've found out, most don't validate.

You could spend the time trying to integrate HTML Tidy or a similar tool, but it would be much faster to just build the regex you need.

UPDATE

At the time of this update I've received 15 upvotes and 9 downvotes. I think that maybe people aren't reading the question or the comments on this answer. All the OP wanted to do was grab the href values. That's it. From that perspective, a simple regex is just fine. If the author had wanted to parse other items, there is no way I would recommend regex; as I stated at the beginning, it's problematic at best.
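As a rough illustration of what such a "simple regex" could look like (this is a sketch, not the answerer's own pattern), the snippet below uses only System.Text.RegularExpressions. The helper name `GetHrefsWithRegex` is hypothetical, and the pattern assumes every value is wrapped in single or double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: grabs whatever sits between the quotes after "href=".
List<string> GetHrefsWithRegex(string html)
{
    // Group 1 captures the opening quote; \1 requires the same quote to close the value.
    var pattern = new Regex(@"href\s*=\s*([""'])(.*?)\1",
                            RegexOptions.IgnoreCase | RegexOptions.Singleline);

    var hrefs = new List<string>();
    foreach (Match match in pattern.Matches(html))
        hrefs.Add(match.Groups[2].Value);
    return hrefs;
}

string sample = "<a href=\"a.html\">A</a> <A HREF='b.html'>B</A>";
foreach (string href in GetHrefsWithRegex(sample))
    Console.WriteLine(href);
```

Note that this matches any quoted href, including ones on `<link>` elements or inside plain text, which is part of why the approach is only adequate when the input is known to be simple.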

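To deal with HTML of all shapes and sizes, I prefer to use the HTML Agility Pack @ http://www.codeplex.com/htmlagilitypack. It lets you write XPath expressions against the nodes you want and get them back in a collection.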

Probably you want something like the Majestic parser: http://www.majestic12.co.uk/projects/html_parser.php

There are a few other options that can deal with flaky HTML as well. The Html Agility Pack is worth a look, as someone else mentioned.

I don't think regexes are an ideal solution for HTML, since HTML is not context-free. They'll probably produce an adequate, if imprecise, result; even deterministically identifying a URI is a messy problem.

It is always better, if possible, not to reinvent the wheel. Some good tools exist that either convert HTML to well-formed XML or act as an XmlReader.

Here are three good tools:

  1. TagSoup, an open-source program, is a Java and SAX-based tool, developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
    SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command-line utility is also provided which outputs the well-formed XML result.
    Download the zip file including the standalone executable and the full source code: SgmlReader.zip

  3. An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for every one of us.

From the description:

" d:htmlparse(string)
d:htmlparse(string,namespace,html-mode)

The one argument form is equivalent to
d:htmlparse(string,'http://www.w3.org/1999/xhtml',true())

Parses the string as HTML and/or XML using some inbuilt heuristics to
control implied opening and closing of elements.

It doesn't have full knowledge of HTML DTD but does have full list of
empty elements and full list of entity definitions. HTML entities, and
decimal and hex character references are all accepted. Note html-entities
are recognised even if html-mode=false().

Element names are lowercased (if html-mode is true()) and placed into the
namespace specified by the namespace parameter (which may be "" to denote
no-namespace) unless the input has explicit namespace declarations, in
which case these will be honoured.

Attribute names are lowercased if html-mode=true() "

Read a more detailed description here.

Hope this helped.

Cheers,

Dimitre Novatchev.

I agree with Chris Lively: because HTML is often not very well formed, you are probably best off with a regular expression for this.

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

From here on RegExLib. It should get you started.

You might have more luck using XML if you know the document is at least well-formed, or can fix it to be. If you have good HTML (or rather, XHTML), the XML system in .NET should be able to handle it. Unfortunately, good HTML is extremely rare.
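For the well-formed case, a minimal sketch using LINQ to XML from the base class library might look like this. `GetHrefsFromXhtml` is a hypothetical helper name, and the input is assumed to be XHTML without HTML-only entities such as `&nbsp;`, which a plain XML parser does not know.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: works only when the markup is well-formed XML.
List<string> GetHrefsFromXhtml(string xhtml)
{
    // XDocument.Parse throws XmlException on markup that is not well-formed.
    XDocument doc = XDocument.Parse(xhtml);
    return doc.Descendants()
              .Where(e => e.Name.LocalName == "a")      // match <a> in any namespace
              .Select(e => (string)e.Attribute("href")) // null when the attribute is absent
              .Where(href => href != null)
              .ToList();
}

string good = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<a href=\"/one\">1</a><p><a href=\"/two\">2</a></p></body></html>";
foreach (string href in GetHrefsFromXhtml(good))
    Console.WriteLine(href);

try
{
    GetHrefsFromXhtml("<p><a href=\"/x\">unclosed</p>");
}
catch (XmlException ex)
{
    Console.WriteLine("Not well-formed: " + ex.Message);
}
```

The second call shows the failure mode the question ran into: one mismatched tag is enough to make the whole parse fail.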

On the other hand, regular expressions are really bad at parsing HTML. Fortunately, you don't need to handle the full HTML spec. All you need to worry about is parsing href= strings to get the URL. Even this can be tricky, so I won't make an attempt at it right away. Instead I'll start by asking a few questions to try to establish a few ground rules. They basically all boil down to "How much do you know about the document?", but here goes:

  • Do you know if the "href" text will always be lower case?
  • Do you know if it will always use double quotes, single quotes, or nothing around the URL?
  • Will it always be a valid URL, or do you need to account for things like '#', javascript statements, and the like?
  • Is it possible that you'll work with a document whose content describes HTML features (i.e., href= could also appear in the document text and not belong to an anchor tag)?
  • What else can you tell us about the document?
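To make those questions concrete, here is a sketch of a regex that takes the pessimistic answer to each: any letter case, any quoting style, non-URL values filtered out, and only href attributes inside an `<a ...>` tag. The helper name `GetLinkTargets` and the filtering rules are assumptions chosen for illustration; it still does not understand comments, scripts, or a `>` inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper covering the ground rules listed above.
List<string> GetLinkTargets(string html)
{
    // "<a" followed by anything but ">", then whitespace + href = value, where the
    // value is double-quoted (group 1), single-quoted (group 2) or bare (group 3).
    var anchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))",
        RegexOptions.IgnoreCase);

    var targets = new List<string>();
    foreach (Match match in anchorHref.Matches(html))
    {
        // Exactly one of the three alternatives takes part in each match.
        string value = match.Groups[1].Success ? match.Groups[1].Value
                     : match.Groups[2].Success ? match.Groups[2].Value
                     : match.Groups[3].Value;
        value = value.Trim();

        // Skip empty values, in-page fragments and javascript pseudo-links.
        if (value.Length == 0 || value.StartsWith("#")) continue;
        if (value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) continue;
        targets.Add(value);
    }
    return targets;
}

string sample = "<A HREF=\"http://example.com/a\">a</A> <a class=x href='/b'>b</a> "
              + "<a href=c.html>c</a> <a href=\"#top\">top</a> "
              + "<a href=\"javascript:void(0)\">js</a> <p>Write href=\"fake\" in a tag.</p>";
foreach (string target in GetLinkTargets(sample))
    Console.WriteLine(target);
```

If the answers to the questions are more favourable (for example, always lower case and always double-quoted), the pattern can be simplified accordingly.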

I've linked some code here that will let you use "LINQ to HTML"...

Looking for C# HTML parser
