How to extract data from a website using AngleSharp & LINQ?

I'm trying to extract the prices from the website mentioned below, using AngleSharp for the extraction. On the website, the prices are listed like this (as an example):

<span class="c-price">650.00                            </span>

I'm using the following code for the extraction.

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using AngleSharp.Parser.Html;

//Make the request
var uri = "https://meadjohnson.world.tmall.com/search.htm?search=y&orderType=defaultSort&scene=taobao_shop";
var cancellationToken = new CancellationTokenSource();
var httpClient = new HttpClient();
var request = await httpClient.GetAsync(uri);
cancellationToken.Token.ThrowIfCancellationRequested();

//Get the response stream
var response = await request.Content.ReadAsStreamAsync();
cancellationToken.Token.ThrowIfCancellationRequested();

//Parse the stream
var parser = new HtmlParser();
var document = parser.Parse(response);

//Do something with LINQ
var pricesListItemsLinq = document.All
     .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));
Console.WriteLine(pricesListItemsLinq.Count());

However, I'm not getting any items, even though they are there on the website. What am I doing wrong? If AngleSharp isn't the recommended method, what should I use, and what code should I use?

I am late to the party, but I'll try to bring some sanity here.

Querying static webpages

For this we require the following set of tools / functionality:

  • HTTP requester (to obtain resources, e.g., HTML documents, via HTTP), potentially with an SSL/TLS layer on top (either accepting all certificates or working against the certificate store / known CAs)
  • HTML parser
  • A queryable object model representation of the parsed HTML document
  • Maybe additionally some cookie state and the ability to follow links / post forms

AngleSharp gives us all these options (minus a connection to the certificate store / known CAs; so in order to use HTTPS we must do some additional configuration, e.g., to accept all certificates).

We would start by creating an AngleSharp configuration that defines which capabilities are available to the browsing engine. This engine is exposed in the form of a "browsing context", which can be regarded as a headless tab. In this tab we can open a new document (either from a local source, a constructed source, or a remote source).

var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("http://example.com");

Once we have the document we can use CSS query selectors to obtain certain elements. These elements can be used to gather the information we are looking for.
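As a rough sketch of that last step, the snippet below parses an inline HTML fragment (a stand-in for the downloaded page, not the real markup of the site) and reads the price text out of the matching elements. The names PriceExtractor and ExtractPricesAsync and the sample markup are hypothetical, and it assumes a recent AngleSharp release in which OpenAsync accepts a virtual response.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class PriceExtractor
{
    // Hypothetical helper: parses the given HTML and returns every price
    // found in a <span class="c-price"> element.
    public static async Task<List<decimal>> ExtractPricesAsync(string html)
    {
        // No loader is needed here because the content is supplied directly.
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        return document.QuerySelectorAll("span.c-price")
            .Select(element => element.TextContent.Trim()) // drop the padding whitespace
            .Select(text => decimal.Parse(text, CultureInfo.InvariantCulture))
            .ToList();
    }

    public static async Task Main()
    {
        // Sample markup modelled on the snippet from the question (assumption).
        var html = "<span class=\"c-price\">650.00      </span>" +
                   "<span class=\"c-price\">128.50</span>";

        foreach (var price in await ExtractPricesAsync(html))
        {
            Console.WriteLine(price);
        }
    }
}
```

Trimming before parsing matters here, since the price text on the page is followed by a long run of whitespace.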

AngleSharp embraces LINQ (or IEnumerable in general); however, it makes sense to let the CSS queries do as much of the work as possible.

So instead of

var pricesListItemsLinq = document.All
    .Where(m => m.LocalName == "span" && m.ClassList.Equals("c-price"));

we write

var pricesListItemsLinq = document.QuerySelectorAll("span.c-price");

This is also much more robust. ClassList is a complex object giving access to the list of classes, so you probably meant either ClassList.Contains or ClassName.Equals (the latter operating on the string representation). Note: the two versions are not equivalent. The former looks for a class within the list of classes, while the latter matches the whole serialized class attribute, which imposes an extra condition on the match: it needs to be the only class.
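To make the difference concrete, here is a small sketch that counts matches with each approach on an inline fragment. The class ClassMatchDemo, the method CountAsync, and the sample markup (including the extra "sale" class) are made up for illustration; it assumes a recent AngleSharp release in which OpenAsync accepts a virtual response.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;

public static class ClassMatchDemo
{
    // Hypothetical helper: counts <span> matches using the selector,
    // ClassList.Contains, and a whole-string comparison on ClassName.
    public static async Task<(int BySelector, int ByContains, int ByClassName)> CountAsync(string html)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(html));

        var spans = document.All.Where(m => m.LocalName == "span").ToList();

        return (
            document.QuerySelectorAll("span.c-price").Length,
            spans.Count(m => m.ClassList.Contains("c-price")),
            spans.Count(m => m.ClassName == "c-price"));
    }

    public static async Task Main()
    {
        // Three spans: one with only c-price, one with an extra class, one unrelated.
        var html = "<span class=\"c-price\">650.00</span>" +
                   "<span class=\"c-price sale\">599.00</span>" +
                   "<span class=\"other\">n/a</span>";

        var counts = await CountAsync(html);

        Console.WriteLine(counts.BySelector);  // 2: selector matches both priced spans
        Console.WriteLine(counts.ByContains);  // 2: class is present in both class lists
        Console.WriteLine(counts.ByClassName); // 1: only the span whose class attribute is exactly "c-price"
    }
}
```

The original ClassList.Equals("c-price") is different again: it compares the class list object itself with a string, so it is not expected to match any element.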

Dealing with dynamic pages

This is far more complicated. The basics are the same as before, but the engine needs to deliver a lot more than just the previously mentioned requirements. Additionally, we need

  • A JavaScript engine
  • A valid CSSOM
  • A fake (or even fully computed) rendering tree
  • A lot more of the DOM interfaces that can be found in real browsers (e.g., navigator, full history, web workers, ...); the list is limitless here

While there is a project that delivers an experimental (and limited) C#-only JS engine to AngleSharp, the latter two requirements cannot be completely fulfilled right now. Furthermore, the CSSOM may not be complete enough for some web applications. Keep in mind that these pages are potentially designed for real browsers. They make certain assumptions. They may even require user input (e.g., Google Captcha).

Long story short:

var config = Configuration.Default
    .WithDefaultLoader()
    .WithCss()
    .WithJavaScript(); // maybe even more
var context = BrowsingContext.New(config);

The Task behind the await when opening a new document is equivalent to a load event in the DOM. Thus it will not complete as soon as the document has been downloaded and parsed, but only once all scripts have been loaded (and potentially run), including any resources that needed to be downloaded.

Hope this helps a bit!
