Fetch some data from a webpage

I have used this tutorial to fetch all the content of a webpage via C# code.

I now want to gather into an IEnumerable collection all the strings that appear inside the following text pattern (i.e. MY-TEXT):

data-address=" MY-TEXT "></

How can I do that? I tried using string.Split(), but got too much "white noise".

Any ideas?

A better solution is to use HtmlAgilityPack and let it handle the parsing/scraping for you. Here is an example:

var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");

var nodes = doc.DocumentNode.SelectNodes("//[@data-address]");

foreach (var node in nodes)
{
    Console.WriteLine(node.Attributes["data-address"].Value);
}

This will fetch stackoverflow.com, find all elements that have a data-address attribute, and then print the value of that attribute.

If the page is well-formed, load the content into an XDocument and query it with LINQ to XML.
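
The XDocument suggestion could look roughly like the sketch below. It only works when the markup is well-formed XML, which many real pages are not; XDocument.Parse throws an XmlException otherwise. The class name DataAddressXml, the method name Extract, and the sample markup are made up for illustration, and the Trim() call is an assumption based on the spaces around MY-TEXT in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DataAddressXml
{
    // Returns the trimmed value of every data-address attribute
    // found in a well-formed XML/XHTML string.
    public static IEnumerable<string> Extract(string xml)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed.
        XDocument doc = XDocument.Parse(xml);

        return doc.Descendants()
                  .Select(e => e.Attribute("data-address"))
                  .Where(a => a != null)          // skip elements without the attribute
                  .Select(a => a.Value.Trim())    // assumption: surrounding spaces are unwanted
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup, not taken from a real page.
        const string sample =
            "<div><span data-address=\" MY-TEXT \"></span><p>noise</p>" +
            "<a data-address=\"OTHER\"></a></div>";

        foreach (string value in Extract(sample))
        {
            Console.WriteLine(value);
        }
    }
}
```

For pages that are not well-formed, an HTML parser such as HtmlAgilityPack is the safer choice.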

@alexn is right. A small correction though:

  var nodes = doc.DocumentNode.SelectNodes("//*[@data-address]");

I added the *, because the XPath needs a node test before the [@data-address] predicate.
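
Putting the answers together, a minimal sketch that returns the values as an IEnumerable<string>, as the question asks, might look like this. It assumes the HtmlAgilityPack NuGet package is installed. The class name DataAddressScraper, the method name Extract, and the sample HTML are hypothetical, and Trim() is again an assumption about the unwanted spaces around MY-TEXT.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class DataAddressScraper
{
    // Returns the trimmed data-address value of every element that has one.
    public static IEnumerable<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@data-address]");
        if (nodes == null)
        {
            return Enumerable.Empty<string>();
        }

        return nodes
            .Select(n => n.GetAttributeValue("data-address", string.Empty).Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample HTML, not taken from a real page.
        const string sample =
            "<div><span data-address=\" MY-TEXT \"></span><p>noise</p></div>";

        foreach (string value in Extract(sample))
        {
            Console.WriteLine(value);
        }
    }
}
```

To run it against a live page, load the document with new HtmlWeb().Load(url) as in the first answer and apply the same XPath to doc.DocumentNode.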
