How to build a .NET web scraper for news articles about people

I am looking to create a simple web service that crawls pages on specific websites and looks for a person's name. Does anybody know of any examples of this, or can anyone help me get started?

Edit: I should mention that I want to do this in C# with Visual Studio. I will only be looking at English-language news sites that I specify.

Here is a simple function that returns true if a web page contains a person's name:

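// Downloads the page at `url` and reports whether its raw HTML contains the name.
// The method name is illustrative; the match is case-sensitive.
static bool PageContainsName(string url)
{
    string response;
    using (System.Net.WebClient wc = new System.Net.WebClient())
    {
        response = wc.DownloadString(url);
    }
    return response.Contains("John Doe");
}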

To find the links within the page, see this question: Parse HTML links using C#
You can collect the distinct URLs throughout the site and run the code above on each URL you find.
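
As a rough illustration of the "collect distinct URLs" step, here is a minimal sketch that pulls the links out of a page's HTML, resolves them against the page's own address, keeps only those on the same host, and de-duplicates them. The class and method names (LinkCollector, ExtractSameSiteLinks) are made up for this example, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkCollector
{
    // Naive matcher for <a ... href="..."> or href='...'.
    // A real HTML parser copes with malformed markup far better than a regex.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\s[^>]*?href\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the distinct absolute http(s) URLs that are on the same host as pageUrl.
    public static HashSet<string> ExtractSameSiteLinks(string html, Uri pageUrl)
    {
        var links = new HashSet<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Resolve relative links such as "/news/1" against the page URL.
            if (!Uri.TryCreate(pageUrl, m.Groups[1].Value, out Uri absolute))
                continue;
            // Skip mailto:, javascript: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Stay on the site being crawled.
            if (!string.Equals(absolute.Host, pageUrl.Host, StringComparison.OrdinalIgnoreCase))
                continue;
            // Drop the #fragment so the same page is not collected twice.
            links.Add(absolute.GetLeftPart(UriPartial.Query));
        }
        return links;
    }
}
```

Each URL in the returned set can then be downloaded and checked with the name test shown earlier. Keeping a second set of already-visited URLs stops the crawl from looping when pages link back to each other.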

Also, type this into Google to see what it finds: site:www.somesite.com "John Doe"

Using C#, your best option for a crawler and parser (the two parts of your solution) would be the functionality exposed by the Html Agility Pack, which can be found on CodePlex.

Refer to this answer for an example usage scenario: How to use HTML Agility pack
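
For a sense of what the parsing side might look like, here is a small sketch using the Html Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (PageInspector, MentionsName, GetHrefs) are invented for this example, and the HTML is expected to have been downloaded already.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of the library.
public static class PageInspector
{
    // True if the page's text content mentions the name, ignoring case.
    // InnerText drops the tags, so "John <b>Doe</b>" still matches "John Doe".
    // Note that it also includes the text of <script> and <style> elements.
    public static bool MentionsName(string html, string name)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return text.IndexOf(name, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Returns the raw href value of every <a> element that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var hrefs = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return hrefs;
        foreach (HtmlNode a in anchors)
            hrefs.Add(a.GetAttributeValue("href", ""));
        return hrefs;
    }
}
```

Searching the text content rather than the raw HTML avoids missing a name that is split across inline tags, which a plain Contains on the downloaded string would not catch.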
