
Scraping data from a website with a C# console application

I'm trying to learn Spanish and am making some flash cards (for my personal use) to help me learn the verbs.

Here is an example page. Near the top of the page you will see the past participle (bloqueado) and the gerund (bloqueando). These are the two values I want to obtain in my code and use for my flash cards.

If this is possible, I will use a C# console application. I am aware that scraping data from a website is not ideal; however, this is a one-off.

Any guidance on how to start something like this, and on pitfalls to avoid, would be very helpful!

I know this isn't an exact answer, but here is the process I would suggest.

  1. Get Wget (https://www.gnu.org/software/wget/) and mirror the website to a folder. Wget is a web spider and will follow the links on the site until it has downloaded everything. You'll have to run it with a few different parameters until you find the settings you want.
  2. Use C# to run through each file in the folder and extract the words from <section class="verb-mood-section"> in each file. It's up to you whether to output them to the console or store them in a database or flat file.

It should be that easy, in theory.
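The second step above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the mirrored pages end in .html, each page has at most one <section class="verb-mood-section"> with no nested <section> inside it, and the folder name "mirror" and the class and method names are hypothetical. It uses a regular expression purely to stay dependency-free; a real HTML parser would be more robust on messy markup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class VerbSectionExtractor
{
    // Matches <section class="verb-mood-section"> ... </section>.
    // Assumption: class is the first attribute and no <section> is nested inside.
    private static readonly Regex SectionPattern = new Regex(
        "<section\\s+class=\"verb-mood-section\"[^>]*>(.*?)</section>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    private static readonly Regex TagPattern = new Regex("<[^>]+>");
    private static readonly Regex WhitespacePattern = new Regex("\\s+");

    // Writes the section's plain text to 'text'; returns false when no section is found.
    public static bool TryExtractSectionText(string html, out string text)
    {
        text = string.Empty;
        Match match = SectionPattern.Match(html);
        if (!match.Success) return false;

        // Replace tags with spaces, decode entities such as &amp;, collapse whitespace.
        string withoutTags = TagPattern.Replace(match.Groups[1].Value, " ");
        string decoded = WebUtility.HtmlDecode(withoutTags);
        text = WhitespacePattern.Replace(decoded, " ").Trim();
        return true;
    }

    // Walks the mirrored folder and yields (file path, section text) pairs.
    public static IEnumerable<KeyValuePair<string, string>> ExtractFromFolder(string folder)
    {
        // Assumption: mirrored pages were saved with an .html extension.
        foreach (string path in Directory.EnumerateFiles(folder, "*.html", SearchOption.AllDirectories))
        {
            string text;
            if (TryExtractSectionText(File.ReadAllText(path), out text))
                yield return new KeyValuePair<string, string>(path, text);
        }
    }

    public static void Main(string[] args)
    {
        // "mirror" is a hypothetical default; pass the real folder as the first argument.
        string folder = args.Length > 0 ? args[0] : "mirror";
        if (!Directory.Exists(folder))
        {
            Console.Error.WriteLine("Folder not found: " + folder);
            return;
        }
        foreach (KeyValuePair<string, string> entry in ExtractFromFolder(folder))
            Console.WriteLine(entry.Key + ": " + entry.Value);
    }
}
```

Writing to the console keeps the sketch short; swapping Console.WriteLine for a file or database write is the only change needed to store the results instead.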

Use SgmlReader. SgmlReader is a versatile and robust component that reads HTML and exposes it through the XmlReader API:

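using System.IO;
using System.Xml;

// Requires the SgmlReader library (Sgml namespace).
XmlDocument FromHtml(TextReader reader) {
    // Set up SgmlReader to parse the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower; // lower-case element names
    sgmlReader.InputStream = reader;

    // Load the parsed HTML into an XmlDocument
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null; // do not resolve external resources such as DTDs
    doc.Load(sgmlReader);
    return doc;
}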

You can see that you need to create a TextReader first. In practice this would be a StreamReader, since TextReader is an abstract class.
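As a small illustration of that point, the sketch below shows two concrete TextReader subclasses that could be passed to a method such as FromHtml: StreamReader for a file on disk and StringReader for HTML already held in memory. The class name, the FirstLine helper, and the temporary file name are hypothetical and exist only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderExample
{
    // Accepts any TextReader, just as FromHtml does, and returns its first line.
    public static string FirstLine(TextReader reader)
    {
        return reader.ReadLine() ?? string.Empty;
    }

    public static void Main()
    {
        // StringReader wraps HTML that is already in memory.
        using (TextReader fromString = new StringReader("<html>\n<body></body>\n</html>"))
            Console.WriteLine(FirstLine(fromString));

        // StreamReader wraps a file; this temp file is a hypothetical stand-in
        // for one of the mirrored pages.
        string path = Path.Combine(Path.GetTempPath(), "bloquear.html");
        File.WriteAllText(path, "<html><body>bloqueando</body></html>", Encoding.UTF8);
        using (TextReader fromFile = new StreamReader(path, Encoding.UTF8))
            Console.WriteLine(FirstLine(fromFile));
    }
}
```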

Then you create the XmlDocument over that. Once the HTML is loaded into the XmlDocument, you can use its various methods to isolate and extract the nodes you need. I'll leave you to explore that aspect of it.
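One way to isolate nodes, offered as a sketch rather than the only approach, is an XPath query through XmlDocument.SelectNodes. The example below loads a small well-formed snippet with LoadXml so that it runs without SgmlReader; with real pages the document would come from FromHtml instead. The class name, the sample markup, and the assumption that the elements are in no XML namespace are mine, and the section class name is taken from the earlier answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlDocumentQuery
{
    // Returns the trimmed text of every <section class="verb-mood-section">.
    // Assumption: elements are lower-case and not in an XML namespace.
    public static List<string> SectionTexts(XmlDocument doc)
    {
        var results = new List<string>();
        XmlNodeList nodes = doc.SelectNodes("//section[@class='verb-mood-section']");
        if (nodes == null) return results;

        foreach (XmlNode node in nodes)
            results.Add(node.InnerText.Trim());
        return results;
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for a parsed page.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><body>" +
            "<section class=\"verb-mood-section\"><p>bloqueando</p></section>" +
            "<section class=\"other\"><p>ignored</p></section>" +
            "</body></html>");

        foreach (string text in SectionTexts(doc))
            Console.WriteLine(text);
    }
}
```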

You might try using the XDocument class, as it's a lot easier to handle than XmlDocument, especially if you're a newbie. It also supports LINQ.
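For comparison, here is a minimal sketch of the same kind of lookup using XDocument and LINQ. It parses a small well-formed snippet so that it runs on its own; since XDocument.Load accepts an XmlReader, a document could presumably be loaded from an SgmlReader in the same way. The class name and sample markup are hypothetical, and the sketch again assumes the elements are in no XML namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XDocumentQuery
{
    // Returns the trimmed text of every <section class="verb-mood-section">.
    public static List<string> SectionTexts(XDocument doc)
    {
        return doc.Descendants("section")
                  // The cast yields null when the attribute is missing, so no null check is needed.
                  .Where(s => (string)s.Attribute("class") == "verb-mood-section")
                  .Select(s => s.Value.Trim())
                  .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup standing in for a parsed page.
        XDocument doc = XDocument.Parse(
            "<html><body>" +
            "<section class=\"verb-mood-section\"><p>bloqueado</p></section>" +
            "<section><p>ignored</p></section>" +
            "</body></html>");

        foreach (string text in SectionTexts(doc))
            Console.WriteLine(text);
    }
}
```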

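Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.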
