Scraping data from a website using a C# console application

Question

I'm trying to learn Spanish and am making some flash cards (for my personal use) to help me learn the verbs.

Here is an example page. Near the top of the page you will see the past participle (bloqueado) and the gerund (bloqueando). It is these two values that I wish to obtain in my code and use for my flash cards.

If this is possible, I will use a C# console application. I am aware that scraping data from a website is not ideal; however, this is a one-off.

Any guidance on how to start something like this, and on pitfalls to avoid, would be very helpful!

Answer 1

I know this isn't an exact answer, but here is the process I would suggest.

Use Wget (https://www.gnu.org/software/wget/) to mirror the website to a folder. Wget is a web spider and will follow the links on the site until it has downloaded everything. You'll have to run it with a few different parameters until you find the settings you want.
Use C# to run through each file in the folder and extract the words from <section class="verb-mood-section"> in each file. It's your choice whether to output them to the console or store them in a database or flat file.
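
The second step could be sketched roughly like this. It is only a minimal illustration, not a robust parser: it assumes the mirrored pages end in .html, that the verb-mood-section elements are not nested inside each other, and it uses a regular expression instead of a real HTML parser. The folder name "mirror" and the class name MirrorScanner are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class MirrorScanner
{
    // Captures the contents of each <section class="verb-mood-section"> element.
    // Assumption: these sections are not nested inside one another.
    static readonly Regex SectionPattern = new Regex(
        "<section[^>]*class=\"verb-mood-section\"[^>]*>(.*?)</section>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    static readonly Regex TagPattern = new Regex("<[^>]+>");
    static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Returns the visible text of every matching section in one HTML string.
    public static List<string> ExtractSections(string html)
    {
        var results = new List<string>();
        foreach (Match match in SectionPattern.Matches(html))
        {
            // Replace tags with spaces, decode entities, then collapse whitespace.
            string text = TagPattern.Replace(match.Groups[1].Value, " ");
            text = WebUtility.HtmlDecode(text);
            results.Add(WhitespacePattern.Replace(text, " ").Trim());
        }
        return results;
    }

    static void Main(string[] args)
    {
        // Hypothetical folder name; pass the real mirror folder as the first argument.
        string folder = args.Length > 0 ? args[0] : "mirror";
        if (!Directory.Exists(folder))
        {
            Console.Error.WriteLine("Folder not found: " + folder);
            return;
        }

        foreach (string path in Directory.EnumerateFiles(folder, "*.html", SearchOption.AllDirectories))
        {
            foreach (string section in ExtractSections(File.ReadAllText(path)))
                Console.WriteLine(Path.GetFileName(path) + ": " + section);
        }
    }
}
```

Writing to the console keeps the sketch short; swapping Console.WriteLine for File.AppendAllText would give the flat-file variant.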

Should be that easy, in theory.

Answer 2

Use SgmlReader. SgmlReader is a versatile and robust component that will stream HTML to an XmlReader:

XmlDocument FromHtml(TextReader reader) {

    // setup SgmlReader
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}

You can see that you need to create a TextReader first. In practice this would be a StreamReader, since TextReader is an abstract class.
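
As a small illustration of that point, here is a sketch showing two concrete TextReader types that a method such as FromHtml could be handed: a StreamReader over a saved file and a StringReader over HTML already in memory. The sample file name and the FirstLine helper are invented for the example, and SgmlReader itself is left out so the snippet runs without extra packages.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderDemo
{
    // Accepts any TextReader, as FromHtml does, and returns its first line.
    public static string FirstLine(TextReader reader)
    {
        return reader.ReadLine();
    }

    static void Main()
    {
        // Hypothetical sample file standing in for a downloaded page.
        string path = Path.Combine(Path.GetTempPath(), "bloquear-sample.html");
        File.WriteAllText(path, "<html>\n<body>bloqueando</body>\n</html>", Encoding.UTF8);

        // StreamReader: a TextReader over a file (or any other Stream).
        using (TextReader fileReader = new StreamReader(path, Encoding.UTF8))
            Console.WriteLine(FirstLine(fileReader));

        // StringReader: a TextReader over HTML that is already held in a string.
        using (TextReader stringReader = new StringReader("<html><body/></html>"))
            Console.WriteLine(FirstLine(stringReader));
    }
}
```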

Then you create the XmlDocument over that. Once you've got the HTML into an XmlDocument, you can use the various methods XmlDocument supports to isolate and extract the nodes you need. I'll leave you to explore that aspect of it.
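
As one possible starting point, SelectNodes with an XPath expression can pick out the sections. This is a sketch under assumptions: the sample markup below is hand-written and well-formed rather than real output from SgmlReader, and it assumes the elements come out lowercase and without an XML namespace. The class name XmlDocumentDemo is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XmlDocumentDemo
{
    // Returns the text of every <section class="verb-mood-section"> in the document.
    public static List<string> SectionTexts(XmlDocument doc)
    {
        var texts = new List<string>();
        XmlNodeList nodes = doc.SelectNodes("//section[@class='verb-mood-section']");
        foreach (XmlNode node in nodes)
            texts.Add(node.InnerText.Trim());
        return texts;
    }

    static void Main()
    {
        // Hand-written sample standing in for the document returned by FromHtml.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html><body>" +
            "<section class='verb-mood-section'><p>bloqueando</p></section>" +
            "<section class='other'><p>ignored</p></section>" +
            "</body></html>");

        foreach (string text in SectionTexts(doc))
            Console.WriteLine(text);
    }
}
```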

You might try using the XDocument class instead, as it's a lot easier to handle than XmlDocument, especially if you're a newbie. It also supports LINQ.
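
For comparison, a rough XDocument version of the same lookup using LINQ. The markup is again a hand-written, well-formed sample, and the class name XDocumentDemo is made up. Since XDocument.Load has an overload that takes an XmlReader, it should be possible to pass the SgmlReader to it directly, though that is not tried here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XDocumentDemo
{
    // Returns the text of every <section class="verb-mood-section"> in the document.
    public static List<string> SectionTexts(XDocument doc)
    {
        return doc.Descendants("section")
                  .Where(s => (string)s.Attribute("class") == "verb-mood-section")
                  .Select(s => s.Value.Trim())
                  .ToList();
    }

    static void Main()
    {
        // Hand-written sample standing in for a parsed page.
        XDocument doc = XDocument.Parse(
            "<html><body>" +
            "<section class='verb-mood-section'><p>bloqueado</p></section>" +
            "<section class='other'><p>ignored</p></section>" +
            "</body></html>");

        foreach (string text in SectionTexts(doc))
            Console.WriteLine(text);
    }
}
```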

2 answers

Answer 1
0 2017-04-06 12:54:15

Answer 2
0 2019-01-14 13:11:07