
How to load only the HTML using the C# WebBrowser control

I'm using the C# WebBrowser control to scrape data from a website. The problem is that it takes about 20 minutes to get about 250 records.

What I do programmatically is:

1- Get all years inside the dropdown.

2- For each year, run a search and scrape the data from the results table.

3- The first cell of each row is a link (details) and the rest of the cells have basic information. So I get the basic information, open the details link in a new browser, and get the details.

4- Loop through step 3.

I ran a performance test on the program and saw that most of the time is spent waiting for the document to load. If I skip scraping data from the details pages, it takes 1.5 minutes to scrape all the data. I use the following method to wait for the document to complete before I start scraping.

public async Task WaitPageLoad(int timeOut)
{
    var pageLoaded = new TaskCompletionSource<bool>();

    // Complete the task once the document has fully loaded.
    WebBrowserDocumentCompletedEventHandler handler = (s, e) =>
    {
        if (ReadyState == WebBrowserReadyState.Complete) pageLoaded.TrySetResult(true);
    };

    DocumentCompleted += handler;
    try
    {
        // Wait for the load to finish or for timeOut seconds, whichever comes first.
        await Task.WhenAny(pageLoaded.Task, Task.Delay(TimeSpan.FromSeconds(timeOut)));
    }
    finally
    {
        // Unsubscribe so handlers do not accumulate across calls.
        DocumentCompleted -= handler;
    }
}

So I was wondering if there's any way to make the browser load only the HTML and not images or other resources.

Any help is much appreciated!

Why use WebBrowser at all? This is a control used to parse and display content to users. That's not quick by any stretch.

If all you want is the data (and don't intend to display it), you could simply do something like:

//Gets you the HTML for a given URL synchronously
var data = new System.Net.WebClient().DownloadString(url);

However, the above can be more difficult to use depending on the complexity of the page(s) you're trying to scrape.
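Since the details pages account for most of the time in the question, downloading them in parallel may help as well. The following is a minimal sketch, not a drop-in solution: it assumes the details pages are plain server-rendered HTML that needs no JavaScript or login, and the names DetailFetcher, FetchAllAsync, and maxParallel are made up for illustration. It uses HttpClient and SemaphoreSlim to cap the number of simultaneous requests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
public static class DetailFetcher
{
    // One shared HttpClient for all requests.
    private static readonly HttpClient Http = new HttpClient();

    // Downloads every URL with at most maxParallel requests in flight.
    // Results are returned in the same order as the input URLs.
    // The optional fetch delegate replaces the real download (useful for testing).
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        int maxParallel = 8,
        Func<string, Task<string>> fetch = null)
    {
        fetch = fetch ?? (u => Http.GetStringAsync(u));

        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { return await fetch(url); }
                finally { gate.Release(); }      // free the slot for the next URL
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}
```

A call would look like `var pages = await DetailFetcher.FetchAllAsync(detailUrls, 4);`, where detailUrls is the list of links collected from the first cell of each row. Keeping maxParallel small avoids hammering the site.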

For more advanced web scraping I'd recommend grabbing either HtmlAgilityPack or IronWebScraper from NuGet.
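As a rough illustration of the HtmlAgilityPack route, the sketch below parses a downloaded page and pulls out each table row in the shape described in the question: a details link in the first cell and basic information in the rest. The XPath expressions and the RowData and TableScraper names are assumptions, since the real page's markup isn't shown; adjust them to the actual table.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical container for one scraped row.
public sealed class RowData
{
    public string DetailsUrl { get; set; }
    public List<string> Cells { get; set; }
}

public static class TableScraper
{
    // Parses the HTML string and returns one RowData per table row that has <td> cells.
    public static List<RowData> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<RowData>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var trs = doc.DocumentNode.SelectNodes("//table//tr[td]");
        if (trs == null) return rows;

        foreach (var tr in trs)
        {
            var cells = tr.SelectNodes("./td");
            // The first cell is expected to hold the details link.
            var link = cells[0].SelectSingleNode(".//a[@href]");

            rows.Add(new RowData
            {
                DetailsUrl = link?.GetAttributeValue("href", ""),
                // The remaining cells carry the basic information.
                Cells = cells.Skip(1)
                             .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                             .ToList()
            });
        }

        return rows;
    }
}
```

The html argument can come straight from `DownloadString`, so no browser control is involved and no images or scripts are fetched.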

Depending on how you log in to the website, you need something like this to work with WebClient:

WebClient client = new WebClient();
client.Credentials = new NetworkCredential("Username", "Password");
string pageData = client.DownloadString("https://stackoverflow.com/");
