
Get HTML code from a website that has a loading page in C#

I am using the code from this post, "Get HTML code from website in C#", to save the HTML in a string:

// Requires: using System.IO; using System.Net; using System.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// The using blocks release the response and streams even if reading throws.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // Fall back to UTF-8 when the server declares no character set.
        Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
            ? Encoding.UTF8
            : Encoding.GetEncoding(response.CharacterSet);

        using (Stream receiveStream = response.GetResponseStream())
        using (StreamReader readStream = new StreamReader(receiveStream, encoding))
        {
            string data = readStream.ReadToEnd();
            msgBox.Text = data;
        }
    }
}

However, the page I am trying to read has a temporary loader page. How can I get around this, so that the HTML is saved only after the real page has actually loaded?

Best regards

"the page I am trying to read has a temporary loader page"

It all depends on what that means and how that "temporary loader page" works. For example, if that page makes a request to the destination page (whether from JavaScript code or an HTML META redirect), then that request is what you need to capture. Currently you're reading from a given URL:

(HttpWebRequest)WebRequest.Create(url)

This is essentially making a GET request to that URL and reading the response. But based on your description, it sounds like that's the wrong URL. It sounds like there's a second URL which contains the actual information you're looking for.

Given that, you essentially have two options:

  1. Determine what that other URL is manually, by visiting the page and inspecting the requests in your browser, and use that as the value of url in your code.
  2. Determine how that other URL is itself determined by the page code of the first URL (is it embedded somewhere in the page source?), parse it out of the response you get from the first url value, and make a second request to the new URL.
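
As a rough illustration of the second option, assume the loader page redirects through an HTML META refresh tag (one of the mechanisms mentioned above; a JavaScript redirect would need different parsing). The sketch below pulls the target URL out of the first response. LoaderPageParser and ExtractMetaRefreshUrl are hypothetical names, and the regular expression only covers the common attribute order with an unquoted URL, so treat it as a starting point rather than a robust HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoaderPageParser
{
    // Matches e.g. <meta http-equiv="refresh" content="5; url=/real-page">
    private static readonly Regex MetaRefresh = new Regex(
        @"<meta[^>]*http-equiv\s*=\s*[""']?refresh[""']?[^>]*content\s*=\s*[""']\s*\d+\s*;\s*url\s*=\s*([^""'>]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the absolute redirect target, or null when no META refresh is found.
    public static Uri ExtractMetaRefreshUrl(string html, Uri pageUri)
    {
        Match match = MetaRefresh.Match(html);
        if (!match.Success)
            return null;

        // The target may be relative, so resolve it against the loader page's own URL.
        return new Uri(pageUri, match.Groups[1].Value.Trim());
    }
}
```

If the method returns a URL, it can be passed to WebRequest.Create for the second request; if it returns null, the loader page is probably using some other mechanism and the browser's network inspector is the better place to look.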

Clearly the first option is a lot easier. The second is only necessary if that second URL changes with each visit or is expected to change frequently over time. If that's the case, then you'd basically have to reverse-engineer how the website performs the second request so that you can perform it as well.

Web scraping can get complicated pretty quickly, and often turns into a game of cat and mouse (even an unintentional one, with neither side aware of the other) between the person scraping the content and the person hosting the content (who might not want it to be scraped).

Why don't you use a web browser and a delay:

await Task.Delay(n)
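
One way to read this suggestion without hosting a browser control is to request the page again after a pause, until the loader markup is gone. This is only a sketch under assumptions: it helps only when the server itself starts returning the real page at the same URL. If the loader is replaced by JavaScript running in the browser, the raw HTML never changes and a real browser engine is needed instead. DelayedFetcher, FetchWhenReadyAsync, and the isLoaderPage check are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

public static class DelayedFetcher
{
    // Calls fetch repeatedly, waiting between attempts, until the result
    // no longer looks like the loader page or the attempts run out.
    public static async Task<string> FetchWhenReadyAsync(
        Func<Task<string>> fetch,
        Func<string, bool> isLoaderPage,
        int maxAttempts,
        TimeSpan delay)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        string html = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            html = await fetch();
            if (!isLoaderPage(html))
                return html; // the real content arrived

            if (attempt < maxAttempts)
                await Task.Delay(delay); // give the site time before retrying
        }

        return html; // still the loader page; the caller decides what to do
    }
}
```

Here fetch could wrap a call such as HttpClient.GetStringAsync(url), and isLoaderPage could test for a marker string that you have seen in the loader page's source; both are assumptions about the target site rather than anything stated in the question.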

