How to get the text of a website using C#?

I'm trying to get only the visible text of a web page, without the HTML markup.

I have this code:

// Requires: using System; using System.IO; using System.Net; using System.Text;
HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create("http://www.google.com");
myRequest.Method = "GET";
string result;
// The using statements dispose the response and the reader even if reading throws.
using (WebResponse myResponse = myRequest.GetResponse())
using (StreamReader sr = new StreamReader(myResponse.GetResponseStream(), Encoding.UTF8))
{
    result = sr.ReadToEnd();
}
Console.WriteLine(result);

This gives me the text, but it also includes all of the HTML source. How can I strip the markup and keep only the text?

I suggest using an HTML parser such as the HTML Agility Pack. Once the document is loaded into it, you can extract the text from the top-level node through its InnerText property.
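As a minimal sketch of that suggestion, the snippet below loads an HTML string into the HTML Agility Pack and prints the text of the document node. It assumes the HtmlAgilityPack NuGet package is installed. The inline `html` string is a made-up stand-in for the `result` downloaded in the question. Removing the script and style nodes first, and decoding entities afterwards, are additions of mine and not part of the original answer.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

class Program
{
    static void Main()
    {
        // Hypothetical sample markup; in the question this would be the downloaded `result` string.
        string html = "<html><head><style>p { color: red; }</style></head>" +
                      "<body><h1>Title</h1><p>Hello &amp; welcome</p>" +
                      "<script>var x = 1;</script></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText may include the contents of <script> and <style> elements,
        // so remove those nodes before reading the text.
        // SelectNodes returns null when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // Decode entities such as &amp; into plain characters.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        Console.WriteLine(text);
    }
}
```

To load straight from a URL instead of a string, `new HtmlWeb().Load(url)` returns an `HtmlDocument` that can be processed the same way.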

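If you use PuppeteerSharp, a headless browser loads the page for you, so you do not have to write the HTTP request code yourself. Because the page is actually rendered, this also picks up text produced by JavaScript.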

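// Requires: using System; using PuppeteerSharp; (NuGet package "PuppeteerSharp"). Run inside an async method.
// Downloads a Chromium build on first run. Newer PuppeteerSharp releases may expect DownloadAsync() with no argument.
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var headlessBrowser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});
var webPage = await headlessBrowser.NewPageAsync();
string url = "http://www.google.com"; // URL from the question; replace with the page you want
await webPage.GoToAsync(url);
// document.body.innerText returns the rendered text of the page body, not the markup.
var pageContent = await webPage.EvaluateExpressionAsync<string>("document.body.innerText");
await headlessBrowser.CloseAsync();
Console.WriteLine(pageContent);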

The code can be streamlined further, but this is the basic gist of it.

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.
