How to get the text of a website using C#?

Question

I'm trying to get the visible text of a web page, without any of the HTML markup.

I have this code:

HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create("http://www.google.com");
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
StreamReader sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8);
string result = sr.ReadToEnd();
sr.Close();
myResponse.Close();
Console.WriteLine(result);

This gives me the text, but it also includes all of the HTML source. How can I strip out the markup and keep only the text?

Answer 1

I suggest using an HTML parser such as the HTML Agility Pack. Once the document is loaded into it, you can extract the text from the root node (DocumentNode) using its InnerText property.
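A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed. The PageText class and its Extract method are hypothetical names used only for illustration. Because InnerText also returns the contents of script and style elements, the sketch removes those nodes first; that step is an addition, not part of the original suggestion.

```csharp
// Assumes the HtmlAgilityPack NuGet package is referenced by the project.
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

public static class PageText
{
    // Hypothetical helper: returns the text content of an HTML string.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText includes the contents of <script> and <style>,
        // so drop those nodes first. SelectNodes returns null when nothing matches.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null)
        {
            foreach (var node in hidden.ToList())
            {
                node.Remove();
            }
        }

        // Decode entities such as &amp; into plain characters.
        return WebUtility.HtmlDecode(doc.DocumentNode.InnerText).Trim();
    }

    public static void Main()
    {
        // Small inline sample so the sketch runs without network access.
        string html = "<html><head><style>p { color: red; }</style></head>"
                    + "<body><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>";
        Console.WriteLine(Extract(html));
    }
}
```

With the code from the question, you would pass the downloaded string to it, for example PageText.Extract(result).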

Answer 2

If you use PuppeteerSharp, a headless browser loads the page for you, so you don't have to write the HTTP request code yourself.

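// Requires the PuppeteerSharp NuGet package and: using PuppeteerSharp;
// Download a Chromium build on first run. DefaultChromiumRevision is
// version-dependent; check the API of your installed PuppeteerSharp release.
await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});
var page = await browser.NewPageAsync();
await page.GoToAsync("{URL HERE}"); // replace the placeholder with the page URL
// document.body.innerText returns the rendered text without markup.
var pageContent = await page.EvaluateExpressionAsync<string>("document.body.innerText");
await browser.CloseAsync();
Console.WriteLine(pageContent);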

You can streamline this code further, but this is the basic idea.

2 answers

solution1
4 2012-01-15 12:13:42

solution2
0 2023-06-09 04:56:57
