How to get only plain text from HTML using C#?

Question

Hi guys.

I'm trying to create an app that finds the most frequently used words in a string. In my case, the string is HTML. I can already get the HTML from a URI, for example "https://www.bbc.com/news/world-middle-east-57327591":

var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);

The html variable contains the same HTML as the page source. That works fine.

But how do I get rid of all the styles, scripts, and other additional information, and get only the plain text in a string variable?

I want my application to work not only for BBC pages, but for any HTML I can get from the web. My idea is that I should get the text from every element, such as <div>, <p>, <b>, <i>, <a>, because not all of the text is stored in <p> elements.

Answer 1

As per this answer, try the following:


// Requires: using System.Net.Http; using System.Text.RegularExpressions;
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Create a regex pattern that matches every HTML tag
string pattern = @"<(.|\n)*?>";
// Replace every tag matched by that regex with nothing.
// Note: this strips the tags only; the contents of <script> and <style> remain in the text.
var plainText = Regex.Replace(html, pattern, string.Empty);
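Since the question also asks to drop styles and scripts, here is a minimal sketch of how the regex approach could be extended. The class name HtmlText and the method ToPlainText are made up for illustration and are not part of any library. It assumes reasonably well-formed markup; regular expressions cannot handle every HTML page, so a real HTML parser is likely to be more robust for arbitrary sites.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // 1. Drop <script>, <style> and <noscript> elements together with their contents.
        var text = Regex.Replace(
            html,
            @"<(script|style|noscript)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // 2. Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // 3. Replace every remaining tag with a space so neighbouring words do not merge.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // 4. Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // 5. Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the html variable from the snippet above, the call would be var plainText = HtmlText.ToPlainText(html);. Tags are replaced with a space rather than an empty string so that text from adjacent elements stays separated, which matters when counting words afterwards.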
