
How to get only plain text from HTML using C#?

Hi guys.


I'm trying to create an app that finds the most frequently used words in a string. In my case, the string is HTML. I can already get the HTML from a URI, for example "https://www.bbc.com/news/world-middle-east-57327591":


var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);

The html variable contains the same HTML as the page source, which is fine.

But how do I get rid of all the styles, scripts, and other additional information, and get only the plain text in a string variable?

I want my application to work not only for BBC pages, but for any HTML I can fetch from the web. My idea is that I should get the text from every element, such as <div>, <p>, <b>, <i>, and <a>, because not all of the text is stored in <p>.

As per this answer, try the following:


// Needs: using System.Net.Http; using System.Text.RegularExpressions;
var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Regex pattern that matches any HTML tag
string pattern = @"<(.|\n)*?>";
// Replace every tag matched by the pattern with an empty string
return Regex.Replace(html, pattern, string.Empty);
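
Note that stripping tags alone removes the tags themselves but leaves whatever sits between <script>...</script> and <style>...</style> in the result, and entities such as &amp; stay encoded. Below is a minimal sketch of one way to handle that with only the base class library. The class name HtmlText and the method ToPlainText are made up for this illustration, and the regex approach is an approximation: it can misfire on a literal ">" inside an attribute value or on a stray "<" in the text, so a real HTML parser would be more robust for arbitrary pages.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlText
{
    // Matches <script> and <style> elements and HTML comments, including their contents.
    private static readonly Regex NonContent = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag. [^>] also matches newlines, so multi-line tags are covered.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Matches runs of whitespace so they can be collapsed to a single space.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        // 1. Drop scripts, styles, and comments together with their contents.
        string text = NonContent.Replace(html, " ");
        // 2. Replace each remaining tag with a space so adjacent blocks do not run together.
        text = Tag.Replace(text, " ");
        // 3. Decode entities such as &amp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);
        // 4. Collapse whitespace and trim the ends.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

With the html variable from above, this would be called as HtmlText.ToPlainText(html). Tags are replaced with a space rather than an empty string so that text from neighbouring elements (for example </p><p>) is not glued into one word, which matters for counting word frequencies. The trade-off is that a word split by inline markup, such as wor<b>ld</b>, comes out as two words.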

