C# extracting html only

Question

Basically I have a webpage with embedded CSS and JavaScript, and what I want to do is extract only the HTML itself: text, tables, images and so on.

So far I have the whole web page stored in a string called "html". The content of this page is just the Facebook homepage, for example, but as you will see it includes all the scripts and other embedded stuff, which I don't want.

   // HTMLEdit holds the web page I chose to load
   string html = HTMLEdit.DocumentText;
   string result = "this I want to contain only the <head>, <body>, <foot>";

I am only interested in displaying the result, which should contain only HTML. I don't want the JavaScript, CSS or anything else.

I have looked at the HTML Agility Pack, but there's no documentation on their website for doing this, and this is my first ever C# project, so excuse my ignorance if I don't make sense.

Answer 1

See this question: "HTML Agility Pack strip tags NOT IN whitelist".

Maybe adapt that answer, and drop link and script tags.
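
For what it's worth, here is a minimal sketch of what "drop link and script tags" could look like with the HTML Agility Pack. It is not the code from the linked question; the class and method names (HtmlCleaner, StripScriptsAndStyles) are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. It also removes style elements, which is an assumption on my part since the question asks for no CSS.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, named here only for illustration.
public static class HtmlCleaner
{
    public static string StripScriptsAndStyles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the unwanted elements first; ToList() takes a snapshot
        // so nodes can be removed safely while looping.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "link")
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // Whatever markup is left, serialized back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

With the variables from the question, the call would be something like `string result = HtmlCleaner.StripScriptsAndStyles(HTMLEdit.DocumentText);`. Note that this only removes whole elements; inline style="..." attributes and onclick-style handlers would still be in the output and would need a separate pass over the attributes.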

1 answer

solution1
2 ACCEPTED 2012-03-31 13:46:13
