C#: extract only the HTML

Question

Basically, I have a web page with embedded CSS and JavaScript, and I want to extract only the HTML itself: text, tables, images, and so on.

So far I have the whole web page stored in a string called "html". The content is just the Facebook homepage, for example, but as you will see, it includes all the scripts and other embedded stuff that I don't want.

   // HTMLEdit holds the web page I chose to load
   string html = HTMLEdit.DocumentText;
   string result = "this i want to only contain the <head>,<body>,<foot>.";

I am only interested in displaying the result, which should contain only HTML; I don't want the JavaScript, the CSS, or anything else.

I have looked at the HTML Agility Pack, but there's no documentation on their website for doing this. This is my first ever C# project, so excuse my ignorance if I don't make sense.

Answer 1

See this question: "HTML Agility Pack strip tags NOT IN whitelist".

Maybe adapt that answer, and drop link and script tags.
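
As a rough illustration of that suggestion (not the code from the linked answer), here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is installed. The class and method names, `HtmlCleaner` and `StripScriptsAndStyles`, are made up for this example. It uses a simple blacklist that removes `script`, `style` and `link` elements rather than the whitelist approach of the linked question.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is referenced

public static class HtmlCleaner
{
    // Hypothetical helper: returns the markup with script, style and link
    // elements removed. Everything else is left untouched.
    public static string StripScriptsAndStyles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//link");
        if (unwanted != null)
        {
            // Copy to a list first so removing nodes does not disturb the enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

With the variables from the question, this could be called as `string result = HtmlCleaner.StripScriptsAndStyles(HTMLEdit.DocumentText);`. Inline `style="..."` attributes and event-handler attributes such as `onclick` are not touched by this sketch; removing those would need an extra pass over each node's attributes.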

Solution 1 (2, accepted, 2012-03-31 13:46:13)
