
C# extracting HTML only

Basically, I have a webpage with embedded CSS and JavaScript, and I want to extract only the HTML itself: the text, tables, images and so on.

So far I have the whole web page stored in a string called "html". The content of this page is just the Facebook homepage, for example, but as you will see, it includes all the scripts and other embedded stuff that I don't want.

   // HTMLEdit = the webpage I chose to store in here
   string html = HTMLEdit.DocumentText;
   string result = "this i want to only contain the <head>,<body>,<foot>.";

I am only interested in displaying the result, which should contain only HTML. I don't want the JavaScript, CSS or anything else.

I have looked at the HTML Agility Pack, but there's no documentation on their website for doing this. This is my first ever C# project, so excuse my ignorance if I don't make sense.

See this question: "HTML Agility Pack strip tags NOT IN whitelist".

You could adapt that answer so that it drops the <link> and <script> tags.
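Something along these lines might work as a starting point. It is only a rough sketch, assuming the HtmlAgilityPack NuGet package is referenced in the project. The names HtmlCleaner and StripScriptsAndStyles are made up for illustration, and removing <style> elements as well is an addition on my part, since the question mentions embedded CSS.

```csharp
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class HtmlCleaner
{
    // Element names to drop. <style> is an assumption beyond the
    // <link> and <script> tags mentioned above.
    private static readonly string[] Unwanted = { "script", "style", "link" };

    public static string StripScriptsAndStyles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the matching nodes first, then remove them, so the tree
        // is not modified while it is still being enumerated.
        var nodes = doc.DocumentNode
            .Descendants()
            .Where(n => Unwanted.Contains(n.Name))
            .ToList();

        foreach (var node in nodes)
        {
            node.Remove();
        }

        // Serialize whatever is left of the document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

With the variables from the question, the call would look like `string result = HtmlCleaner.StripScriptsAndStyles(html);`. Note that this only removes whole elements; inline `style="..."` attributes and event-handler attributes such as `onclick` are left untouched and would need a separate pass over each node's attributes.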
