How do you extract text from HTML without using third-party libraries?

Question

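// Needs System.Net and System.IO; _request and _response are presumably fields declared elsewhere.
_request = (HttpWebRequest)WebRequest.Create(url);
_response = (HttpWebResponse)_request.GetResponse();
string text;
using (StreamReader streamReader = new StreamReader(_response.GetResponseStream()))
{
    text = streamReader.ReadToEnd(); // raw HTML, tags included
}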

The resulting text contains HTML tags. How can I get the text without the HTML tags?

Answer 1

How do you extract text from dynamic HTML without using 3rd party libraries? Simple: you invent your own HTML parsing library using the string parsing functions present in the .NET Framework.

Seriously, doing this by yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for different closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason why you need to write one yourself, just use HTML Agility Pack, and let that do the hard work for you.

Also, make sure you're not succumbing to Not Invented Here Syndrome.
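
To illustrate why hand-rolled parsing is fragile, here is a minimal sketch of the "string functions only" approach. The class name `NaiveTagStripper` and the method `Strip` are hypothetical names chosen for this example, not part of the original answer. It handles tidy markup, but the second call shows script content leaking into the output and text being lost at a stray `<`.

```csharp
using System;
using System.Text;

static class NaiveTagStripper
{
    // Drops '<', '>' and everything between them; copies all other characters.
    // Assumption: every '<' opens a tag and every '>' closes one, which real
    // pages frequently violate (scripts, comments, unescaped text).
    public static string Strip(string html)
    {
        var sb = new StringBuilder(html.Length);
        bool insideTag = false;
        foreach (char c in html)
        {
            if (c == '<') insideTag = true;
            else if (c == '>') insideTag = false;
            else if (!insideTag) sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Well-formed markup works: prints "Hello world"
        Console.WriteLine(Strip("<p>Hello <b>world</b></p>"));

        // The '<' inside the script is mistaken for a tag: prints "if (a Done"
        Console.WriteLine(Strip("<script>if (a < b) run();</script>Done"));
    }
}
```

Entities such as `&amp;` are also left undecoded by this sketch, which is one more case a real parser handles for you.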

Answer 2

You might want to take a look at HTMLAgilityPack.

It's a great free .NET library that lets you load and parse HTML. Enjoy.

Answer 3

Try this:

// Load throws an XmlException unless the page is well-formed XML (for example, valid XHTML).
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;

Be happy :)

Answer 4

This question has been asked before. There are a few ways to do it, including using a regular expression or, as pointed out by Adrian, the Agility Pack.

See this question: How can I strip HTML tags from a string in ASP.NET?

Answer 5

1) Do not use regular expressions. (See this great StackOverflow post: RegEx match open tags except XHTML self-contained tags)

2) Use HtmlAgilityPack. But I see you do not want 3rd party libraries, so we are forced to....

3) Use XmlReader. You can pretty much use the example code straight from MSDN, and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case simply write your output to a StreamWriter.
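
A minimal sketch of the XmlReader approach from point 3 might look like the following. The class name `XmlTextExtractor` and the method `ExtractText` are hypothetical, and the sketch collects output in a `StringBuilder` rather than the StreamWriter the answer mentions, purely to keep the example self-contained. It assumes the input is well-formed XML (such as XHTML).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlTextExtractor
{
    // Walks the document node by node and keeps only text nodes.
    // Assumption: the markup is well-formed XML; otherwise Read() throws XmlException.
    public static string ExtractText(string xhtml)
    {
        var output = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                // Elements, attributes, comments, whitespace-only nodes, etc. are skipped.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    output.Append(reader.Value);
                }
            }
        }
        return output.ToString();
    }

    static void Main()
    {
        // Prints "Hello world"
        Console.WriteLine(ExtractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

To write to a file as the answer suggests, the `StringBuilder` could be swapped for a `StreamWriter` and `Append` for `Write`. Pages with unclosed or mismatched tags will throw an `XmlException`, so this only suits input known to be well-formed.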

5 answers

Answer 1
3 2011-11-29 22:29:10

Answer 2
2 accepted 2011-11-29 20:55:12

Answer 3
1 2018-03-08 10:50:18

Answer 4
1 2011-11-29 20:59:36

Answer 5
1 2011-11-29 22:02:01
