
How can I extract text from HTML without using third-party libraries?

// url holds the address of the page to download
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
string text;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader streamReader = new StreamReader(response.GetResponseStream()))
{
    text = streamReader.ReadToEnd();
}

The resulting text still contains HTML tags. How can I get the text without the HTML tags?

How do you extract text from dynamic HTML without using third-party libraries? Simple: you invent your own HTML parsing library using the string parsing functions present in the .NET Framework.

Seriously, doing this yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for inconsistent closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason to write a parser yourself, just use HTML Agility Pack and let it do the hard work for you.

Also, make sure you're not succumbing to Not Invented Here Syndrome.
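To make the trade-off concrete, here is a minimal sketch of what a hand-rolled approach might look like using only the .NET base class library. The class and method names (HtmlText, StripTags) are hypothetical and not from the question. It simply drops everything between '<' and '>' and then decodes entities; it is not a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text;

public static class HtmlText
{
    // Hypothetical helper: copies every character that sits outside a <...> tag.
    public static string StripTags(string html)
    {
        var sb = new StringBuilder(html.Length);
        bool insideTag = false;
        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;        // start skipping at every '<'
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;       // resume copying after the closing '>'
            }
            else if (!insideTag)
            {
                sb.Append(c);            // plain text outside any tag
            }
        }
        // Turn entities such as &amp; back into characters.
        return WebUtility.HtmlDecode(sb.ToString());
    }
}
```

For example, HtmlText.StripTags("<p>Hello <b>world</b> &amp; more</p>") returns "Hello world & more". The weaknesses show up quickly, though: the contents of script and style elements are kept as if they were text, and a literal '<' in the page text (as in "a < b") makes the scanner swallow everything up to the next '>'. Those are exactly the cases a real parser handles for you.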

You might want to take a look at HTMLAgilityPack.

It's a great free .NET library that lets you load and parse HTML. Enjoy.

Try this:

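// Works only if the page is well-formed XML (for example XHTML);
// Load throws an XmlException for typical real-world HTML.
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;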

Be happy :)
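A caveat on the XmlDocument approach: it only works when the markup is well-formed XML. The sketch below illustrates this with a hypothetical helper (XmlTextExtractor.TryGetInnerText is an assumed name, not part of the answer) that parses a string with LoadXml and returns null when the parser rejects the input.

```csharp
using System;
using System.Xml;

public static class XmlTextExtractor
{
    // Hypothetical helper: returns the concatenated text of the document,
    // or null when the markup is not well-formed XML.
    public static string TryGetInnerText(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
        }
        catch (XmlException)
        {
            // Unclosed tags, undeclared entities such as &nbsp;, and similar
            // HTML habits all end up here.
            return null;
        }
        return doc.InnerText;
    }
}
```

With well-formed input such as "<html><body><p>Hello <b>world</b></p></body></html>" this returns "Hello world". With ordinary HTML such as "<p>line one<br>line two</p>" the unclosed br element makes the parser throw, so the helper returns null. Pages fetched from the web often fall into the second group.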

This question has been asked before. There are a few ways to do it, including using a regular expression or, as pointed out by Adrian, the Agility Pack.

See this question: How can I strip HTML tags from a string in ASP.NET?

1) Do not use regular expressions. (See this great StackOverflow post: RegEx match open tags except XHTML self-contained tags.)

2) Use HtmlAgilityPack. But I see you do not want third-party libraries, so we are forced to...

3) Use XmlReader. You can pretty much use the example code straight from MSDN and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case, simply write your output to a StreamWriter.
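Option 3 could be sketched roughly as follows. This is not the MSDN sample itself; the class and method names (XmlReaderTextExtractor, WriteText) are hypothetical, and the sketch assumes the markup is well-formed XML, since XmlReader throws an XmlException otherwise. The output parameter is a TextWriter, so a StreamWriter can be passed directly.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlReaderTextExtractor
{
    // Hypothetical helper: copies only the text nodes from 'input' to 'output'.
    public static void WriteText(TextReader input, TextWriter output)
    {
        var settings = new XmlReaderSettings
        {
            // Skip a DOCTYPE declaration instead of throwing (the default is Prohibit).
            DtdProcessing = DtdProcessing.Ignore
        };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (reader.Read())
            {
                // Elements, comments, whitespace-only nodes and so on are ignored.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    output.Write(reader.Value);
                }
            }
        }
    }
}
```

A possible usage, writing into a StringWriter for brevity:

```csharp
var result = new StringWriter();
XmlReaderTextExtractor.WriteText(
    new StringReader("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"),
    result);
// result.ToString() is "TitleHello world"
```

Note that text from adjacent elements is concatenated without separators, so you may want to write a space or newline between nodes depending on how the output will be used.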

