C# extract content from HTML document

I was wondering how I can do something similar to what Facebook does when a link is posted, or what link-shortening services do: fetch the title of the page and its content.

Example:

[Image: example]

My idea is to get only the plain text from a web page. For example, if the URL points to a newspaper article, how can I get only the text of the news story, as shown in the image? So far I have been trying to use HtmlAgilityPack, but I can never get the text clean.

Note that this app is for Windows Phone 7.

You're on the right track with HtmlAgilityPack.

If you want all the text of the page, use the InnerText property of the document node. But I suggest you go with the meta description tag (if available).
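As a rough illustration of the InnerText route, here is a minimal sketch. It assumes the page's HTML has already been downloaded into a string; the class and method names (PageText, ExtractPlainText) are made up for this example. It removes script, style, and comment nodes first, because their contents otherwise end up in InnerText, which is a common reason the text never looks clean. It uses Descendants() with LINQ rather than XPath, on the assumption that XPath support may not be present in every HtmlAgilityPack build (worth checking for the Windows Phone 7 one).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class PageText
{
    // Hypothetical helper: returns the text of an HTML string with
    // script, style, and comment nodes removed and whitespace collapsed.
    public static string ExtractPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Copy to a list first: removing nodes while enumerating
        // Descendants() would modify the tree being walked.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // InnerText can still contain entities such as &amp;, so decode them.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse runs of whitespace (newlines, tabs, spaces) into one space.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

This returns all remaining text on the page, including menus and footers. It does not try to isolate the article body, which is why the meta description is often the more practical choice for a link preview.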

EDIT - Go for the meta description. I believe that's what Facebook is doing:

[Image: Facebook link sample]

[Image: Site source]
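The meta description approach could be sketched roughly as follows. The LinkPreview and LinkPreviewReader names are hypothetical, and the HTML is again assumed to be already downloaded into a string. The sketch reads the title element and the content attribute of the meta tag whose name is "description", and returns null for either value when the page does not provide it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for the two values shown in a link preview.
public sealed class LinkPreview
{
    public string Title { get; set; }
    public string Description { get; set; }
}

public static class LinkPreviewReader
{
    public static LinkPreview FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First <title> element, if the page has one.
        var titleNode = doc.DocumentNode.Descendants("title").FirstOrDefault();

        // First <meta name="description" ...>; the name is compared
        // case-insensitively because pages are inconsistent about casing.
        var metaNode = doc.DocumentNode.Descendants("meta")
            .FirstOrDefault(m => string.Equals(
                m.GetAttributeValue("name", ""),
                "description",
                StringComparison.OrdinalIgnoreCase));

        return new LinkPreview
        {
            Title = titleNode == null
                ? null
                : HtmlEntity.DeEntitize(titleNode.InnerText).Trim(),
            Description = metaNode == null
                ? null
                : HtmlEntity.DeEntitize(
                      metaNode.GetAttributeValue("content", "")).Trim()
        };
    }
}
```

A caller would pass in the downloaded page source and fall back to the plain-text route when Description comes back null.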
