How to find the content root tag on a webpage

I need to find the main content of a webpage and extract it so that I can save it in a database.

The Evernote Web Clipper and similar plugins (I use them in Chrome) make a webpage easier to read: they find the main content of the page, remove all other distractions, and reformat the text in a larger font size and a more readable font family.

I'd like to build a similar feature, because I need to save particular pages of a website (the terms of service and the privacy policy) and keep only their main content, removing sidebars, headers, and so on.

I'm going to build this in PHP, using Symfony's Crawler component, but I cannot figure out how to evaluate each individual tag to find the main content of the page being processed.

Any ideas?

The one idea that comes to mind is to count the p tags inside each tag and compute their average length: the more p tags a tag contains, and the higher their average length, the more likely that tag is to hold the main content. That could give me some sort of guidance...
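The paragraph-counting idea can be sketched roughly as follows. This is only an illustration, written in Python with the standard library rather than PHP with Symfony's Crawler; the names `ParagraphScorer`, `rank_containers` and `SAMPLE` are hypothetical, and the sketch assumes reasonably well-formed HTML with explicit closing tags. Each element is scored by the number of its direct p children multiplied by their average text length.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text_len = 0    # text length, only meaningful for <p> nodes
        self.p_lengths = []  # text lengths of this element's direct <p> children

    def label(self):
        # Human-readable name such as "div#content" or "div.post".
        if self.attrs.get("id"):
            return f"{self.tag}#{self.attrs['id']}"
        classes = (self.attrs.get("class") or "").split()
        return f"{self.tag}.{classes[0]}" if classes else self.tag


class ParagraphScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements
        self.nodes = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs)
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                closed = self.stack[i]
                del self.stack[i:]
                if tag == "p" and self.stack:
                    self.stack[-1].p_lengths.append(closed.text_len)
                break

    def handle_data(self, data):
        text_len = len(data.strip())
        if not text_len:
            return
        # Credit the text to the innermost open <p>, if there is one.
        for node in reversed(self.stack):
            if node.tag == "p":
                node.text_len += text_len
                break


def rank_containers(html):
    """Return (label, p_count, average_p_length) tuples, best candidate first."""
    parser = ParagraphScorer()
    parser.feed(html)
    parser.close()
    ranked = []
    for node in parser.nodes:
        if node.p_lengths:
            count = len(node.p_lengths)
            ranked.append((node.label(), count, sum(node.p_lengths) / count))
    # Score = count * average length (i.e. total paragraph text in the element).
    ranked.sort(key=lambda row: row[1] * row[2], reverse=True)
    return ranked


SAMPLE = """
<html><body>
  <div id="header"><p>Home</p></div>
  <div id="sidebar"><p>Links</p><p>About</p></div>
  <div id="content">
    <p>These terms govern your use of the service and apply to every visitor.</p>
    <p>By creating an account you agree to the <a href="/privacy">privacy policy</a> described here.</p>
  </div>
</body></html>
"""

if __name__ == "__main__":
    for label, count, average in rank_containers(SAMPLE):
        print(label, count, round(average, 1))
```

Only direct p children are counted, so that body or html does not win simply by containing everything. Real pages would probably need extra rules, for example ignoring very short paragraphs or penalising link-heavy blocks.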

Search engine results drive content trends of websites. Search engines attempt to extract meaningful content to display relevant search results. As search engines evolve, web developers work to deliver content that ranks higher and higher. This has led to well-structured data on quality sites.

Most tools that extract meaningful content analyze the markup semantically. Useful search terms are "semantic markup" and "rich snippets".
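As a rough illustration of what analyzing the markup semantically could look like, the sketch below checks a page for a few common semantic signals before falling back to any heuristic: the HTML5 main element, the ARIA attribute role="main", the schema.org itemprop="articleBody" property, and the article element. The choice and order of these signals is an assumption, not something a particular tool is known to use, and the names `SemanticFinder`, `find_main_content` and `SAMPLE` are hypothetical. It is written in Python with the standard library only; the same checks could be expressed as selectors in Symfony's Crawler.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Signals to look for, strongest first (this ordering is an assumption).
PRIORITY = ["main", "role=main", "itemprop=articleBody", "article"]


def classify(tag, attrs):
    """Return the semantic signal carried by a start tag, or None."""
    attrs = dict(attrs)
    if tag == "main":
        return "main"
    if attrs.get("role") == "main":
        return "role=main"
    if "articleBody" in (attrs.get("itemprop") or "").split():
        return "itemprop=articleBody"
    if tag == "article":
        return "article"
    return None


class SemanticFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # names of currently open elements
        self.active = []  # captures in progress: (signal, depth, text parts)
        self.found = {}   # signal -> text of the first element carrying it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        signal = classify(tag, attrs)
        capturing = any(s == signal for s, _, _ in self.active)
        if signal and signal not in self.found and not capturing:
            self.active.append((signal, len(self.stack), []))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag
        while self.stack.pop() != tag:
            pass
        still_open = []
        for signal, depth, parts in self.active:
            if depth > len(self.stack):
                # The captured element has just been closed: store its text.
                self.found[signal] = " ".join("".join(parts).split())
            else:
                still_open.append((signal, depth, parts))
        self.active = still_open

    def handle_data(self, data):
        if self.stack and self.stack[-1] in ("script", "style"):
            return
        for _, _, parts in self.active:
            parts.append(data)


def find_main_content(html):
    """Return (signal, text) for the strongest signal present, else None."""
    finder = SemanticFinder()
    finder.feed(html)
    finder.close()
    for signal in PRIORITY:
        if signal in finder.found:
            return signal, finder.found[signal]
    return None


SAMPLE = """
<html><body>
  <header><nav>Home | About</nav></header>
  <main>
    <h1>Terms of Service</h1>
    <p>These terms govern your use of the service.</p>
  </main>
  <footer>Contact us</footer>
</body></html>
"""

if __name__ == "__main__":
    print(find_main_content(SAMPLE))
```

When none of these signals is present, `find_main_content` returns None, and a text-density heuristic such as counting paragraphs would be the natural fallback.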
