
Algorithm to find the 'article' in a webpage?

Original post: 2012-09-13 08:07:26 | Tags: algorithm, html-content-extraction

Some browser plugins, like Readability, can extract the 'article' from a webpage. Does anyone have an idea how to do it? What's the difference between the real article and ads or comments?

1 answer

Well, it depends on how you want to define "real articles"...

Taking HTML5 into consideration, a webpage can be built from semantic tags. Pages no longer have to be built only from elements like <div>, which carry no semantic meaning at all. In HTML5 you may use <section>, <article>, <header> and so on. Those elements can give an application a pretty good sense of what the main content of a webpage is (e.g. print the <article> elements and skip the <nav> elements...).
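To make the "print <article>, skip <nav>" idea concrete, here is a minimal sketch using only Python's standard html.parser module. The class and function names (ArticleTextExtractor, extract_article_text) are made up for illustration, and it assumes the page actually uses these tags in a reasonably well-formed way.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <article>, ignoring anything inside <nav>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # number of <article> elements currently open
        self.nav_depth = 0      # number of <nav> elements currently open
        self.chunks = []        # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "nav":
            self.nav_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag == "nav" and self.nav_depth:
            self.nav_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside an <article> and not inside a <nav>.
        if text and self.article_depth and not self.nav_depth:
            self.chunks.append(text)


def extract_article_text(html):
    """Hypothetical helper: returns the article text as a single string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<body><nav>Home</nav>"
        "<article><h1>Title</h1><nav>Prev</nav><p>Body text.</p></article>"
        "<footer>Ads</footer></body>"
    )
    print(extract_article_text(page))
```

This does nothing clever about scripts, styles, or broken markup; it only shows how far the semantic tags alone can take you.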

Of course, not many pages use those tags yet. Furthermore, the tags might get abused and lose their meaning. In that case I'd fall back on some statistics, e.g. selecting the largest elements in an HTML document. Moreover, if you have to scrape a webpage, you could use a modification of some pattern-matching algorithm, DIPRE for instance.
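As a rough illustration of the "largest element" statistic, the sketch below credits each piece of text to its nearest enclosing container and then picks the container holding the most text. The set of container tags, the names LargestBlockFinder and largest_block, and the choice of "text length" as the measure of size are all assumptions of this example, not an established algorithm; real extractors such as Readability use richer scoring.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "section", "article", "main", "td"}  # assumed candidate tags
IGNORED = {"script", "style"}                             # their text is not content


class LargestBlockFinder(HTMLParser):
    """Assigns text to its nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []         # open containers, innermost last: (tag, chunks)
        self.blocks = []        # closed containers as (tag, text)
        self.ignored_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag in CONTAINERS:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        if tag in IGNORED and self.ignored_depth:
            self.ignored_depth -= 1
        elif tag in CONTAINERS and self.stack:
            # Assumes well-formed nesting: the innermost open container closes.
            open_tag, chunks = self.stack.pop()
            self.blocks.append((open_tag, " ".join(chunks)))

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and not self.ignored_depth:
            self.stack[-1][1].append(text)


def largest_block(html):
    """Hypothetical helper: returns (tag, text) of the container with the most text."""
    finder = LargestBlockFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return None
    return max(finder.blocks, key=lambda block: len(block[1]))


if __name__ == "__main__":
    page = (
        "<div><div>Menu | Home | About</div>"
        "<div><p>This is the long main text of the page, much longer than the rest.</p>"
        "<p>Second paragraph.</p></div>"
        "<div>Ad: buy now</div></div>"
    )
    print(largest_block(page))
```

A long comment thread can easily beat a short article under this measure, so in practice you would combine it with other signals such as link density or the number of <p> children.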