
Get only the relevant portion of a website

How do Evernote's Web Clipper plugin and the Announcify plugin extract only the relevant article/post/content part of a page? Here is a screenshot from the Evernote plugin:

[Screenshot from the Evernote plugin]

No matter which website you visit, and however much its layout differs from others, these plugins are always able to pick out the article/post/content part of the page.

Every website has a different layout: some have a sidebar, some don't. The tags used for the main/article/content part differ too: some sites use the HTML5 <article> or <section> elements, others use <h1> > <p>, some use <h2> > <p>, and others use none of these. So there are many different combinations of tags as well as layouts.

Can anyone suggest a way to get the main article/post/content via JavaScript or PHP?

You can do simple DOM parsing and search for the <div>s and <p>s that contain the most text (text, not HTML code!). Whichever method you choose for determining where the content is, you should start from DOM parsing, so let's have a look at DOM parsing PHP libraries.
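The text-counting idea above can be sketched in JavaScript, one of the two languages the question allows. This is only a minimal illustration under assumptions that are not in the original answer: the function name `findMainContainer` is hypothetical, and scoring each container by the text of its direct <p> children is just one simple way to "search for the elements containing more text".

```javascript
// Heuristic sketch (an assumption, not the answer's exact method):
// score each container by the amount of plain text held in its direct
// <p> children, then return the best-scoring container.
// `paragraphs` is any iterable of objects exposing `textContent` and
// `parentNode`, which real DOM elements do.
function findMainContainer(paragraphs) {
  const scores = new Map();
  for (const p of paragraphs) {
    const parent = p.parentNode;
    if (!parent) continue;
    // Count text only, not markup; collapse runs of whitespace first.
    const textLength = p.textContent.replace(/\s+/g, ' ').trim().length;
    scores.set(parent, (scores.get(parent) || 0) + textLength);
  }

  let best = null;
  let bestScore = 0;
  for (const [node, score] of scores) {
    if (score > bestScore) {
      best = node;
      bestScore = score;
    }
  }
  return best; // null when no paragraph contains any text
}

// Browser usage (assumes a loaded page):
// const main = findMainContainer(document.querySelectorAll('p'));
```

A sidebar full of short links scores low under this heuristic, while the container holding the article's paragraphs scores high. Real pages will need more tuning, for example penalising link-heavy blocks.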

Anyway, you can start from this:

http://w-shadow.com/blog/2008/01/25/extracting-the-main-content-from-a-webpage/

It looks quite good, and it gives technical explanations if you want to write something of your own.

Most blog engines give the div that wraps the main content an id of 'content'.

  • In JavaScript (with jQuery) you would just do $('#content')
  • In PHP you would call getElementById('content') on a DOMDocument instance (documented as DOMDocument::getElementById).
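For pages without jQuery, the same lookup can be done with the standard DOM API. The sketch below is illustrative only: the function name `findContentElement` and the `candidateIds` parameter are hypothetical, and the fallback to an <article> element is an assumption based on the HTML5 tags mentioned in the question, not part of this answer.

```javascript
// Look up the main content element by a well-known id, falling back to
// the first <article> element. `doc` is a DOM Document (or anything that
// exposes getElementById and querySelector).
function findContentElement(doc, candidateIds = ['content']) {
  for (const id of candidateIds) {
    const el = doc.getElementById(id);
    if (el) return el;
  }
  // Assumption: sites without a matching id may still use HTML5 <article>.
  // querySelector returns null when nothing matches.
  return doc.querySelector('article');
}

// Browser usage (assumes a loaded page):
// const content = findContentElement(document);
```

This only works on sites that follow the naming convention, so a `null` result means a text-based heuristic is still needed.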
