
Find important text in arbitrary HTML using PHP?

I have some random HTML layouts that contain important text I would like to extract. I cannot just use strip_tags(), as that would leave a bunch of extra junk from the sidebar, footer, header, etc.

I found a method built in Python and I was wondering if there is anything like this in PHP.

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn't a novel idea, but it works!) The basic process works as follows:

  1. Parse the HTML code and keep track of the number of bytes processed.
  2. Store the text output on a per-line, or per-paragraph basis.
  3. Associate with each text line the number of bytes of HTML required to describe it.
  4. Compute the text density of each line by calculating the ratio of text to bytes.
  5. Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line's density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it's easier to implement!
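The fixed-threshold variant of the steps above can be sketched in PHP roughly as follows. This is only an illustration, not the Python script's actual code: the function names `line_densities()` and `extract_dense_text()` and the 0.5 threshold are assumptions, and it scores physical lines of the source, so tags that span several lines are not handled.

```php
<?php
// Hypothetical helper: score each line of raw HTML by text density,
// i.e. bytes of visible text divided by total bytes on that line.
function line_densities(string $html): array
{
    $rows = [];
    foreach (preg_split('/\R/', $html) as $line) {
        $total = strlen($line);
        if ($total === 0) {
            continue; // skip empty lines
        }
        // strip_tags() only sees one line at a time here.
        $text = trim(html_entity_decode(strip_tags($line)));
        $rows[] = ['text' => $text, 'density' => strlen($text) / $total];
    }
    return $rows;
}

// Keep lines whose density reaches the threshold (0.5 is an arbitrary starting point).
function extract_dense_text(string $html, float $threshold = 0.5): array
{
    $kept = [];
    foreach (line_densities($html) as $row) {
        if ($row['text'] !== '' && $row['density'] >= $threshold) {
            $kept[] = $row['text'];
        }
    }
    return $kept;
}

$html = <<<HTML
<div id="sidebar"><ul><li><a href="/a">Home</a></li><li><a href="/b">About</a></li></ul></div>
<p>This is the main article text, which is long compared with the markup around it.</p>
<div id="footer"><a href="/terms">Terms</a></div>
HTML;

// Only the paragraph line should survive; the link-heavy lines score low.
print_r(extract_dense_text($html));
```

Replacing the fixed threshold with the average density of the page, as the quoted text suggests, would only change the comparison in `extract_dense_text()`.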

Update: I started a bounty for an answer that could pull the main content from a random HTML template. Since I can't share the documents I will be using, just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.

  • phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.

UPDATE 2

  1. Many blogs make use of a CMS.
  2. The HTML structure of blogs is the same most of the time.
  3. Avoid common selectors like #sidebar, #header, #footer, #comments, etc.
  4. Avoid any widget by tag name: script, iframe.
  5. Clear well-known content such as:
    1. /\d+\scomment(?:[s])/im
    2. /(read the rest|read more).*/im
    3. /(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
    4. /[^a-z0-9]+/im
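Points 3 to 5 could be combined along these lines with PHP's built-in DOM extension instead of phpQuery. This is a hedged sketch: the function name `strip_noise()` is hypothetical, only two of the listed patterns are applied, and the comment-count pattern is loosened to `comments?` as an assumption so that "1 comment" is matched as well.

```php
<?php
// Hypothetical helper: drop widget tags and common layout containers,
// then remove well-known boilerplate phrases from the remaining text.
function strip_noise(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//script | //iframe'
           . ' | //*[@id="sidebar" or @id="header" or @id="footer" or @id="comments"]';

    // Collect the nodes first so removal does not disturb iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $text = $dom->documentElement->textContent;

    // Patterns adapted from the list above (assumption: "comments?" instead of "(?:[s])").
    $patterns = ['/\d+\s+comments?/i', '/(read the rest|read more).*/im'];
    $text = preg_replace($patterns, '', $text);

    // Collapse runs of whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<div id="header">My Blog</div>'
      . '<div class="post"><p>Hello world.</p><p>3 comments</p></div>'
      . '<script>var x = 1;</script><div id="footer">Copyright</div>';

echo strip_noise($html), "\n";
```

Matching on id alone is deliberately crude; real templates often use classes such as `.sidebar` as well, so the XPath expression would need extending per site.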

Search for well-known classes and IDs:

  • typepad.com .entry-content
  • wordpress.org .post-entry .entry .post
  • movabletype.com .post
  • blogger.com .post-body .entry-content
  • drupal.com .content
  • tumblr.com .post
  • squarespace.com .journal-entry-text
  • expressionengine.com .entry
  • gawker.com .post-body

  • Ref: The blog platforms of choice among the top 100 blogs


// Join the candidate selectors into one CSS selector group before querying.
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');

Search based on a common HTML structure that looks like this:

<div>
<h1|h2|h3|h4|a />
<p|div />
</div>

$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');

DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
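As a rough illustration of that approach, the sketch below loads a page with DOMDocument and uses DOMXPath to look for paragraphs inside containers carrying one of the well-known blog classes. The function name `find_by_classes()` and the sample markup are assumptions made for the example.

```php
<?php
// Hypothetical helper: return the text of <p> elements found inside the
// first container whose class attribute contains one of the given names.
function find_by_classes(string $html, array $classes): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate malformed markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($classes as $class) {
        // Standard XPath idiom for "element has this class token".
        $expr = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]//p',
            $class
        );
        $nodes = $xpath->query($expr);
        if ($nodes->length > 0) {
            $out = [];
            foreach ($nodes as $p) {
                $out[] = trim($p->textContent);
            }
            return $out;
        }
    }
    return []; // none of the candidate classes matched
}

$html = '<div class="sidebar"><p>Links</p></div>'
      . '<div class="post entry-content"><p>First.</p><p>Second.</p></div>';

print_r(find_by_classes($html, ['post-body', 'entry-content', 'post']));
```

The class names are tried in order, so the more specific ones (for example `post-body`) are best listed before generic ones such as `content`.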

Edit: wikied

I worked on a similar project a while back. It's not as complex as the Python script, but it will do a good job. Check out the Simple HTML PHP Parser:

http://simplehtmldom.sourceforge.net/

Depending on your HTML structure, and if you have IDs or classes in place, you can get a little more elaborate and use preg_match() to get specifically the information between a certain start and end tag. This means that you should know how to write regular expressions.
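A minimal sketch of the preg_match() idea, assuming the content sits in a `<div>` with a known id and contains no nested `<div>` elements (the non-greedy match would stop at the first closing tag). The function name `between_tags()` is hypothetical.

```php
<?php
// Hypothetical helper: grab the text between <div id="..."> and the next </div>.
// Regular expressions cannot balance nested tags, so this only suits flat markup.
function between_tags(string $html, string $id): ?string
{
    $pattern = '~<div[^>]*\bid="' . preg_quote($id, '~') . '"[^>]*>(.*?)</div>~is';
    if (preg_match($pattern, $html, $m) === 1) {
        return trim(strip_tags($m[1])); // drop any inline tags inside the match
    }
    return null; // no element with that id was found
}

$html = '<div id="nav">Menu</div><div id="content"><p>Body text</p></div>';
var_dump(between_tags($html, 'content'));
```

For anything beyond flat, predictable markup, a DOM parser is usually the safer choice.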

You can also look into a browser emulation PHP class. I've done this for page scraping, and it works well enough depending on how well formatted the DOM is. I personally like SimpleBrowser:
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html

I have developed an HTML parser and filter PHP package that can be used for that purpose.

It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations on HTML/XML code.

It was meant to deal with real-world pages, so it can handle malformed tags and data structures and preserve as much of the original document as possible.

One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.

All those filter classes are optional. You can chain them together the way you want, if you need any at all.

So, to solve your problem, I do not think there is already a specific solution for that in PHP anywhere, but a special filter class could be developed for it. Take a look at the package. It is thoroughly documented.

If you need help, just check my profile and mail me. I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.
