
Parsing specific content from a URL (webpage) in PHP, JavaScript

Original post: 2011-06-07 20:47:18 | Tags: php, javascript, html

I use some RSS feeds. Some of them don't have a description for their articles.

Rather than showing just the title with no description for those articles, I would like to show, for example, the first two paragraphs of the actual article.

I experimented with stripos and file_get_contents, but I have a problem. On most pages it works fine, but on other pages it grabs the first <p> tag (which can be, for example, a paragraph in the sidebar), and that text is irrelevant to the article mentioned in the RSS feed.

Any idea how to grab the main content from a URL strictly in PHP or JavaScript?

Thanks in advance.

2 Answers

The first idea that comes to mind is to strip the tags from within each p and then only use that paragraph if the length of its actual text is greater than a certain threshold. Maybe also check for a certain number of sentence punctuation marks ([.?.]); if that number isn't there, go on to the next paragraph.
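The length-and-punctuation check described above could be sketched in JavaScript roughly as follows. This is only a minimal illustration, not a tested extractor: the function name pickParagraphs, its option names, and the default thresholds are all hypothetical, the punctuation set [.?!] is an assumption about what the answer intended, and a regular expression is used as a rough stand-in for a real HTML parser.

```javascript
// Hypothetical helper: the name, options, and default thresholds are
// illustrative assumptions, not part of any library.
function pickParagraphs(html, { minLength = 80, minSentences = 2, limit = 2 } = {}) {
  const results = [];
  // Match each <p>...</p> block. A regex is only a rough stand-in for a
  // real HTML parser and will miss malformed or unclosed paragraphs.
  const paragraphPattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match;
  while (results.length < limit && (match = paragraphPattern.exec(html)) !== null) {
    // Strip nested tags, then collapse runs of whitespace.
    const text = match[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    // Count sentence-ending punctuation (assumed set: . ? !).
    const sentenceMarks = (text.match(/[.?!]/g) || []).length;
    // Keep the paragraph only if it is long enough and looks like prose.
    if (text.length >= minLength && sentenceMarks >= minSentences) {
      results.push(text);
    }
  }
  return results;
}

// Example call: short sidebar links and one-word paragraphs are skipped.
const sampleHtml =
  '<div id="sidebar"><p><a href="/">Home</a></p></div>' +
  "<p>This is the <b>first</b> real paragraph. It has two sentences.</p>" +
  "<p>Short.</p>" +
  "<p>Second real paragraph here. Also two sentences!</p>";
const picked = pickParagraphs(sampleHtml, { minLength: 30 });
```

The thresholds would need tuning per feed; a sidebar paragraph that happens to be long and well punctuated would still pass this filter.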

You may also want to try scraping, which lets you fetch a page and parse its contents. http://simplehtmldom.sourceforge.net/ has a jQuery-like syntax and should quickly allow you to get just the content you want.

Scraping comes with its own caveats: some sites may not look kindly on your harvesting of data and may block your attempts. You may want to look into the pluses and minuses of this method, but it is certainly powerful.

There's also info on scraping RSS feeds here: http://blog.5ubliminal.com/posts/rsscraping-scraping-rss-with-php-dom-xpath/ (I haven't tried it).

EDIT: Wrikken's link is better than mine. Some good alternatives there.