How to build a website with PHP that collects articles?

I have a quick question. I'm trying to build a website with PHP that collects articles from different blogs. How would I code this in PHP? Would I need some type of regex? All I need to do is grab the articles from specific pages, for example http://rss.news.yahoo.com/rss/education. Can anyone help? Thank you.

An RSS feed is XML, so you can use something like xml_parse_into_struct to begin parsing it. The examples in its PHP manual entry should be good enough to get you going.
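As a rough illustration of that approach, here is a minimal sketch that walks the flat array produced by xml_parse_into_struct and collects the children of each item element. The sample feed and the function name rss_items_from_struct are made up for this example; a real feed would be downloaded first (for instance with file_get_contents) and may carry more fields.

```php
<?php
// Hypothetical sample feed, inlined so the sketch runs without network access.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>http://example.com/articles/1</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/articles/2</link>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns one associative array per <item>.
function rss_items_from_struct($xml)
{
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    // Skip whitespace-only text between elements
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    if (!xml_parse_into_struct($parser, $xml, $values)) {
        throw new Exception('Could not parse feed');
    }

    $items   = array();
    $current = null; // non-null only while inside an <item>

    foreach ($values as $node) {
        if ($node['tag'] === 'item') {
            if ($node['type'] === 'open') {
                $current = array();
            } elseif ($node['type'] === 'close') {
                $items[] = $current;
                $current = null;
            }
        } elseif ($current !== null && $node['type'] === 'complete') {
            // A leaf element such as <title> or <link> inside the item
            $current[$node['tag']] = isset($node['value']) ? $node['value'] : '';
        }
    }

    return $items;
}
```

Calling rss_items_from_struct($rss) gives an array with one entry per item, keyed by tag name, which you could then store in a database.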

You need to write a parser for each site. Something like this:

class Parser_Article_SarajevoX extends Parser_Article implements Parser_Interface_Article {

    protected static $_url = 'http://www.sarajevo-x.com/';

    public static function factory($url)
    {
        return new Parser_Article_SarajevoX($url);
    }

    protected static function decode($string)
    {
        return iconv('ISO-8859-2', Kohana::$charset, $string);
    }

    /**
     * SarajevoX Article Parser constructor
     *
     * @param   string  article's url or uri
     */
    public function __construct($url)
    {
        $parsed = parse_url($url);

        if ($path = arr::get($parsed, 'path'))
        {
            // make url's and uri's path the same
            $path = trim($path, '/');

            $exploded = explode('/', $path);

            if (count($exploded) == 4)
            {
                list($this->cat_main, $this->cat, $nita, $this->id) = $exploded;
            }
            elseif (count($exploded) == 3)
            {
                list($this->cat, $nita, $this->id) = $exploded;
            }
            else
            {
                // Exception's second argument is an integer code, so build the message directly
                throw new Exception("Path not recognized: ".$url);
            }

            // @todo check if this article is already imported to skip getting HTML

            $html = HTML_Parser::factory(self::$_url.$path);

            $content = $html->find('#content-main .content-bg', 0);

            // @freememory
            $html = NULL;

            $this->title = self::decode($content->find('h1', 0)->innertext);

            // Loop through all inner divs and find the content
            foreach ($content->find('div') as $div)
            {
                switch ($div->class)
                {
                    case 'nadnaslov':

                        $this->suptitle = strip_tags(self::decode($div->innertext));

                    break;
                    case 'uvod':

                        $this->subtitle = strip_tags(self::decode($div->innertext));

                    break;
                    case 'tekst':

                        $pic_wrap = $div->find('div[id="fotka"]', 0);

                        if ($pic_wrap != FALSE)
                        {
                            $this->_pictures[] = array
                            (
                                'url'   =>  self::$_url.trim($pic_wrap->find('img', 0)->src, '/'),
                                'desc'  =>  self::decode($pic_wrap->find('div[id="opisslike"]', 0)->innertext),
                            );

                            // @freememory
                            $pic_wrap   = NULL;
                        }

                        $this->content  = strip_tags(self::decode($div->innertext));

                    break;
                    case 'ad-gallery' :

                        foreach ($div->find('div[id="gallery"] .ad-nav .ad-thumbs ul li a') as $a)
                        {
                            $this->_pictures[] = array
                            (
                                'url'   =>  self::$_url.trim($a->href, '/'),
                                'desc'  =>  self::decode($a->find('img', 0)->alt),
                            );

                            // @freememory
                            $a = NULL;
                        }

                    break;
                }
            }

            echo Kohana::debug($this);

            return;
        }

        // Exception's second argument is an integer code, so build the message directly
        throw new Exception("Path not recognized: ".$url);
    }

}
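The class above relies on the Kohana framework and an HTML_Parser wrapper that are not shown here. For readers without those, the same per-site idea can be sketched with PHP's built-in DOMDocument and DOMXPath. This is only an illustration: the sample markup, the function name parse_article_html, and the XPath queries are assumptions modelled on the selectors used above, and the character-set conversion is left out.

```php
<?php
// Hypothetical article page; a real one would be fetched from the site.
$html = '<html><body><div id="content-main"><div class="content-bg">'
      . '<h1>Sample title</h1>'
      . '<div class="uvod">Short <b>intro</b>.</div>'
      . '<div class="tekst">Body text.</div>'
      . '</div></div></body></html>';

// Hypothetical helper: extracts a few fields from one site's article markup.
function parse_article_html($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Text of the first node matching the query, or null if there is none
    $first = function ($query) use ($xpath) {
        $nodes = $xpath->query($query);
        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
    };

    return array(
        'title'    => $first('//div[@id="content-main"]//h1'),
        'subtitle' => $first('//div[@class="uvod"]'),
        'content'  => $first('//div[@class="tekst"]'),
    );
}
```

Each site would get its own set of queries, which is why a separate parser per site is needed when no feed is available.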

Most blogs have an associated RSS XML file. The blog page has a "link" tag in its head section pointing to this file, so that browsers can let users subscribe to the feed. The RSS file contains the data you need for each blog entry, such as title, description, publish date, and URL. Use PHP's SimpleXML extension to load the XML content into a SimpleXML object; you can then access each piece of the feed that you need.
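The two steps described here, finding the feed link in the page and reading the entries with SimpleXML, might look roughly like this. The page and feed contents are invented and inlined so the sketch runs offline; find_feed_url and read_feed_items are hypothetical names, and a real feed may use Atom or extra namespaces that this does not handle.

```php
<?php
// Hypothetical blog page and feed, inlined so the sketch runs without network access.
$page = '<html><head><title>Blog</title>'
      . '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.xml">'
      . '</head><body></body></html>';

$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Hello world</title>
      <link>http://example.com/hello</link>
      <description>First post</description>
      <pubDate>Mon, 06 Sep 2010 16:20:00 +0000</pubDate>
    </item>
  </channel>
</rss>
XML;

// Returns the href of the first RSS <link> tag in the page, or null if none is found.
function find_feed_url($pageHtml)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc->loadHTML($pageHtml);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('type')) === 'application/rss+xml') {
            return $link->getAttribute('href');
        }
    }
    return null;
}

// Loads the feed XML with SimpleXML and returns one array per entry.
function read_feed_items($feedXml)
{
    $rss = simplexml_load_string($feedXml);
    if ($rss === false) {
        throw new Exception('Could not parse feed');
    }

    $items = array();
    foreach ($rss->channel->item as $item) {
        $items[] = array(
            'title'       => (string) $item->title,
            'url'         => (string) $item->link,
            'description' => (string) $item->description,
            'published'   => strtotime((string) $item->pubDate), // Unix timestamp
        );
    }
    return $items;
}
```

In practice you would download the page, pass it to find_feed_url, download the returned URL, and hand that content to read_feed_items.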
