Using Goutte with Symfony2 in a Controller

Question

I'm trying to scrape a page and I'm not very familiar with PHP frameworks, so I've been trying to learn Symfony2. I have it up and running, and now I'm trying to use Goutte. It's installed in the vendor folder, and I have a bundle I'm using for my scraping project.

The question is: is it good practice to do scraping from a Controller? And how? I have searched forever and cannot figure out how to use Goutte from a bundle, since it's buried deep within the file structure.

<?php

namespace ontf\scraperBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Goutte\Client;

class ThingController extends Controller
{
  public function somethingAction($something)
  {

    $client = new Client();
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');
    echo $crawler->text();


    return $this->render('scraperBundle:Thing:index.html.twig');

    // return $this->render('scraperBundle:Thing:index.html.twig', array(
    //     'something' => $something
    //     ));
  }

}

Answer 1

I'm not sure I have heard of "good practices" as far as scraping goes, but you may be able to find some in the book PHP Architect's Guide to Web Scraping with PHP.

These are some guidelines I have used in my own projects:

Scraping is a slow process, so consider delegating it to a background process.
Background processes normally run either as a cron job that executes a CLI application or as a worker that runs constantly.
Use a process control system to manage your workers. Take a look at supervisord.
Save every scraped file (the "raw" version) and log every error. This will enable you to detect problems. Use Rackspace Cloud Files or AWS S3 to archive these files.
Use the Symfony2 Console tool to create the commands that run your scraper. You can save the commands in your bundle under the Command directory.
Run your Symfony2 commands with the following flags to avoid running out of memory:
    php app/console scraper:run example.com --env=prod --no-debug
Here app/console is where the Symfony2 console application lives, scraper:run is the name of your command, example.com is an argument indicating the page you want to scrape, and --env=prod --no-debug are the flags you should use to run in production. See the code below for an example.
Inject the Goutte Client into your command like this:

Ontf/ScraperBundle/Resources/services.yml

services:
    goutte_client:
        class: Goutte\Client

    scraperCommand:
        class:  Ontf\ScraperBundle\Command\ScraperCommand
        arguments: ["@goutte_client"]
        tags:
            - { name: console.command }

And your command should look something like this:

<?php
// Ontf/ScraperBundle/Command/ScraperCommand.php
namespace Ontf\ScraperBundle\Command;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Goutte\Client;

class ScraperCommand extends Command
{
    private $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
        parent::__construct();
    }

    protected function configure()
    {
        $this
            ->setName('scraper:run')
            ->setDescription('Run Goutte Scraper.')
            ->addArgument(
                'url',
                InputArgument::REQUIRED,
                'URL you want to scrape.'
            );
    }

    protected function execute(InputInterface $input, OutputInterface $output) 
    {
        $url = $input->getArgument('url');
        $crawler = $this->client->request('GET', $url);
        // Write through the console output object instead of echo.
        $output->writeln($crawler->text());
    }
}
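
The guideline about saving every raw file and logging every error could be sketched roughly as follows. This is only a minimal local-disk sketch: archiveRawPage and logScrapeError are hypothetical helper names, not part of Goutte or Symfony2, and a real setup would more likely upload to remote storage such as the services mentioned above.

```php
<?php
// Hypothetical helpers (not part of Goutte or Symfony2): keep the raw body
// of each scraped page on local disk and append failures to a log file.

/**
 * Saves $html under $dir and returns the path that was written.
 * The file name combines a UTC timestamp with an MD5 hash of the URL.
 */
function archiveRawPage($dir, $url, $html)
{
    // Create the archive directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create archive directory: $dir");
    }

    $path = sprintf('%s/%s-%s.html', rtrim($dir, '/'), gmdate('Ymd-His'), md5($url));

    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException("Cannot write archive file: $path");
    }

    return $path;
}

/**
 * Appends one timestamped line describing a failed scrape to $logFile.
 */
function logScrapeError($logFile, $url, $message)
{
    $line = sprintf("[%s] %s: %s\n", gmdate('c'), $url, $message);
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

Inside execute(), one option would be to wrap the request in a try/catch, pass the raw response body to archiveRawPage on success, and call logScrapeError with the exception message on failure.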

Answer 2

Use a Symfony controller if you want to return a response, e.g. HTML output.

If you only need the functionality for calculating or storing data in a database, you should create a service class that represents the functionality of your crawler, e.g.

use Goutte\Client;

class CrawlerService
{
    // Fetches $url and returns the text of the crawled document.
    public function getText($url)
    {
        $client = new Client();
        $crawler = $client->request('GET', $url);

        return $crawler->text();
    }
}

To execute it, I would use a console command.

If you want to return a Response, use a controller.

Answer 1
4 2015-03-16 22:26:18

Answer 2
0 2015-03-16 22:18:29
