Using Goutte with Symfony2 in Controller

I'm trying to scrape a page and I'm not very familiar with PHP frameworks, so I've been trying to learn Symfony2. I have it up and running, and now I'm trying to use Goutte. It's installed in the vendor folder, and I have a bundle I'm using for my scraping project.

The question is: is it good practice to do scraping from a Controller, and if so, how? I have searched for a long time and cannot figure out how to use Goutte from a bundle, since it's buried deep within the file structure.

<?php

namespace ontf\scraperBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Goutte\Client;

class ThingController extends Controller
{
  public function somethingAction($something)
  {

    $client = new Client();
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');
    echo $crawler->text();


    return $this->render('scraperBundle:Thing:index.html.twig');

    // return $this->render('scraperBundle:Thing:index.html.twig', array(
    //     'something' => $something
    //     ));
  }

}

I'm not sure I have heard of "good practices" as far as scraping goes, but you may be able to find some in the book PHP Architect's Guide to Web Scraping with PHP.

These are some guidelines I have used in my own projects:

  1. Scraping is a slow process, so consider delegating that task to a background process.
  2. Background processes normally run either as a cron job that executes a CLI application or as a worker that is constantly running.
  3. Use a process control system to manage your workers. Take a look at supervisord.
  4. Save every scraped file (the "raw" version) and log every error. This will enable you to detect problems. Use Rackspace Cloud Files or AWS S3 to archive these files.
  5. Use the Symfony2 Console tool to create the commands that run your scraper. You can save the commands in your bundle under the Command directory.
  6. Run your Symfony2 commands with the following flags to prevent running out of memory: php app/console scraper:run example.com --env=prod --no-debug. Here app/console is where the Symfony2 console application lives, scraper:run is the name of your command, example.com is an argument indicating the page you want to scrape, and --env=prod --no-debug are the flags you should use to run in production. See the code below for an example.
  7. Inject the Goutte Client into your command like so:

Ontf/ScraperBundle/Resources/services.yml

services:
    goutte_client:
        class: Goutte\Client

    scraperCommand:
        class:  Ontf\ScraperBundle\Command\ScraperCommand
        arguments: ["@goutte_client"]
        tags:
            - { name: console.command }

And your command should look something like this:

<?php
// Ontf/ScraperBundle/Command/ScraperCommand.php
namespace Ontf\ScraperBundle\Command;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Goutte\Client;

class ScraperCommand extends Command
{
    private $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
        parent::__construct();
    }

    protected function configure()
    {
        $this
            ->setName('scraper:run')
            ->setDescription('Run Goutte Scraper.')
            ->addArgument(
                'url',
                InputArgument::REQUIRED,
                'URL you want to scrape.'
            );
    }

    protected function execute(InputInterface $input, OutputInterface $output) 
    {
        $url = $input->getArgument('url');
        $crawler = $this->client->request('GET', $url);
        // Write through the console output object instead of echo.
        $output->writeln($crawler->text());
    }
}
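
Guideline 4 above (keep the "raw" version of every scraped page) can be sketched in plain PHP. This is only a minimal illustration that writes to a local directory instead of Rackspace Cloud Files or S3; the function name saveRawPage, the file naming scheme, and the directory argument are all assumptions made for this example, not part of Goutte or Symfony.

```php
<?php
// Hypothetical helper (not part of Goutte or Symfony): archive the raw
// body of a scraped page so that parsing problems can be inspected later.
function saveRawPage(string $url, string $body, string $dir): string
{
    // Create the archive directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create archive directory: $dir");
    }

    // Assumed naming scheme: UTC timestamp plus a hash of the URL.
    $name = gmdate('Ymd\THis\Z') . '-' . sha1($url) . '.html';
    $path = rtrim($dir, '/') . '/' . $name;

    if (file_put_contents($path, $body) === false) {
        throw new RuntimeException("Cannot write raw page to: $path");
    }

    return $path;
}
```

Inside the command's execute() method you could call something like this with the URL and the raw response body, wrap the call in a try/catch, and log the exception message so that failures are recorded.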

You should use a Symfony controller if you want to return a response, e.g. HTML output.

If you only need the functionality for calculating or storing data in a database, you should create a service class that encapsulates the functionality of your crawler, e.g.:

use Goutte\Client;

class CrawlerService
{
    // Fetch the page at $url and return its text content.
    public function getText($url)
    {
        $client = new Client();
        $crawler = $client->request('GET', $url);
        return $crawler->text();
    }
}

To execute it, I would use a console command.

If you want to return a Response, use a controller.

