Manipulate the DOM with PHP to scrape data

I am currently trying to manipulate the DOM through PHP to extract the view count from a Facebook video page. The code below was working until recently, but now it no longer finds the node that contains the view count. This information is inside a div with the id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of a Facebook video page?

function callCurl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    $http     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array(
        $http,
        $response,
    );
}



function test()
{

    $url     = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
    $request = callCurl($url);
    if ($request[0] == 200) {
        $dom = new DOMDocument();
        @$dom->loadHTML($request[1]);
        $elm = $dom->getElementById('fbPhotoPageMediaInfo');
        if (isset($elm->nodeValue)) {
            $views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
        } else {
            $views = null;
        }
    } else {
        echo "Error!";
    }

    return isset($views) ? $views : null;
}

Here is what I've determined:

  1. If you var_dump() $request, you can see that it is giving you a 302 code (redirect) rather than a 200 (OK).
  2. Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we get a different page from the one expected.

I ran the following to see where I was being redirected to:

$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);

This gave me a page saying I was using an outdated browser and needed to update it. So apparently Facebook doesn't like the User Agent.

I updated it as follows:

curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');

That appears to solve the problem.
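
To confirm where a 302 points before changing anything else, cURL can report the redirect target itself. The following is only a minimal sketch: probeUrl() and isRedirectStatus() are hypothetical helper names that do not appear in the question, and the sketch assumes the cURL extension is loaded. It keeps CURLOPT_FOLLOWLOCATION off on purpose so that the first response is what gets inspected.

```php
<?php
// Hypothetical helper: true for the HTTP status codes that signal a redirect.
function isRedirectStatus(int $status): bool
{
    return in_array($status, [301, 302, 303, 307, 308], true);
}

// Hypothetical helper: fetch a URL without following redirects and report
// the status code plus the URL the server wanted to redirect to.
function probeUrl(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // stay on the first response
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_CONNECTTIMEOUT => 20,
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // CURLINFO_REDIRECT_URL is filled in when redirects are not followed.
    $target = (string) curl_getinfo($ch, CURLINFO_REDIRECT_URL);

    return [
        'status'   => $status,
        'redirect' => isRedirectStatus($status) ? $target : null,
        'body'     => $body === false ? null : $body,
    ];
}

// Example call (performs a network request, so it is left commented out):
// var_dump(probeUrl('https://www.facebook.com/TaylorSwift/videos/10153665021155369/', 'Mozilla/44.0.2'));
```

Comparing the 'redirect' value for two different User Agent strings should show whether the User Agent is what triggers the redirect.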

Personally, I prefer to use Simplehtmldom.

Facebook, like other high-traffic sites, updates its source to help prevent scraping. In the future you may have to adjust your node search.

<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);

require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/

function Scrape_FB_Views($url) {

    // Reject anything that is not a well-formed URL
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return "URL is invalid.";
    }

    // Create DOM from URL
    $html = file_get_html($url);
    if (!$html) {
        return "Could not get HTML.";
    }

    // 4th instance of span with fcg class (the index is zero-based)
    $node = $html->find('span[class=fcg]', 3);
    if (!$node) {
        return "Node is no longer valid.";
    }

    $text = trim($node->plaintext); // get content of span as plain text
    return preg_replace('/[^0-9]/', '', $text); // strip all non-numeric characters

}

$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>
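
If you would rather stay with PHP's built-in DOM extension, the same idea of an adjustable node search can be sketched with DOMXPath. This is only an illustration under stated assumptions: extractCount() is a hypothetical helper, the sample HTML is made up for the example, and the two XPath queries simply reuse the fbPhotoPageMediaInfo id and the fcg class mentioned in this thread. The real page markup may differ and may change at any time.

```php
<?php
// Hypothetical helper: try several XPath queries in order and return the
// digits found in the first matching node, or null when nothing usable matches.
function extractCount(string $html, array $queries): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom      = new DOMDocument();
    $loaded   = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            // Keep only the digits of the first matching node's text.
            $digits = preg_replace('/[^0-9]/', '', $nodes->item(0)->textContent);
            if ($digits !== '') {
                return $digits;
            }
        }
    }

    return null;
}

// Made-up sample markup, used only to show the call.
$sample  = '<html><body><div id="fbPhotoPageMediaInfo"><span class="fcg">1,234,567 Views</span></div></body></html>';
$queries = [
    '//*[@id="fbPhotoPageMediaInfo"]',   // the id used in the question
    '//span[contains(@class, "fcg")]',   // fallback: the class used above
];
var_dump(extractCount($sample, $queries));
```

Keeping the queries in an array means that when the markup changes, only the list needs updating, not the parsing logic.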
