
PHP crawling data from website

I am currently trying to crawl a lot of data from a website, but I am struggling a bit with it. The site has an a-z index and a 1-20 page index, so the crawler has a bunch of nested loops and DOM handling in it. It managed to crawl and save about 10,000 rows on the first run, but now I am at around 15,000 and it only crawls around 100 per run.

It is probably because it has to skip the rows it has already inserted (I added a check for that). I can't think of a way to easily skip some pages, as the 1-20 index varies a lot (one letter has 18 pages, another has only 2).

I was checking whether there was already a record with the given ID and inserting it if not. I assumed that would be slow, so now, before the script starts, I retrieve all rows and then check with an in_array(), assuming that's faster. But it just won't work.

So my crawler navigates 26 letters, 20 pages per letter, and then up to 50 times per page, so if you calculate it, it's a lot.

I thought of running it letter by letter, but that won't really work: I am still stuck at "a" and can't just hop to "b", as I would miss records from "a".

I hope I have explained the problem well enough for someone to help me. My code looks roughly like this (I have removed some stuff here and there, but I think all the important parts are in here to give you an idea):

function in_array_r($needle, $haystack, $strict = false) {
    foreach ($haystack as $item) {
        if (($strict ? $item === $needle : $item == $needle) || (is_array($item) && in_array_r($needle, $item, $strict))) {
            return true;
        }
    }

    return false;
}
/* CONNECT TO DB */
mysql_connect()......



$qry = mysql_query("SELECT uid FROM tableName");
$all = array();
while ($row = mysql_fetch_array($qru)) {
    $all[] = $row;
} // Retrieving all the current database rows to compare later

foreach (range("a", "z") as $key) {
    for ($i = 1; $i < 20; $i++) {
        $dom = new DomDocument();
        $dom->loadHTMLFile("http://www.crawleddomain.com/".$i."/".$key.".htm");
        $finder = new DomXPath($dom);
        $classname="table-striped";
        $nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
        foreach ($nodes as $node) {
            $rows = $finder->query("//a[contains(@href, '/value')]", $node);
            foreach ($rows as $row) {
                $url = $row->getAttribute("href");
                $dom2 = new DomDocument();
                $dom2->loadHTMLFile("http://www.crawleddomain.com".$url);
                $finder2 = new DomXPath($dom2);
                $classname2="table-striped";
                $nodes2 = $finder2->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname2 ')]");
                foreach ($nodes2 as $node2) {

                    $rows2 = $finder2->query("//a[contains(@href, '/loremipsum')]", $node2);
                    foreach ($rows2 as $row2) {

                        $dom3 = new DomDocument();
                        //
                        // not so important variable declarations..
                        //


                        $dom3->loadHTMLFile("http://www.crawleddomain.com".$url);
                        $finder3 = new DomXPath($dom3);
                        //2 $finder3->query() right here


                        $query231 = mysql_query("SELECT id FROM tableName WHERE uid='$uid'");
                        $result = mysql_fetch_assoc($query231);
                        //Doing this to get category ID from another table, to insert with this row..
                        $id = $result['id'];


                        if (!in_array_r($uid, $all)) { // if not exist
                            mysql_query("INSERT INTO')"); // insert the whole bunch
                        }

                    }
                }
            }
        }
    }
}

$uid is not defined. Also, this query makes no sense:

mysql_query("INSERT INTO')");

You should turn on error reporting:

ini_set('display_errors',1); 
error_reporting(E_ALL);
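
As a rough illustration (a sketch, not the asker's actual code): with reporting enabled, PHP emits a diagnostic as soon as an undefined $uid is interpolated into the SQL string. The error handler below is only there to capture the message so it can be printed; the exact wording and severity differ between PHP versions.

```php
<?php
ini_set('display_errors', '1');
error_reporting(E_ALL);

// Collect diagnostics in an array instead of letting PHP print them directly.
$messages = [];
set_error_handler(function (int $severity, string $message) use (&$messages): bool {
    $messages[] = $message;
    return true; // mark the diagnostic as handled
});

// $uid is never assigned, mirroring the posted snippet.
$sql = "SELECT id FROM tableName WHERE uid='$uid'";

restore_error_handler();

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

Without error reporting the same line runs silently and the query simply searches for an empty uid.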

After your queries you should add an or die(mysql_error());

Also, I might as well say it; if I don't, someone else will. Don't use mysql_* functions. They're deprecated and will be removed from future versions of PHP. Try PDO.
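
A minimal sketch of what the check-then-insert step could look like with PDO, under some assumptions: an in-memory SQLite database stands in for the MySQL server so the snippet runs on its own, the "title" column and the saveRow() helper are hypothetical, and only tableName and uid come from the question. With the error mode set to exceptions, a broken query fails loudly, which plays the role of or die(mysql_error()).

```php
<?php
// Sketch only: SQLite in memory replaces the asker's MySQL server.
// For MySQL the DSN would start with 'mysql:' and take credentials as well.
$pdo = new PDO('sqlite::memory:');

// Throw an exception on any SQL error instead of failing silently.
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// "tableName" and "uid" are from the question; "title" is a made-up column.
$pdo->exec('CREATE TABLE tableName (id INTEGER PRIMARY KEY, uid TEXT UNIQUE, title TEXT)');

// Prepare once, execute many times inside the crawl loops.
$exists = $pdo->prepare('SELECT 1 FROM tableName WHERE uid = :uid');
$insert = $pdo->prepare('INSERT INTO tableName (uid, title) VALUES (:uid, :title)');

/**
 * Hypothetical helper: insert a crawled row unless its uid is already stored.
 * Returns true when a row was inserted, false when it was skipped.
 */
function saveRow(PDOStatement $exists, PDOStatement $insert, string $uid, string $title): bool
{
    $exists->execute([':uid' => $uid]);
    $found = $exists->fetchColumn() !== false; // false means no matching row
    $exists->closeCursor();
    if ($found) {
        return false;
    }
    $insert->execute([':uid' => $uid, ':title' => $title]);
    return true;
}

$first  = saveRow($exists, $insert, 'abc123', 'Example row'); // new uid
$second = saveRow($exists, $insert, 'abc123', 'Example row'); // same uid again
$count  = (int) $pdo->query('SELECT COUNT(*) FROM tableName')->fetchColumn();
```

Binding $uid as a parameter also removes the need to quote it by hand inside the SQL string.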
