How to improve performance iterating a DOMDocument?

I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.

The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables: each type A table is followed by one or more type B tables.

I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
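The timing approach described above could look roughly like the following. This is only a sketch: `timeStage()` and the stand-in stages are hypothetical names made up for illustration, not part of the original script.

```php
<?php
// Hypothetical helper, not from the original script: runs one stage and
// records how long it took, in seconds, under the given label.
function timeStage(string $label, callable $stage, array &$timings)
{
    $start = microtime(true);                  // float seconds since the Unix epoch
    $result = $stage();
    $timings[$label] = microtime(true) - $start;
    return $result;
}

$timings = array();

// Stand-in stages; the real script would fetch, tidy and parse here.
$html = timeStage('fetch', function () {
    return '<table><tr><td>x</td></tr></table>';
}, $timings);

$length = timeStage('measure', function () use ($html) {
    return strlen($html);
}, $timings);

// Print one line per stage with its elapsed time.
foreach ($timings as $label => $seconds) {
    printf("%-8s %.6f s\n", $label, $seconds);
}
```

Wrapping each stage this way keeps the subtraction in one place, so the per-stage numbers stay comparable.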

First, I include two files. One handles some parsing, and the other defines two "data structure" classes.

// Imports
include('./course.php');
include('./utils.php');
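
The two data-structure classes are not shown in the post. Judging only from how the parsing functions further down use them, they might look something like this. The property names and `addSection()` come from that code; the file they live in (presumably course.php), the method body and the sample values are assumptions.

```php
<?php
// Sketch of the two data-structure classes. Property names are taken from
// the parsing code in the post; everything else is an assumption.
class Section
{
    public $section;
    public $code;
    public $openSeats;
    public $dayAndTime;
    public $instructor;
    public $buildingAndRoom;
    public $isOnline;
}

class Course
{
    public $startDate;
    public $endDate;
    public $name;
    public $description;
    public $credits;
    public $hours;
    public $division;
    public $subject;
    public $sections;   // stays NULL until a section list is created

    public function addSection(Section $section)
    {
        if ($this->sections === null) {
            $this->sections = array();
        }
        $this->sections[] = $section;
    }
}

$course = new Course;
$course->name = 'Example course';   // made-up value

$section = new Section;
$section->code = '12345';           // made-up value
$course->addSection($section);

echo count($course->sections), "\n";
```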

Includes are inconsequential as far as I know, so let's proceed to the cURL import.

//  Execute cURL
$response = curl_exec($curl_handle);

I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.

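// Run about 25 str_replace calls here, to clean up,
// then run tidy.

$html = $response;

//  Prepare some config for tidy
$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'wrap'         => 200,
);

//  Tidy up the HTML
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

//  Take the repaired markup as a string rather than the tidy object
$html = tidy_get_output($tidy);

//  Load it into the DOMDocument used below. This step is not shown in
//  the original post; the use of loadHTML() here is an assumption.
$dom = new DOMDocument;
$dom->loadHTML($html);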

Up until now, the code has taken about nine seconds. Since this is a cron job that runs infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)

//  Get all of the tables in the page

$tables = $dom->getElementsByTagName('table');

//  Create a buffer for the courses

$courses = array();

//  Iterate

$numberOfTables = $tables->length;

for ($i=1; $i <$numberOfTables ; $i++) { 

    $sectionTable = $tables->item($i);
    $courseTable = $tables->item($i-1);

    //  We've found a course table, parse it.

    if (elementIsACourseSectionTable($sectionTable)) {

        $course = courseFromTable($courseTable);
        $course = addSectionsToCourseUsingTable($course, $sectionTable);            

        $courses[] = $course;
    }
}   

For reference, here are the utility functions that I call:

//  
//  Tell us if a given element is
//  a course section table.
//

function elementIsACourseSectionTable(DOMElement $element){

        $tableHasClass = $element->hasAttribute('class');
        $tableIsCourseTable = $element->getAttribute("class") == "coursetable"; 

        return $tableHasClass && $tableIsCourseTable;
}

//
//  Takes a table and parses it into an 
//  instance of the Course class.
//

function courseFromTable(DOMElement $table){

    $secondRow = $table->getElementsByTagName('tr')->item(1);   
    $cells = $secondRow->getElementsByTagName('td');

    $course = new Course;

    $course->startDate = valueForElementInList(0, $cells);
    $course->endDate = valueForElementInList(1, $cells);        
    $course->name = valueForElementInList(2, $cells);
    $course->description = valueForElementInList(3, $cells);
    $course->credits = valueForElementInList(4, $cells);
    $course->hours = valueForElementInList(5, $cells);
    $course->division = valueForElementInList(6, $cells);
    $course->subject = valueForElementInList(7, $cells);

    return $course;

}


//
//  Takes a table row and parses it into an 
//  instance of the Section class.
//

function sectionFromRow(DOMElement $row){

    $cells = $row->getElementsByTagName('td');

    //
    //  Skip any row with a single cell
    //

    if ($cells->length == 1) {
        return NULL;
    }

    //
    //  Skip header rows and rows whose first cell is empty
    //

    $firstCell = valueForElementInList(0, $cells);

    if ($firstCell == "Section" || $firstCell == "") {
        return NULL;
    }


    $section = new Section;

    $section->section = valueForElementInList(0, $cells);
    $section->code = valueForElementInList(1, $cells);
    $section->openSeats = valueForElementInList(2, $cells);     
    $section->dayAndTime = valueForElementInList(3, $cells);        
    $section->instructor = valueForElementInList(4, $cells);        
    $section->buildingAndRoom = valueForElementInList(5, $cells);
    $section->isOnline = valueForElementInList(6, $cells);  

    return $section;

}

//
//  Takes a table containing course sections,
//  parses it, and puts the results into a
//  given course object.
//

function addSectionsToCourseUsingTable(Course $course, DOMElement $table){

    $rows = $table->getElementsByTagName('tr');
    $numRows = $rows->length;

    for ($i=0; $i < $numRows; $i++) { 

        $section = sectionFromRow($rows->item($i));

        //  Make sure we have an array to put sections into 

        if (is_null($course->sections)) {
            $course->sections = array();
        }

        //  Skip "meta" rows, since they're not really sections

        if (is_null($section)) {
            continue;
        }

        $course->addSection($section);
    }

    return $course;
}

//
//  Returns the trimmed text of the cell
//  at the given index in a node list.
//

function valueForElementInList($index, $list){
    $value =  $list->item($index)->nodeValue;
    $value = trim($value);
    return $value;
}

This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!

I've been advised to split up the workload of my main work loop, but considering the homogeneous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.

What can I do to improve my code execution time?

It turns out that my loop is terribly inefficient.

Using a foreach cut the time in half, to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
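
For illustration, a foreach over the node list might look like the sketch below. The markup is made up, and only the `coursetable` class name comes from the question; this is not the exact loop that produced the 31-second figure.

```php
<?php
// Made-up markup: one plain table followed by two "coursetable" tables.
$html = '<html><body>'
      . '<table><tr><td>Header</td></tr></table>'
      . '<table class="coursetable"><tr><td>A</td></tr></table>'
      . '<table class="coursetable"><tr><td>B</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$sectionTables = array();

// foreach uses the list's own iterator, so the loop body never calls
// item($i) by index and never reads ->length.
foreach ($dom->getElementsByTagName('table') as $table) {
    if ($table->getAttribute('class') === 'coursetable') {
        $sectionTables[] = trim($table->nodeValue);
    }
}

print_r($sectionTables);
```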

Using DOMNodeList's item() accessor is linear in the index, so calling it on every iteration makes the total time of a loop grow quadratically with the number of nodes. So, removing the first element after each iteration makes the loop faster. Now, we always access the first element of the list. This brought me down to 8 seconds.
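
A different way to avoid repeated item() calls, not the one used in this answer, is to copy the live list into a plain PHP array once and index the array instead. The sketch below assumes made-up markup and `id` attributes for illustration, and has not been timed against the real page.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body>'
    . '<table id="t0"><tr><td>Course</td></tr></table>'
    . '<table id="t1" class="coursetable"><tr><td>Section</td></tr></table>'
    . '</body></html>'
);

// Copy the live DOMNodeList into a plain, zero-indexed array once.
$tables = iterator_to_array($dom->getElementsByTagName('table'), false);
$numberOfTables = count($tables);

$pairs = array();
for ($i = 1; $i < $numberOfTables; $i++) {
    $sectionTable = $tables[$i];        // array lookup, no list traversal
    $courseTable  = $tables[$i - 1];

    if ($sectionTable->getAttribute('class') === 'coursetable') {
        $pairs[] = $courseTable->getAttribute('id')
                 . ' -> '
                 . $sectionTable->getAttribute('id');
    }
}

print_r($pairs);
```

Unlike removing nodes, this leaves the document intact, at the cost of holding one extra array of node references.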

After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:

$table = $tables->item(0);

while ($table !== NULL) {

    //
    //  We've found a section table, parse it
    //  into the current course.
    //

    if (elementIsACourseSectionTable($table)) {

        $course = addSectionsToCourseUsingTable($course, $table);
    }

    //
    //  We've found a course header table,
    //  so start a new course.
    //

    else if (elementIsCourseHeaderTable($table)) {
        $course = courseFromTable($table);
        $courses[] = $course;
    }

    //
    //  Remove the table we just handled from the document.
    //  The node list is live, so it drops out of the list too.
    //

    $table->parentNode->removeChild($table);

    //
    //  Get the next table to parse, which is now first in the list
    //

    $table = $tables->item(0);
}

Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.
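
To see the remove-the-first-item pattern in isolation, here is a small self-contained sketch. The markup and the `courseheader` class name are made up for illustration, and a plain nested array stands in for the Course and Section classes.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body>'
    . '<table class="courseheader"><tr><td>Course 1</td></tr></table>'
    . '<table class="coursetable"><tr><td>Section 1a</td></tr></table>'
    . '<table class="coursetable"><tr><td>Section 1b</td></tr></table>'
    . '<table class="courseheader"><tr><td>Course 2</td></tr></table>'
    . '<table class="coursetable"><tr><td>Section 2a</td></tr></table>'
    . '</body></html>'
);

$tables  = $dom->getElementsByTagName('table');   // live list
$courses = array();
$current = null;

// Always look at the first table, handle it, then remove it.
while (($table = $tables->item(0)) !== null) {
    $text = trim($table->nodeValue);

    if ($table->getAttribute('class') === 'coursetable') {
        // Section table: attach it to the most recent course, if any.
        if ($current !== null) {
            $courses[$current][] = $text;
        }
    } else {
        // Anything else is treated as a course header in this sketch.
        $current = $text;
        $courses[$current] = array();
    }

    // Removing the node from the document also drops it from the live list.
    $table->parentNode->removeChild($table);
}

print_r($courses);
```

Keep in mind that this loop consumes the document: once it finishes, the tables are gone from `$dom`, so anything else that needs them has to run first.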
