Function inside an array in PHP

I am working on a web crawler, and I have a CSV file with about one million websites that I want to crawl. I can load the CSV file into an array, but when I pass the array to the method that crawls it, it seems to crawl only the first element rather than the whole array. Can someone help me?

<?php


    include("classes/DomDocumentParser.php");
    include("config.php");



    $alreadyCrawled =  array();
    $crawling =  array();
    $alreadyFoundImages = array();
    $my_list = array();



    function linkExists($url){
            global $con;

            $query = $con->prepare("SELECT * FROM sites WHERE url = :url");

            $query ->bindParam(":url",$url);
            $query->execute();

            return $query->rowCount() != 0;
    }

    function insertImage($url,$src,$title,$alt){
            global $con;

            $query = $con->prepare("INSERT INTO images(siteUrl, imageUrl, alt, title)
                                            VALUES(:siteUrl,:imageUrl,:alt,:title)");

            $query ->bindParam(":siteUrl",$url);
            $query ->bindParam(":imageUrl",$src);
            $query ->bindParam(":alt",$alt);
            $query ->bindParam(":title",$title);

            return $query->execute();
    }

    function insertLink($url,$title,$description,$keywords){
            global $con;

            $query = $con->prepare("INSERT INTO sites(Url, title, description, keywords)
                                            VALUES(:url,:title,:description,:keywords)");

            $query ->bindParam(":url",$url);
            $query ->bindParam(":title",$title);
            $query ->bindParam(":description",$description);
            $query ->bindParam(":keywords",$keywords);

            return $query->execute();
    }

    function createLink($src,$url){

            $scheme = parse_url($url)["scheme"]; // http or https
            $host = parse_url($url)["host"]; // www.mohamad-ahmad.com

            if(substr($src,0,2) =="//"){
                    //  //www.mohanadahmad.com
                    $src  = $scheme . ":" . $src;
            }
            else  if(substr($src,0,1) =="/"){
                    //  /aboutus/about.php
                    $src  = $scheme . "://" . $host . $src;
            }
            else if(substr($src,0,2) =="./"){
                    //  ./aboutus/about.php
                    $src  = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src ,1);
            }
            else if(substr($src,0,3) =="../"){
                    //  ../aboutus/about.php
                    $src  = $scheme . "://" . $host . "/" . $src;
            }
            else if(substr($src,0,5) !="https" && substr($src,0,4) !="http" ){
                    //  aboutus/about.php
                    $src  = $scheme . "://" . $host ."/" .$src;
            }
            return $src;
    }

    function getDetails($url){

            global $alreadyFoundImages;

            $parser = new DomDocumentParser($url);

            $titleArray = $parser->getTitletags();

            if(sizeof($titleArray) == 0 || $titleArray->item(0) == NULL){
                    return;
            }

            $title = $titleArray -> item(0) -> nodeValue;
            $title = str_replace("\n","",$title);

            if($title == ""){
                    return;
            }


            $description="";
            $keywords="";

            $metasArray = $parser -> getMetatags();

            foreach($metasArray as $meta){

                    if($meta->getAttribute("name") == "description"){
                            $description = $meta -> getAttribute("content");
                    }
                    if($meta->getAttribute("name") == "keywords"){
                            $keywords = $meta -> getAttribute("content");
                    }

            }

            $description = str_replace("\n","",$description);
            $keywords = str_replace("\n","",$keywords);    

            if(linkExists($url)){
                    echo "$url already exists <br>";
            }
            else if(insertLink($url,$title,$description,$keywords)){
                    echo "SUCCESS: $url <br>";
            }
            else{
                    echo "ERROR: Failed to insert $url <br>";
            }

            $imageArray = $parser ->getImages();
            foreach($imageArray as $image){

                    $src = $image->getAttribute("src");
                    $alt = $image->getAttribute("alt");
                    $title = $image->getAttribute("title");

                    if(!$title && !$alt){
                            continue;
                    }

                    $src = createLink($src,$url);

                    if(!in_array($src,$alreadyFoundImages)){
                            $alreadyFoundImages[] = $src;

                            // Argument order follows the signature insertImage($url,$src,$title,$alt)
                            insertImage($url,$src,$title,$alt);
                    }
            }
    }

    function followLinks($url) {

        global $crawling;
        global $alreadyCrawled;

        $parser = new DomDocumentParser($url);

        $linkList = $parser->getLinks();
        foreach($linkList as $link){

                $href = $link->getAttribute("href");

                if(strpos($href,"#") !==false){
                        // Ignore anchor url
                        continue;
                }
                else if(substr($href,0,11)== "javascript:"){
                        // Ignore javascript url 
                        continue;
                }

                $href = createLink($href,$url);

                if(!in_array($href,$alreadyCrawled)){
                        $alreadyCrawled[] = $href;
                        $crawling[] = $href;

                        //getDetails contain the insert into db
                        getDetails($href);
                }         
        }
        array_shift($crawling);

        foreach($crawling as $site){
                followLinks($site);
        }
    }

    function fill_my_list(){

            global $my_list;

            $file = fopen('top-1m.csv', 'r');
            while( ($data = fgetcsv($file)) !== false ) {
                    $startUrl = "https://www.".$data[1];
                    $my_list[] = $startUrl;
            }
            foreach($my_list as $key => $u){
                    followLinks($u);
            }
    }
    fill_my_list();
?>

You can do something like this, based on the example from php.net:

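$row = 1;
// Open the CSV file; fopen() returns false on failure.
if (($file = fopen("test.csv", "r")) !== false) {
  // Read one row at a time; each $data is an array of that row's fields.
  while (($data = fgetcsv($file, 1000, ",")) !== false) {
    $num = count($data);
    echo "<p> $num fields in line $row: <br /></p>\n";
    $row++;
    for ($c = 0; $c < $num; $c++) {
      // $data[$c] is a single field; use it here (e.g. pass it to the crawler).
      echo $data[$c] . "<br />\n";
    }
  }
  fclose($file);
}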
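
Applied to the CSV from the question, the same row-by-row loop can produce start URLs without first collecting a million entries in an array. This is only a minimal sketch: it assumes the domain sits in the second column (as `$data[1]` in the question suggests), and `readStartUrls` is a hypothetical helper name, not part of PHP or any library.

```php
<?php
// Hypothetical helper: yields one start URL per CSV row, reading the
// file lazily instead of loading every row into memory.
function readStartUrls(string $path): Generator
{
    $file = fopen($path, 'r');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Length 0 means "no line-length limit"; the remaining arguments
        // are the usual separator, enclosure, and escape characters.
        while (($data = fgetcsv($file, 0, ',', '"', '\\')) !== false) {
            // Skip blank or short rows; assumes column 1 holds the domain.
            if (!isset($data[1]) || trim($data[1]) === '') {
                continue;
            }
            yield 'https://www.' . trim($data[1]);
        }
    } finally {
        // Runs even if the caller stops iterating early.
        fclose($file);
    }
}
```

With this in place, something like `foreach (readStartUrls('top-1m.csv') as $url) { followLinks($url); }` would hand each site to the crawler in turn.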

More examples here: https://www.php.net/manual/en/function.fgetcsv.php#refsect1-function.fgetcsv-examples
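
As for why only the first site seems to be crawled: judging from the question's code, `followLinks` calls itself for every link it discovers, so the crawl of the first start URL may effectively never finish, and the loop never reaches the second entry. One possible way to bound this is an iterative queue with a per-site page limit. The sketch below is an illustration under assumptions, not the asker's method: `crawlSite`, `$getLinks`, and `$maxPages` are hypothetical names, and `$getLinks` stands in for the link extraction that `DomDocumentParser` performs in the question.

```php
<?php
// Hypothetical sketch: breadth-first crawl of one site that stops after
// $maxPages pages, so the caller can move on to the next start URL.
// $getLinks is an assumed callable: it takes a URL and returns an array
// of absolute link URLs found on that page.
function crawlSite(string $startUrl, callable $getLinks, int $maxPages = 50): array
{
    $queue = [$startUrl];
    $seen = [$startUrl => true]; // keyed by URL for fast duplicate checks
    $visited = [];

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;

        foreach ($getLinks($url) as $href) {
            if (!isset($seen[$href])) {
                $seen[$href] = true;
                $queue[] = $href;
            }
        }
    }
    return $visited;
}
```

Each start URL from the CSV would then get its own `crawlSite` call, so one very large site cannot keep the rest of the list from being processed.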
