Regular Expression not matching content in PHP

I am trying to scrape an eBay page such as this one: http://www.ebay.co.uk/sch/Cars-/9801/i.html?_nkw=vw+golf

Everything works great except that one of my regular expressions isn't matching the content, so the matches aren't being pushed to $linksArray. I have output the contents to make sure that what I am trying to match is in fact there, and it is. I then call print_r($linksArray), which should show all the matches, but it doesn't: it is an empty multidimensional array. You can see my live example here: http://www.mycommunity.co.za/marcksack/index.php

Here is my PHP code:

<?php
echo '<form method="POST">
<input type="text" id="url" name="url" size="120" value="' . (isset($_REQUEST["url"]) && !empty($_REQUEST["url"]) ? $_REQUEST["url"] : "") . '"/>
<input type="submit" value="Submit" />
</form>';
flush();

if (isset($_REQUEST["url"]) && !empty($_REQUEST["url"])) {
    $url = $_REQUEST["url"];
    $phones = array();
    for ($page = 1; $page <= 1; $page++) {

        // get page contents

        $contents = file_get_contents($url . "&_pgn=" . $page);
        echo(htmlentities($contents));
        // find all links patterns
        // HERE IS THE PROBLEM
        $pattern = '/class="lvtitle"><a href="(.*)" class="vip"/';
        $linksArray = array();
        preg_match_all($pattern, $contents, $linksArray);
        print_r($linksArray);
        $links = $linksArray[0];

        foreach($links as $link) {
            $pureLink = str_replace("class=\"lvtitle\"><a href=\"", "", $link);
            $pureLink = str_replace("\" class=\"vip\"", "", $pureLink);

            // getting sub page contents

            $subContents = file_get_contents($pureLink);

            // find all links patterns

            $subContents = str_replace(" ", "", $subContents);
            $phonePattern = '/07[0-9]{9}/';
            $phonesArray = array();
            preg_match_all($phonePattern, $subContents, $phonesArray);
            foreach($phonesArray[0] as $element) {

                // check that the phone was not previously added to the phones array

                if (!in_array($element, $phones)) {

                    // add it to the phones array

                    array_push($phones, $element);
                    echo $element . "<br />";
                    flush();
                }
            }
        }
    }

    // print results
    foreach($phones as $phone){
        echo $phone."<br/>";
    }

}

?>

So my question is: what am I doing wrong? Why are the matches not being pushed to my $linksArray variable? I really appreciate your help!

This regex works:

"/ class=\"lvtitle\"><a href=\"([^\"]*)\"  class=\"vip\"/"

A few issues with yours:

  1. You were trying to capture the URL using (.*), which is greedy: it consumes as much of the line as it can instead of stopping at the URL's closing quote. Using ([^"]*) stops the capture at the first double quote.
  2. It was not matching at all because eBay puts two spaces between the href attribute and the class attribute that follows it, while your pattern allows only one.
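
To see both issues in isolation, here is a minimal sketch run against a made-up snippet. The sample markup, the example.com URLs, and the variable names are assumptions for illustration only; the real eBay HTML may differ. It also uses \s+ in place of the two literal spaces, which is a small variation on the regex above so the pattern keeps working if the spacing changes.

```php
<?php
// Hypothetical sample: two listings on ONE line, each with two spaces
// between the href attribute and class="vip" (as described above).
$html = '<h3 class="lvtitle"><a href="http://example.com/item/1"  class="vip" title="Golf">VW Golf</a></h3>'
      . '<h3 class="lvtitle"><a href="http://example.com/item/2"  class="vip" title="Polo">VW Polo</a></h3>';

// Pattern from the question: allows exactly one space, so nothing matches.
$strict = '/class="lvtitle"><a href="(.*)" class="vip"/';
preg_match_all($strict, $html, $strictMatches);

// Whitespace fixed, but (.*) still greedy: a single match whose capture
// runs from the first href all the way to the last class="vip".
$greedy = '/class="lvtitle"><a href="(.*)"\s+class="vip"/';
preg_match_all($greedy, $html, $greedyMatches);

// Negated character class: the capture stops at the first closing quote.
$fixed = '/class="lvtitle"><a href="([^"]*)"\s+class="vip"/';
preg_match_all($fixed, $html, $fixedMatches);

// Index 1 holds the captured group, i.e. the bare URLs.
print_r($fixedMatches[1]);
```

Because the URL is captured by the group, it can be read straight from index 1 of the matches array, so the str_replace() cleanup on index 0 should no longer be necessary.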

Also, as has already been mentioned, you should use the API or DOMDocument for this. But in case you are curious, this is why it wasn't working. I hope that helps!
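
For the DOMDocument route, a rough sketch might look like the following. The sample markup (including the h3 tag), the example.com URLs, and the XPath expression are assumptions based only on the class names in the question, not on eBay's actual page structure.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = '<html><body>'
      . '<h3 class="lvtitle"><a href="http://example.com/item/1"  class="vip">VW Golf</a></h3>'
      . '<h3 class="lvtitle"><a href="http://example.com/item/2"  class="vip">VW Polo</a></h3>'
      . '<a href="http://example.com/other">Unrelated link</a>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a class="vip"> that is a direct child of an element with class="lvtitle".
$xpath = new DOMXPath($doc);
$links = array();
foreach ($xpath->query('//*[@class="lvtitle"]/a[@class="vip"]') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links);
```

Since the parser handles attribute order and whitespace itself, details such as the extra space between attributes stop mattering. Note that @class="vip" is an exact comparison, so an element with several classes would need a contains() test instead.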
