简体   繁体   中英

PHP preg_match_all regex to extract only number in string

I can't seem to figure out the proper regular expression for extracting just specific numbers from a string. I have an HTML string that has various img tags in it. There are a bunch of img tags in the HTML that I want to extract a portion of the value from. They follow this format:

<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />

So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.

The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.

Any thoughts on how to proceed with it would be great...

Thus far, I've tried:

$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);

I think this is the best approach:

  1. Use an HTML parser to extract the image tags
  2. Use a regular expression (or perhaps string manipulation) to extract the ID
  3. Query for the data
  4. Use the HTML parser to insert the returned data

Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);

foreach( $doc->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    preg_match( '#/images/([0-9]+)\.#i', $src, $matches);
    $id = $matches[1];
    echo 'Fetching info for image ID ' . $id . "\n";

    // Query stuff here
    $result = 'Got this from the DB';

    $img->setAttribute( 'title', $result);
    $img->setAttribute( 'alt', $result);
}

$newHTML = $doc->saveHtml();

Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.

preg_match_all('/<img src="http:\/\/domain.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);

This would contain all of the stuff that it found.

use preg_match_all :

preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);

output:

Array
(
    [0] => Array
        (
            [0] => <img src="http://domain.com/images/59.
            [1] => <img src="http://domain.com/images/549.
            [2] => <img src="http://domain.com/images/1249.
            [3] => <img src="http://domain.com/images/6.
        )

    [1] => Array
        (
            [0] => 59
            [1] => 549
            [2] => 1249
            [3] => 6
        )

)

Consider using preg_replace_callback .

Use this regex: (images/([0-9]+)[^"]+")

Then, as the callback argument, use an anonymous function. Result:

$output = preg_replace_callback(
    "(images/([0-9]+)[^\"]+\")",
    function($m) {
        // $m[1] is the number.
        $t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
        return $m[0]." title=\"".$t."\"";
    },
    $input
);

This regex should match the number parts:

\/images\/(?P<digits>[0-9]+)\.[a-z]+

Your $matches['digits'] should have all of the digits you want as an array.

$matches = array();
preg_match_all('/[:digits:]+/', $htmlString, $matches);

Then loop through the matches array to both reconstruct the HTML and to do you look up in the database.

Regular expressions alone are a bit on the loosing ground when it comes to parsing crappy HTML. DOMDocument 's HTML handling is pretty well to serve tagsoup hot and fresh, xpath to select your image srcs and a simple sscanf to extract the number:

$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
    if (sscanf($src, '%*[^0-9]%d', $number)) {
        $ids[] = $number;
    }
}

Because that only gives you an array, why not encapsulate it?

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

$imageNumbers = new ImageNumbers($html);

var_dump((array) $imageNumbers);

Which gives you:

array(4) {
  [0]=>
  int(59)
  [1]=>
  int(549)
  [2]=>
  int(1249)
  [3]=>
  int(6)
}

By that function above nicely wrapped into an ArrayObject :

class ImageNumbers extends ArrayObject
{
    public function __construct($html) {
        parent::__construct($this->extractFromHTML($html));
    }
    private function extractFromHTML($html) {
        $numbers = array();
        $doc = new DOMDocument();
        $preserve = libxml_use_internal_errors(TRUE);
        $doc->loadHTML($html);
        foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
            if (sscanf($src, '%*[^0-9]%d', $number)) {
                $numbers[] = $number;
            }
        }
        libxml_use_internal_errors($preserve);
        return $numbers;
    }
}

If your HTML should be that malformatted that even DOMDocument::loadHTML() can't handle it, then you only need to handle that internally in the ImageNumbers class.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM