PHP preg_match_all regex to extract only number in string

I can't work out the right regular expression for extracting specific numbers from a string. I have an HTML string containing a number of img tags, and I want to extract part of the src value from each one. They follow this format:

<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />

So the number varies in length and comes before what is usually .jpg (it may also be .gif, .png, or something else). I want to extract only the number from that string.

The second part is that I want to use that number to look up an entry in a database and fetch the alt/title text for that image ID. Finally, I want to add the value returned from the database to the tag and put it back into the HTML string.

Any thoughts on how to proceed would be great...

Thus far, I've tried:

$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);

I think this is the best approach:

  1. Use an HTML parser to extract the image tags
  2. Use a regular expression (or perhaps string manipulation) to extract the ID
  3. Query for the data
  4. Use the HTML parser to insert the returned data

Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);

foreach( $doc->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Skip images whose src does not contain a numeric ID
    if (!preg_match('#/images/([0-9]+)\.#i', $src, $matches)) {
        continue;
    }
    $id = $matches[1];
    echo 'Fetching info for image ID ' . $id . "\n";

    // Query stuff here
    $result = 'Got this from the DB';

    $img->setAttribute( 'title', $result);
    $img->setAttribute( 'alt', $result);
}

$newHTML = $doc->saveHtml();
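
The string-manipulation alternative mentioned above might look roughly like the sketch below. The helper name imageIdFromSrc is made up for illustration, and it assumes the file name (without its extension) consists only of the numeric ID.

```php
<?php
// Hypothetical helper: returns the numeric ID from an image URL, or null.
function imageIdFromSrc(string $src): ?int
{
    // parse_url() isolates the path, e.g. "/images/59.jpg"
    $path = parse_url($src, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    // pathinfo() drops the directory and the extension, leaving "59"
    $name = pathinfo($path, PATHINFO_FILENAME);
    // Accept the name only if it is made up entirely of digits
    return ctype_digit($name) ? (int) $name : null;
}

var_dump(imageIdFromSrc('http://domain.com/images/549.jpg'));  // int(549)
var_dump(imageIdFromSrc('http://domain.com/images/logo.png')); // NULL
```

Inside the foreach loop this could stand in for the preg_match call, skipping the image whenever null comes back.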

Using regular expressions, you can get the number really easily. The third argument to preg_match_all is an array, passed by reference, that is populated with the matches found.

preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);

$matches[0] will contain the full matched strings, and $matches[1] the captured numbers.
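
As a small sketch of reading that array: with the default PREG_PATTERN_ORDER flag, the captured numbers sit in index 1. The sample $html and the $ids variable are chosen here for illustration only.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/6.gif" class="something" />';

// Default flag is PREG_PATTERN_ORDER: $matches[0] = full matches, $matches[1] = group 1
preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);

// Captured groups are strings; cast them if integer IDs are needed
$ids = array_map('intval', $matches[1]);
print_r($ids);
```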

Use preg_match_all:

preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => <img src="http://domain.com/images/59.
            [1] => <img src="http://domain.com/images/549.
            [2] => <img src="http://domain.com/images/1249.
            [3] => <img src="http://domain.com/images/6.
        )

    [1] => Array
        (
            [0] => 59
            [1] => 549
            [2] => 1249
            [3] => 6
        )

)

Consider using preg_replace_callback.

Use this regex: (images/([0-9]+)[^"]+")

Then, as the callback argument, use an anonymous function. Result:

$output = preg_replace_callback(
    "(images/([0-9]+)[^\"]+\")",
    function($m) {
        // $m[1] is the number.
        $t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
        return $m[0]." title=\"".$t."\"";
    },
    $input
);
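
To try the callback without a database, something like the following could be used. Here getTitleFromDatabase is only a stub backed by a made-up array, and the htmlspecialchars call is an addition, not part of the answer above, so that a title containing quotes cannot break the attribute.

```php
<?php
// Hypothetical stand-in for the database lookup; a real version would query by ID.
function getTitleFromDatabase(int $id): string
{
    $titles = [59 => 'Sunset "over" the bay', 549 => 'City skyline'];
    return $titles[$id] ?? '';
}

$input = '<img src="http://domain.com/images/59.jpg" class="something" />';

$output = preg_replace_callback(
    '(images/([0-9]+)[^"]+")', // the outer parentheses are the pattern delimiters
    function (array $m): string {
        $title = getTitleFromDatabase((int) $m[1]);
        // Escape quotes and angle brackets before placing the title in the attribute
        return $m[0] . ' title="' . htmlspecialchars($title, ENT_QUOTES) . '"';
    },
    $input
);

echo $output, "\n";
```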

This regex should match the number parts:

\/images\/(?P<digits>[0-9]+)\.[a-z]+

$matches['digits'] should then hold all of the digits you want, as an array.
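
A minimal sketch of that pattern in a full call follows. The # delimiters and the i flag are choices made here for convenience, not part of the regex as given, and the sample $html is illustrative.

```php
<?php
$html = '<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.png" class="something" />';

// "#" is used as the delimiter so the slashes need no escaping;
// the "i" flag also accepts upper-case extensions such as .JPG
preg_match_all('#/images/(?P<digits>[0-9]+)\.[a-z]+#i', $html, $matches);

// The named group is available under its name (and also under index 1)
print_r($matches['digits']);
```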

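$matches = array();
// The POSIX class is [:digit:] and must sit inside a bracket expression.
// Note that this matches every run of digits in the string, not only image IDs.
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);

Then loop through the matches array both to reconstruct the HTML and to do your lookup in the database.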
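
One possible shape for that loop is sketched below. It uses a narrower pattern than the one above so that only image IDs are picked up, and the $titles array is a hypothetical stand-in for the database.

```php
<?php
// Hypothetical lookup table standing in for the database
$titles = [59 => 'Harbour at dawn', 6 => 'Old bridge'];

$htmlString = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

// PREG_SET_ORDER yields one array per tag: [0] = matched text, [1] = the ID
preg_match_all('#<img src="[^"]*/images/(\d+)\.[a-z]+"#i', $htmlString, $matches, PREG_SET_ORDER);

$replacements = [];
foreach ($matches as $match) {
    $alt = htmlspecialchars($titles[(int) $match[1]] ?? '', ENT_QUOTES);
    // Map each matched fragment to the same fragment with an alt attribute appended
    $replacements[$match[0]] = $match[0] . ' alt="' . $alt . '"';
}

// strtr() applies every replacement in a single pass, so repeated images are handled once
if ($replacements !== []) {
    $htmlString = strtr($htmlString, $replacements);
}

echo $htmlString, "\n";
```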

Regular expressions alone are on losing ground when it comes to parsing crappy HTML. DOMDocument's HTML handling copes well with tag soup, XPath can select the image src attributes, and a simple sscanf extracts the number:

$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
    if (sscanf((string) $src, '%*[^0-9]%d', $number) === 1) { // exactly one value parsed
        $ids[] = $number;
    }
}

Because that only gives you an array, why not encapsulate it?

$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';

$imageNumbers = new ImageNumbers($html);

var_dump((array) $imageNumbers);

Which gives you:

array(4) {
  [0]=>
  int(59)
  [1]=>
  int(549)
  [2]=>
  int(1249)
  [3]=>
  int(6)
}

This is the function above, neatly wrapped in an ArrayObject:

class ImageNumbers extends ArrayObject
{
    public function __construct($html) {
        parent::__construct($this->extractFromHTML($html));
    }
    private function extractFromHTML($html) {
        $numbers = array();
        $doc = new DOMDocument();
        $preserve = libxml_use_internal_errors(TRUE);
        $doc->loadHTML($html);
        foreach(simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
            if (sscanf((string) $src, '%*[^0-9]%d', $number) === 1) { // exactly one value parsed
                $numbers[] = $number;
            }
        }
        libxml_use_internal_errors($preserve);
        return $numbers;
    }
}

If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that internally, in the ImageNumbers class.
