
PHP: split a string of alternating groups of characters into an array

I have a string whose correct syntax is the regex ^([0-9]+[abc])+$ . So examples of valid strings would be: '1a2b' or '00333b1119a555a0c'

For clarity, the string is a list of (value, letter) pairs and the order matters. I'm stuck with the input string so I can't change that. While testing for correct syntax seems easy in principle with the above regex, I'm trying to think of the most efficient way in PHP to transform a compliant string into a usable array something like this:

Input:

'00333b1119a555a0c'

Output:

array (
  0 =>  array('num' => '00333', 'let' => 'b'),
  1 =>  array('num' => '1119', 'let' => 'a'),
  2 =>  array('num' => '555', 'let' => 'a'),
  3 =>  array('num' => '0', 'let' => 'c')
)

I'm having difficulty using preg_match for this. For example, this doesn't give the expected result. The intent is to greedily match EITHER \d+ (and save that) OR [abc] (and save that), repeated until the end of the string is reached.

$text = '00b000b0b';
$out = array();
$x = preg_match("/^(?:(\d+|[abc]))+$/", $text, $out);

This didn't work either. The intent here is to greedily match \d+[abc] (and save each of these), repeated until the end of the string is reached, and then to split them into number and letter afterwards.

$text = '00b000b0b';
$out = array();
$x = preg_match("/^(?:\d+[abc])+$/", $text, $out);
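For reference, a minimal sketch of what the two attempts above actually return. The variable names $out1 and $out2 are only for illustration. A capturing group that sits inside a repetition keeps only its last iteration, and a pattern with no capturing group reports only the full match, so neither call can return the individual pairs.

```php
<?php
$text = '00b000b0b';

// Attempt 1: the capturing group is repeated, so only its last iteration survives.
preg_match('/^(?:(\d+|[abc]))+$/', $text, $out1);
print_r($out1); // [0] => 00b000b0b, [1] => b

// Attempt 2: there is no capturing group, so only the full match is reported.
preg_match('/^(?:\d+[abc])+$/', $text, $out2);
print_r($out2); // [0] => 00b000b0b
```

Both calls do return 1, so either pattern is still usable as a syntax check.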

I'd planned to check syntax as part of the preg_match, then use the preg_match output to greedy-match the 'blocks' (or keep the delimiters if using preg_split), then if needed loop through the result 2 items at a time using for (...; i+=2) to extract value-letter in their pairs.

But I can't seem to even get that basic preg_split() or preg_match() approach to work smoothly, much less explore if there's a 'neater' or more efficient way.
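A minimal sketch of the preg_split() plan described above, assuming the whole string is validated first with a separate preg_match() call. The names $parts and $pairs are illustrative only.

```php
<?php
$input = '00333b1119a555a0c';
$pairs = [];

// Validate the whole string first; preg_match() returns 1 on a match.
if (preg_match('/^(?:[0-9]+[abc])+$/', $input) === 1) {
    // Split on each letter but keep it in the result (PREG_SPLIT_DELIM_CAPTURE);
    // PREG_SPLIT_NO_EMPTY drops the empty piece after the final letter.
    $parts = preg_split('/([abc])/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // $parts alternates number, letter, number, letter, ...
    for ($i = 0; $i < count($parts); $i += 2) {
        $pairs[] = ['num' => $parts[$i], 'let' => $parts[$i + 1]];
    }
}

print_r($pairs);
```

The validation step matters here: without it, a malformed string could yield an odd number of parts and the loop would read past the end of $parts.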

Your regex needs separate capturing groups, one for the number and one for the letter:

/([0-9]+?)([a-z])/i

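This captures each run of digits in one group and the letter that follows it in another. preg_match_all() then returns every such match in the string.

The ? after the + makes the quantifier lazy (non-greedy), so it matches the shortest possible string. It is harmless here but not strictly required: [0-9] can never match a letter, so a plain greedy [0-9]+ captures exactly the same text.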

$out[0] holds the whole matches
$out[1] holds the first capture group (the numbers)
$out[2] holds the second capture group (the letters)

Example:

<?php
$input = '00333b1119a555a0c';

$regex = '/([0-9]+?)([a-z])/i';

$out = [];

$parsed = [];

if (preg_match_all($regex, $input, $out)) {
    foreach ($out[0] as $index => $value) {
        $parsed[] = [
            'num' => $out[1][$index],
            'let' => $out[2][$index],
        ];
    }
}

var_dump($parsed);

Output:

array(4) {
  [0] =>
  array(2) {
    'num' =>
    string(5) "00333"
    'let' =>
    string(1) "b"
  }
  [1] =>
  array(2) {
    'num' =>
    string(4) "1119"
    'let' =>
    string(1) "a"
  }
  [2] =>
  array(2) {
    'num' =>
    string(3) "555"
    'let' =>
    string(1) "a"
  }
  [3] =>
  array(2) {
    'num' =>
    string(1) "0"
    'let' =>
    string(1) "c"
  }
}

Simple solution with preg_match_all (with PREG_SET_ORDER flag) and array_map functions:

$input = '00333b1119a555a0c';

preg_match_all('/([0-9]+?)([a-z]+?)/i', $input, $matches, PREG_SET_ORDER);
$result = array_map(function($v) {
    return ['num' => $v[1], 'let' => $v[2]];
}, $matches);

print_r($result);

The output:

Array
(
    [0] => Array
        (
            [num] => 00333
            [let] => b
        )

    [1] => Array
        (
            [num] => 1119
            [let] => a
        )

    [2] => Array
        (
            [num] => 555
            [let] => a
        )

    [3] => Array
        (
            [num] => 0
            [let] => c
        )
)
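For comparison, a small sketch of how the two ordering flags lay out the same matches. The short input '1a22b' and the names $byPattern and $bySet are made up for illustration. The default, PREG_PATTERN_ORDER, groups results by capture group, while PREG_SET_ORDER groups them by match, which is why array_map() can build each pair from a single element.

```php
<?php
$input = '1a22b';
$regex = '/([0-9]+?)([a-z]+?)/i';

// Default ordering (PREG_PATTERN_ORDER): one sub-array per capture group.
preg_match_all($regex, $input, $byPattern);
// $byPattern is [['1a', '22b'], ['1', '22'], ['a', 'b']]

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($regex, $input, $bySet, PREG_SET_ORDER);
// $bySet is [['1a', '1', 'a'], ['22b', '22', 'b']]

print_r($byPattern);
print_r($bySet);
```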

You can use:

$str = '00333b1119a555a0c';
$arr = [];

if (preg_match_all('/(\d+)(\p{L}+)/', $str, $m)) {
    // $m[1] holds the numbers and $m[2] the letters; pair them up by index.
    array_walk($m[1], function ($v, $k) use (&$arr, $m) {
        $arr[] = ['num' => $v, 'let' => $m[2][$k]];
    });
}

print_r($arr);

Output:

Array
(
    [0] => Array
        (
            [num] => 00333
            [let] => b
        )

    [1] => Array
        (
            [num] => 1119
            [let] => a
        )

    [2] => Array
        (
            [num] => 555
            [let] => a
        )

    [3] => Array
        (
            [num] => 0
            [let] => c
        )
)

All of the above work, but they didn't have the elegance I wanted: they needed a loop or array mapping, and (for preg_match_all()) they needed a second, almost identical regex as well, just to verify that the whole string matched.

I eventually found that preg_match_all() combined with named captures solved it for me. I hadn't used named captures for that purpose before and it looks powerful.

I also added an optional extra step to simplify the output if duplicate letters aren't expected (this wasn't in the question but may help someone).

$input = '00333b1119a555a0c';

preg_match_all("/(?P<num>\d+)(?P<let>[abc])/", $input, $raw_matches, PREG_SET_ORDER);
print_r($raw_matches);

// if dups not expected this is also worth doing
$matches = array_column($raw_matches, 'num', 'let');

print_r($matches);

More complete version with input+duplicate checking

$input = '00333b1119a555a0c';
if (!preg_match("/^(\d+[abc])+$/",$input)) {
    // OPTIONAL:  detected $input incorrectly formatted
}
preg_match_all("/(?P<num>\d+)(?P<let>[abc])/", $input, $raw_matches, PREG_SET_ORDER);
$matches = array_column($raw_matches, 'num', 'let');
if (count($matches) != count($raw_matches)) {
    // OPTIONAL:  detected duplicate letters in $input
}
print_r($matches);

Explanation:

This uses preg_match_all() as suggested by @RomanPerekhrest and @exussum to break out the individual groups and split the numbers and letters. I used named groups so that the resulting array of $raw_matches is created with the correct names already.

But if duplicates aren't expected, I use an extra step with array_column(), which extracts data directly from a nested array of entries and builds the desired flat array, with no need for loops, mapping, walking, or item-by-item assignment. It goes from

(group1 => (num1, let1), group2 => (num2, let2), ... )

to the "flat" array:

(let1 => num1, let2 => num2, ... )
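As a small sketch of that transformation on its own, here is array_column() applied to a hand-written nested array (the data is illustrative, not the output of the regex). The second argument names the column to use as values and the third names the column to use as keys.

```php
<?php
// Hand-written stand-in for the named-capture matches.
$raw_matches = [
    ['num' => '00333', 'let' => 'b'],
    ['num' => '1119',  'let' => 'a'],
    ['num' => '0',     'let' => 'c'],
];

// Values come from 'num', keys from 'let'.
$matches = array_column($raw_matches, 'num', 'let');

print_r($matches); // b => 00333, a => 1119, c => 0
```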

If named captures feel too advanced, they can be left out. The groups are numbered anyway, so this works just as well, but you have to refer to the groups by index and the code is harder to follow.

preg_match_all("/(\d+)([abc])/", $input, $raw_matches, PREG_SET_ORDER);
$matches = array_column($raw_matches, 1, 2);

If you need to check for duplicated letters (which wasn't in the question but could be useful), here's how. When array_column() is used, each letter becomes a key in the new array, and duplicate keys can't exist. So if the original matches contained more than one entry for a letter, only one entry for that letter is kept (the last one). We just test whether the number of matches originally found is the same as the number of entries in the final array after array_column(). If not, there were duplicates.
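A minimal sketch of that count comparison wrapped in a function, using the question's letter set [abc]. The function name hasDuplicateLetters is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: returns true when any letter occurs more than once.
function hasDuplicateLetters(string $input): bool
{
    preg_match_all('/(?P<num>\d+)(?P<let>[abc])/', $input, $raw, PREG_SET_ORDER);

    // Letters become array keys, so repeated letters collapse into one entry.
    $flat = array_column($raw, 'num', 'let');

    return count($flat) !== count($raw);
}

var_dump(hasDuplicateLetters('00333b1119a0c'));     // bool(false)
var_dump(hasDuplicateLetters('00333b1119a555a0c')); // bool(true)
```

The second input is the one from the question, where 'a' appears twice, so the flat array has three entries against four matches.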


 