
Regular expression against the ID3v2 unsynchronisation scheme in mp3 file?

I'm writing a piece of code that checks the mp3 files on my server and reports whether any of them contain a false sync. In short, I load each file in PHP with fread() and hold the stream in a variable. After splitting that stream into separate streams for ID3v1 (not relevant here, since it is not subject to unsynchronisation), ID3v2 (the main problem) and audio, I have to apply the unsynchronisation scheme to the ID3v2 stream.

According to the official ID3v2 documentation:

The only purpose of the 'unsynchronisation scheme' is to make the ID3v2 tag as compatible as possible with existing software. There is no use in 'unsynchronising' tags if the file is only to be processed by new software. Unsynchronisation may only be made with MPEG 2 layer I, II and III and MPEG 2.5 files.

Whenever a false synchronisation is found within the tag, one zeroed byte is inserted after the first false synchronisation byte. The format of a correct sync that should be altered by ID3 encoders is as follows:

%11111111 111xxxxx

And should be replaced with:

%11111111 00000000 111xxxxx

This has the side effect that all $FF 00 combinations have to be altered, so they won't be affected by the decoding process. Therefore all the $FF 00 combinations have to be replaced with the $FF 00 00 combination during the unsynchronisation.

To indicate usage of the unsynchronisation, the first bit in 'ID3 flags' should be set (note: I've found that bit). This bit should only be set if the tag contains a, now corrected, false synchronisation. The bit should only be clear if the tag does not contain any false synchronisations.

Do bear in mind, that if a compression scheme is used by the encoder, the unsynchronisation scheme should be applied afterwards. When decoding a compressed, 'unsynchronised' file, the 'unsynchronisation scheme' should be parsed first, decompression afterwards.

My questions are:

  1. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx?
  2. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx?

...using preg_replace().

The code I've written so far works, and I'm only missing one more line (well, two, to be exact).

<?php

  // some basic checkings here, such as 'does file exist'
  // and 'is it readable'

  $f = fopen('test.mp3', 'rb'); // 'b' keeps the read binary-safe on all platforms

  // ...rest of my code...  

  $pattern1 = '?????'; // pattern from 1st question
  $id3stream = preg_replace($pattern1, 'something1', $id3stream);

  // ...extracting frames...

  $pattern2 = '?????'; // pattern from 2nd question
  $id3stream = preg_replace($pattern2, 'something2', $id3stream);

  // ..do more job...

  fclose($f);

?>

How do I make those two preg_replace() lines work?

PS: I know how to do it by reading byte after byte in a loop, but I'm sure it can be done with regular expressions (to be honest, I suck at regex).

Let me know if you need more details.


One more thing...

At the moment I'm using this pattern

$pattern0 = '/[\x00].*/';
echo preg_replace($pattern0, '', $input_string);

to cut off the part of the string from the first zero byte to the end. Is that the correct way to do this?


Update

(Following @mario's answer.)

In the first couple of tests, this code returned the correct result.

  // print original stream
  printStreamHex($stream_original, 'ORIGINAL STREAM');

  // adding zero pads on unsync scheme
  $stream_1 = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $stream_original);
  printStreamHex($stream_1, 'AFTER ADDING ZEROS');

  // reversing process
  $stream_2 = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $stream_1);
  printStreamHex($stream_2, 'AFTER REMOVING ZEROS');


  echo "Status: <b>" . ($stream_original == $stream_2 ? "OK" : "Failed") . "</b>";

But minutes later I found a specific case where everything looks like the expected result, yet there are still FFE0+ pairs in the stream.

ORIGINAL STREAM
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

AFTER ADDING ZEROS
+-----------------------------------------------------------------+
| FF  00  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  |
| 00  FA  84  E0  A9  99  1F  39  B5  E1  54  FF  00  E7  ED  B8  |
| B1  3A  36  88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  |
| 1A  FF  00  FF  FF  00  F8  21  F9  2F  FF  00  F7  17  67  EB  |
| 2A  EB  6E  41  82  FF                                          |
+-----------------------------------------------------------------+

AFTER REMOVING ZEROS
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

Status: OK

If the stream contains something like FF FF FF FF, it is replaced with FF 00 FF FF 00 FF, but it should be FF 00 FF 00 FF 00 FF. The remaining FF FF pair would again cause a false mp3 synchronisation, so my goal is to eliminate every FFE0+ pattern before the audio stream (that is, in the ID3v2 tag stream, because mp3 audio starts with an FFE0+ byte pair, and that should be the first occurrence, at the beginning of the audio data). I figured out that I can run the same regex in a loop until the stream contains no FFE0+ byte pair. Is there a solution that doesn't require a loop?
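
One loop-free option is a lookahead: the byte after FF is tested but not consumed, so it can start the next match itself. Below is a minimal sketch, assuming the stream is a plain PHP byte string and the pattern is used without the u modifier. The helper names unsyncEncode and hasFalseSync are made up for illustration. It also pads FF 00, as the quoted spec text requires.

```php
<?php
// Hypothetical helper: insert 00 after every FF that is followed by
// 00 or by E0..FF. The lookahead (?=...) does not consume the second
// byte, so runs such as FF FF FF FF are handled in a single pass.
function unsyncEncode(string $stream): string
{
    return preg_replace('/\xFF(?=[\x00\xE0-\xFF])/', "\xFF\x00", $stream);
}

// Hypothetical helper: true if an FF E0+ pair is still present.
function hasFalseSync(string $stream): bool
{
    return preg_match('/\xFF[\xE0-\xFF]/', $stream) === 1;
}

$encoded = unsyncEncode("\xFF\xFF\xFF\xFF");
echo strtoupper(bin2hex($encoded)), "\n";   // FF00FF00FF00FF
var_dump(hasFalseSync($encoded));           // bool(false)
```

The capture-group version consumes both bytes of each pair, which is why the second FF of a run can never begin a new match there.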

Great job @mario, thanks a lot!

Binary strings are not quite the turf of regular expressions. But you already had the right approach with \\x00.

3.. to cut off part of string starting at first zero-byte until the end

$pattern0 = '/[\\x00].*$/';

You were just missing the $ here. For binary data you will usually also want the s modifier.
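
One caveat for binary input: by default . does not match a newline byte (0x0A), and $ only matches at the very end of the subject or before a final newline. A sketch of how the variants behave, assuming the input may contain 0x0A after the zero byte; cutAtFirstNul is a hypothetical helper shown as a regex-free alternative.

```php
<?php
// Hypothetical helper: return the part of $s before the first NUL byte,
// or $s unchanged if it contains none.
function cutAtFirstNul(string $s): string
{
    $pos = strpos($s, "\x00");
    return $pos === false ? $s : substr($s, 0, $pos);
}

$input = "abc\x00def\nghi";

// Without the s modifier, .* stops at the newline byte.
var_dump(preg_replace('/\x00.*/', '', $input) === "abc\nghi");   // bool(true)

// With $ but no s modifier, nothing matches, so nothing is cut.
var_dump(preg_replace('/\x00.*$/', '', $input) === $input);      // bool(true)

// With the s modifier, . matches every byte, newline included.
$viaRegex  = preg_replace('/\x00.*/s', '', $input);
$viaStrpos = cutAtFirstNul($input);
var_dump($viaRegex === 'abc', $viaStrpos === 'abc');             // bool(true) twice
```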

1.. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx ?

Use the byte FF followed by a byte in the range E0-FF for these bit strings.

preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $id3stream);

The $2 in the replacement string is needed because the second byte varies; for a fixed byte pair a simpler str_replace() would do.

2.. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx ?

Same trick.

preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $id3stream);
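
For the reverse direction, here is a sketch that follows the quoted spec text: since encoding pads every FF 00 as well as every FF E0+, decoding can drop the 00 after any FF, which a plain str_replace() covers. The function names are hypothetical, and the encoder is included only to make the round trip checkable.

```php
<?php
// Hypothetical encoder: pad FF with 00 when followed by 00 or E0..FF.
function unsyncEncode(string $stream): string
{
    return preg_replace('/\xFF(?=[\x00\xE0-\xFF])/', "\xFF\x00", $stream);
}

// Hypothetical decoder: remove the 00 that follows each FF.
function unsyncDecode(string $stream): string
{
    // str_replace scans left to right and does not rescan its output,
    // so FF 00 00 becomes FF 00 and FF 00 E0 becomes FF E0.
    return str_replace("\xFF\x00", "\xFF", $stream);
}

$original = "\xFF\xE0\x12\xFF\x00\x34\xFF\xFF\xFA";
$encoded  = unsyncEncode($original);
var_dump(unsyncDecode($encoded) === $original);   // bool(true)
```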

Just take care that it is PCRE, not the PHP parser, that interprets the \\x00 hex sequences: use single quotes or the \\ double backslash. (In a double-quoted string, \x00 would become a NUL byte, a C string terminator, before it reaches libpcre.)
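
A quick way to see which layer handles the escape is to compare string lengths. This sketch relies only on standard PHP quoting rules:

```php
<?php
// Single quotes: PHP leaves \x alone, so PCRE receives the four
// characters \ x 0 0 and interprets the escape itself.
$single = '\x00';
var_dump(strlen($single));        // int(4)
var_dump('\\x00' === $single);    // bool(true)

// Double quotes: PHP turns \x00 into one NUL byte before PCRE sees it,
// unless the backslash is doubled.
$double = "\x00";
var_dump(strlen($double));        // int(1)
var_dump(strlen("\\x00"));        // int(4)
```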
