Regular expression for the ID3v2 unsynchronisation scheme in mp3 files?

Question

I'm writing a piece of code to check the mp3 files on my server and report whether any of them contain a false sync. In short, I load each file in PHP with fread() and hold the stream in a variable. After splitting that stream into separate streams for ID3v1 (not relevant here, since it is not subject to sync), ID3v2 (the main problem) and audio, I have to apply the unsynchronisation scheme to the ID3v2 stream.

According to the official ID3v2 documentation:

The only purpose of the 'unsynchronisation scheme' is to make the ID3v2 tag as compatible as possible with existing software. There is no use in 'unsynchronising' tags if the file is only to be processed by new software. Unsynchronisation may only be made with MPEG 2 layer I, II and III and MPEG 2.5 files.

Whenever a false synchronisation is found within the tag, one zeroed byte is inserted after the first false synchronisation byte. The format of a correct sync that should be altered by ID3 encoders is as follows:

%11111111 111xxxxx

And should be replaced with:

%11111111 00000000 111xxxxx

This has the side effect that all $FF 00 combinations have to be altered, so they won't be affected by the decoding process. Therefore all the $FF 00 combinations have to be replaced with the $FF 00 00 combination during the unsynchronisation.

To indicate usage of the unsynchronisation, the first bit in 'ID3 flags' should be set (note: I've found that bit). This bit should only be set if the tag contains a, now corrected, false synchronisation. The bit should only be clear if the tag does not contain any false synchronisations.

Do bear in mind, that if a compression scheme is used by the encoder, the unsynchronisation scheme should be applied afterwards. When decoding a compressed, 'unsynchronised' file, the 'unsynchronisation scheme' should be parsed first, decompression afterwards.
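As a rough illustration of the flag mentioned in the quote, the sketch below tests the unsynchronisation bit in a 10-byte ID3v2 header. The function name `id3v2HasUnsyncFlag` and the sample header bytes are hypothetical and not taken from the post; the sketch only assumes the usual header layout ("ID3", two version bytes, one flags byte, four size bytes), where the "first bit" is the most significant bit of the flags byte.

```php
<?php
// Hypothetical helper (not from the post): reports whether a 10-byte
// ID3v2 header declares that the tag was unsynchronised.
function id3v2HasUnsyncFlag(string $header): bool
{
    // Assumed layout: "ID3", major version, revision, flags, 4 size bytes.
    if (strlen($header) < 10 || substr($header, 0, 3) !== 'ID3') {
        return false;
    }
    // The unsynchronisation flag is the most significant bit (0x80)
    // of the flags byte at offset 5.
    return (ord($header[5]) & 0x80) !== 0;
}

// Made-up sample header: version 2.3.0, flags byte 0x80.
$header = "ID3\x03\x00\x80\x00\x00\x02\x01";
var_dump(id3v2HasUnsyncFlag($header)); // bool(true)
```

A tag whose flag is clear can be passed on as-is; only a tag with the flag set needs the zero bytes removed before its frames are parsed.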

My questions are:

1. How do I search for the bit pattern %11111111 111xxxxx and replace it with %11111111 00000000 111xxxxx?
2. Vice versa, how do I search for the bit pattern %11111111 00000000 111xxxxx and replace it with %11111111 111xxxxx?

...using preg_replace().

The code I've written so far works perfectly, and I need just one more line (well, two, to be exact).

<?php

  // some basic checkings here, such as 'does file exist'
  // and 'is it readable'

  $f = fopen('test.mp3', 'r');

  // ...rest of my code...  

  $pattern1 = '?????'; // pattern from 1st question
  $id3stream = preg_replace($pattern1, 'something1', $id3stream);

  // ...extracting frames...

  $pattern2 = '?????'; // pattern from 2nd question
  $id3stream = preg_replace($pattern2, 'something2', $id3stream);

  // ..do more job...

  fclose($f);

?>

How do I make those two preg_replace() lines work?

P.S. I know how to do this by reading byte by byte in a loop, but I'm sure it is possible with regular expressions (to be honest, I'm bad at regex).

Let me know if you need more details.

One more thing...

At the moment I'm using this pattern

$pattern0 = '/[\x00].*/';
echo preg_replace($pattern0, '', $input_string);

to cut off the part of the string from the first zero byte to the end. Is that the correct way to do it?

Update

(After @mario's answer.)

In the first couple of tests, this code returned the correct result.

  // print original stream
  printStreamHex($stream_original, 'ORIGINAL STREAM');

  // adding zero pads on unsync scheme
  $stream_1 = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $stream_original);
  printStreamHex($stream_1, 'AFTER ADDING ZEROS');

  // reversing process
  $stream_2 = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $stream_1);
  printStreamHex($stream_2, 'AFTER REMOVING ZEROS');


  echo "Status: <b>" . ($stream_original == $stream_2 ? "OK" : "Failed") . "</b>";

But minutes later I found a specific case where everything looks like the expected result, yet there are still FFE0+ pairs in the stream.

ORIGINAL STREAM
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

AFTER ADDING ZEROS
+-----------------------------------------------------------------+
| FF  00  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  |
| 00  FA  84  E0  A9  99  1F  39  B5  E1  54  FF  00  E7  ED  B8  |
| B1  3A  36  88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  |
| 1A  FF  00  FF  FF  00  F8  21  F9  2F  FF  00  F7  17  67  EB  |
| 2A  EB  6E  41  82  FF                                          |
+-----------------------------------------------------------------+

AFTER REMOVING ZEROS
+-----------------------------------------------------------------+
| FF  E0  DB  49  53  BE  3B  E0  90  40  EA  2B  3A  61  FF  FA  |
| 84  E0  A9  99  1F  39  B5  E1  54  FF  E7  ED  B8  B1  3A  36  |
| 88  01  69  CA  7D  47  FA  E1  70  7C  85  34  B8  1A  FF  FF  |
| FF  F8  21  F9  2F  FF  F7  17  67  EB  2A  EB  6E  41  82  FF  |
+-----------------------------------------------------------------+

Status: OK

If the stream contains something like FF FF FF FF, it is replaced with FF 00 FF FF 00 FF, but it should be FF 00 FF 00 FF 00 FF. That remaining FF FF pair would cause a false mp3 sync again, so my goal is to eliminate every FFE0+ pattern before the audio stream (that is, inside the ID3v2 tag stream, because mp3 audio starts with an FFE0+ byte pair and its first occurrence should be at the beginning of the audio data). I figured out that I can run the same regex in a loop until the stream has no FFE0+ byte pair left. Is there a solution that doesn't require a loop?

Great job @mario, thanks a lot!

Answer 1

Binary strings are not quite the turf of regular expressions. But you already had the right approach with \\x00.

3. ...to cut off part of string starting at first zero-byte until the end

$pattern0 = '/[\\x00].*$/';

You were just missing the $ here.
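One caveat for binary data: by default `.` does not match a newline byte (0x0A), so a pattern without the `s` modifier can stop early or skip the first zero byte when the remainder contains newlines. The sketch below compares the pattern from the question with an `s`-modified variant; the helper name `cutAtFirstNul` and the sample input are made up for illustration.

```php
<?php
// Hypothetical helper: keep only the bytes before the first NUL byte.
function cutAtFirstNul(string $data): string
{
    // The s modifier lets "." match newline bytes as well, so the match
    // runs from the first NUL byte to the very end of the string.
    return preg_replace('/\x00.*/s', '', $data);
}

// Made-up input: two NUL bytes with a newline byte between them.
$input = "ab\x00cd\nef\x00gh";

// Pattern from the question: each match stops at the newline byte.
echo bin2hex(preg_replace('/[\x00].*/', '', $input)), "\n"; // 61620a6566
// With the s modifier everything after the first NUL byte is removed.
echo bin2hex(cutAtFirstNul($input)), "\n";                  // 6162
```

If no regex is needed elsewhere, locating the byte with strpos() and cutting with substr() would be a plain-string alternative.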

1. How to search & replace this bit-pattern %11111111 111xxxxx with %11111111 00000000 111xxxxx?

Use the byte sequence FF followed by E0 to FF for these bit strings.

$id3stream = preg_replace(':([\\xFF])([\\xE0-\\xFF]):', "$1\x00$2", $id3stream);

The $2 is needed in the replacement string because the second byte you search for is variable. Otherwise a simpler str_replace would work.
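A pattern that consumes both bytes cannot handle runs such as FF FF FF FF, because the second byte of one match is never tested as the first byte of the next. A lookahead avoids this without a loop. The sketch below is one possible variant, not the code from this answer: the function name `id3v2Unsynchronise` is hypothetical, and 00 is included in the lookahead on the assumption that FF 00 should also become FF 00 00, as the quoted specification requires.

```php
<?php
// Hypothetical helper (not from the thread): apply the unsynchronisation
// scheme to the bytes of an ID3v2 tag in a single pass.
function id3v2Unsynchronise(string $tagBody): string
{
    // Match an FF byte only when the NEXT byte is 00 or E0..FF.
    // The lookahead (?=...) does not consume that next byte, so in a run
    // such as FF FF FF every FF is still examined on its own.
    return preg_replace('/\xFF(?=[\x00\xE0-\xFF])/', "\xFF\x00", $tagBody);
}

// Made-up sample: a run of FF bytes, a harmless byte, then FF 00.
$in  = "\xFF\xFF\xFF\xFF\x41\xFF\x00";
$out = id3v2Unsynchronise($in);
echo strtoupper(bin2hex($out)), "\n"; // FF00FF00FF00FF41FF0000
```

The final FF of the run is left alone here because it is followed by 41, which cannot form a sync pattern.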

2. Vice versa, how to search & replace this bit-pattern %11111111 00000000 111xxxxx with %11111111 111xxxxx?

Same trick.

$id3stream = preg_replace(':([\\xFF])([\\x00])([\\xE0-\\xFF]):', "$1$3", $id3stream);

I would take care to always use the \\ double backslash, so that it is PCRE which interprets the \\x00 hex sequences, not the PHP parser. (Otherwise it would end up becoming a C string terminator before it reaches libpcre.)
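For the reverse direction, a stream in which every FF followed by 00 or E0..FF received an inserted zero byte can be decoded by dropping the zero byte after every FF. This also restores FF 00 00 to FF 00 and handles runs of FF without a loop. The sketch below assumes the input was encoded that way; the function name `id3v2Resynchronise` and the sample bytes are hypothetical.

```php
<?php
// Hypothetical helper: undo the unsynchronisation by removing the zero
// byte that follows each FF byte.
function id3v2Resynchronise(string $encoded): string
{
    // Matches are found left to right and the replacement text is not
    // rescanned, so FF 00 00 becomes FF 00 rather than a bare FF.
    return preg_replace('/\xFF\x00/', "\xFF", $encoded);
}

// Made-up sample: FF 00 FF 00 E0 41 FF 00 00
$encoded = "\xFF\x00\xFF\x00\xE0\x41\xFF\x00\x00";
$decoded = id3v2Resynchronise($encoded);
echo strtoupper(bin2hex($decoded)), "\n"; // FFFFE041FF00
```

Because the search string is fixed, str_replace("\xFF\x00", "\xFF", $encoded) would give the same result without the regex engine.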
