Simple (!?) regex to capture everything between specified pairs of characters

I need to match everything between the '[~' and '~]' tags.

I tried writing a lot of regex patterns but couldn't find the correct one:

  1. #\\[~(.*)~]# - this returns everything between the first occurrence of [~ and the last occurrence of ~].
  2. #\\[~([^~]*)~]# - this works fine as long as there is no ~ symbol inside the tags.

I understand that (.*) captures everything and ([^~]*) captures everything until it finds a ~ character, but I can't make it capture everything until it finds the ~] pair (any byte except the ~] pair is possible inside the tags, including a single ~ character). In other words, I don't know how to negate a pair of characters.

Here is a possible example:

Simple [example~]: [~here I can face both, '~' and ']' characters~] or another
example [~~~~~~[ABC]~~~~~~].

After running preg_match_all() with the regex, I expect a resulting array like this:

array(2) {
  [0]=>
  string(44) "here I can face both, '~' and ']' characters"
  [1]=>
  string(15) "~~~~~[ABC]~~~~~"
}

Note: The input string may contain binary data (00-FF).

Just to mention (for certain people here): I've already checked all related Q/A plus hundreds of Google search results.

* is greedy, so it takes as much as it can. You can make it non-greedy (add a ?), which should solve your issue.

#\[~(.*?)~]#
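
As a rough sketch of the difference, here are both patterns run against the example from the question. The input is written on a single line here (an assumption; the question shows it wrapped), and the variable names are only illustrative.

```php
<?php
// Example input from the question, joined into one line.
$input = "Simple [example~]: [~here I can face both, '~' and ']' characters~] "
       . "or another example [~~~~~~[ABC]~~~~~~].";

// Greedy: a single match running from the first [~ to the last ~].
preg_match_all('#\[~(.*)~]#', $input, $greedy);

// Non-greedy: each match stops at the nearest ~] after its [~.
preg_match_all('#\[~(.*?)~]#', $input, $lazy);

var_dump(count($greedy[1])); // int(1)
var_dump($lazy[1]);          // two strings, of 44 and 15 bytes
```

The captured groups end up in index 1 of the result array, which is the array the question expects.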

The following website has a good description and explains it in more detail: Repetition with Star and Plus.

preg_match deals with binary strings pretty well: in the standard (non-UTF-8) mode, which is what you are using, . matches a single byte. One caveat: by default . does not match a newline byte (0x0A). Since your input may contain any byte, add the s modifier (PCRE_DOTALL) so that . really matches everything: #\[~(.*?)~]#s
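
To illustrate the binary-data point, here is a minimal sketch with a made-up payload containing NUL, newline and 0xFF bytes. It assumes the default newline convention (LF), under which . skips the newline byte unless the s modifier is set.

```php
<?php
// Hypothetical payload: NUL, newline and 0xFF bytes between the tags.
$input = "head[~\x00A\nB\xFF~]tail";

// Without s, . refuses the newline byte, so nothing matches.
$plain = preg_match_all('#\[~(.*?)~]#', $input, $m1);

// With s (PCRE_DOTALL), . accepts every byte from 0x00 to 0xFF.
$dotall = preg_match_all('#\[~(.*?)~]#s', $input, $m2);

var_dump($plain);             // int(0)
var_dump($dotall);            // int(1)
var_dump(bin2hex($m2[1][0])); // string(10) "00410a42ff"
```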


Simplified example for explanation:

 aab ::  a*  -> aa

The engine first matches an empty string, then a, then aa; aab does not match, so the last match, aa, is taken and returned. As you can see, the engine internally had three valid matches: the empty string, a and aa. The last one wins in greedy mode.

 aab ::  a*? -> (empty string)

The engine is at the first position and needs zero or more a, non-greedy. Zero a's already satisfy the pattern, so it matches an empty string and returns. The first one wins in non-greedy mode.
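
The two cases above can be checked directly. The third call is an extra illustration (not part of the explanation above) showing that a non-greedy quantifier still expands when the rest of the pattern requires it.

```php
<?php
// Greedy: a* takes as many a's as it can at the first position.
preg_match('#a*#', 'aab', $greedy);

// Non-greedy: a*? is satisfied with zero a's, so the match is empty.
preg_match('#a*?#', 'aab', $lazy);

// Non-greedy followed by b: a*? grows one a at a time until b can match.
preg_match('#a*?b#', 'aab', $lazyThenB);

var_dump($greedy[0]);    // string(2) "aa"
var_dump($lazy[0]);      // string(0) ""
var_dump($lazyThenB[0]); // string(3) "aab"
```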

For UTF-8 strings, use the u modifier (PCRE_UTF8): with #.*#u, . matches any UTF-8 character (which can be one or more bytes).
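
A short sketch of what the u modifier changes, using a two-byte character as a hypothetical input. The last call is there as a caution: with u, a subject that is not valid UTF-8 makes preg_match() return false, so this mode does not suit arbitrary binary data.

```php
<?php
// "é" is two bytes in UTF-8: 0xC3 0xA9.
$input = "[~\xC3\xA9~]";

preg_match('#\[~(.)#', $input, $bytes);  // byte mode: . is one byte
preg_match('#\[~(.)#u', $input, $chars); // UTF-8 mode: . is one character

var_dump(bin2hex($bytes[1])); // string(2) "c3"
var_dump(bin2hex($chars[1])); // string(4) "c3a9"

// A lone 0xFF byte is not valid UTF-8, so matching fails outright.
$invalid = preg_match('#.#u', "\xFF");
var_dump($invalid);           // bool(false)
```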


 