Regex: Match string with substrings with the same pattern

I'm trying to match a string against a pattern that can contain substrings with the same pattern.

Here's an example string:

Nicaragua [[NOTE|s|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]. Another [[link|s|url|google.com]] that might appear.

And here's the pattern:

[[display_text|code|type|content]]

So what I want is to get the string within the brackets, and then look for more strings matching the pattern inside the top-level one.

And what I want to match is this:

  1. [[NOTE|s|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]

  1.1 [[Statutes|s|url|www.iccrom.org/about/statutes/]]

  2. [[link|s|url|google.com]]

I was using this /(\\[\\[.*]])/ but it matches everything up to the last ]].

What I want is to be able to identify the matched strings and convert them to HTML elements, where |note| becomes a blockquote tag and |url| an a tag. So a blockquote tag can have a link tag inside it.

BTW, I'm using CoffeeScript to do that.

Thanks in advance.

In general, regex is not good at dealing with nested expressions. If you use greedy patterns, they'll match too much, and if you use non-greedy patterns, as @bjfletcher suggests, they'll match too little, stopping at the first ]] inside the outer content. The "traditional" approach here is a token-based parser, where you step through the characters one by one and build an abstract syntax tree (AST), which you then reformat as desired.
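To make the token-based idea concrete, here is a minimal sketch of such a parser. It is not from the original answer: the function name parseTokens, the stack-based design, and the rule that header fields may not contain |, [ or ] are all assumptions made for illustration. It walks the input once, opens a node at each [[a|b|c| header, closes the innermost node at each ]], and collects everything else as plain text.

```javascript
// Hypothetical helper (an illustration, not the answer's code): a small
// stack-based parser for [[display_text|code|type|content]] tokens whose
// content may itself contain further tokens.

// Sticky regex: matches a token header only at HEADER.lastIndex.
// Assumption: header fields never contain '|', '[' or ']'.
const HEADER = /\[\[([^|\[\]]+)\|([^|\[\]]+)\|([^|\[\]]+)\|/y;

function parseTokens(input) {
    const root = { content: [] }; // synthetic top-level container
    const stack = [root];         // innermost open token is last
    let text = "";                // plain text collected so far
    let i = 0;

    // Move any pending plain text into the innermost open node.
    const flush = () => {
        if (text) {
            stack[stack.length - 1].content.push(text);
            text = "";
        }
    };

    while (i < input.length) {
        HEADER.lastIndex = i;
        const header = HEADER.exec(input);
        if (header) {
            // Token start: open a new node inside the current one.
            flush();
            const node = {
                display_text: header[1],
                code: header[2],
                type: header[3],
                content: []
            };
            stack[stack.length - 1].content.push(node);
            stack.push(node);
            i = HEADER.lastIndex;
            continue;
        }
        if (input.startsWith("]]", i) && stack.length > 1) {
            // Token end: close the innermost open node.
            flush();
            stack.pop();
            i += 2;
            continue;
        }
        // Anything else is ordinary text.
        text += input[i];
        i += 1;
    }
    flush(); // unclosed tokens simply keep whatever they collected
    return root.content;
}

const ast = parseTokens("See [[Statutes|s|url|www.iccrom.org/about/statutes/]] here");
console.log(JSON.stringify(ast, null, 2));
```

The result has the same shape as the JSON-based approach below (an array mixing strings and objects with a content array), so either one could feed the same formatter.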

One slightly hacky approach I've used here is to convert the string to a JSON string, and let the JSON parser do the hard work of converting it into nested objects: http://jsfiddle.net/t09q783d/1/

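function toPoorMansAST(s) {
    // Escape backslashes and double quotes, as they'll cause problems otherwise.
    // Note the doubled backslash in "\\u0022": the JSON text has to contain the
    // escape sequence itself, not an actual quote character.
    s = s.replace(/\\/g, "\\\\").replace(/"/g, "\\u0022");
    // Transform to a JSON string!
    s =
        // Wrap in array delimiters
        ('["' + s + '"]')
        // Replace token starts: [[display_text|code|type| opens an object
        // whose "content" is a nested array
        .replace(/\[\[([^\|]+)\|([^\|]+)\|([^\|]+)\|/g,
             '",{"display_text":"$1","code":"$2","type":"$3","content":["')
        // Replace token ends: ]] closes the content array and the object
        .replace(/\]\]/g, '"]},"');

    // Throws a SyntaxError if the brackets are unbalanced or the input
    // contains raw control characters such as newlines.
    return JSON.parse(s);
}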

This gives you an array of strings and structured objects, which you can then run through a formatter to spit out the HTML you'd like. The formatter is left as an exercise for the reader :).
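For anyone who wants a starting point for that formatter, here is a minimal sketch. It assumes the mapping described in the question (note becomes a blockquote, url becomes an a tag); the names toHtml and escapeHtml are hypothetical, and treating content as the link address and display_text as the link label is a guess, since the question does not spell that out.

```javascript
// Hypothetical formatter for the array-of-strings-and-objects shape
// described above. Names and the url/note mapping details are assumptions.

// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

function toHtml(nodes) {
    return nodes.map(function (node) {
        // Plain text between tokens.
        if (typeof node === "string") {
            return escapeHtml(node);
        }
        if (node.type === "url") {
            // Assumption: content is the address, display_text the label.
            const href = node.content
                .filter(function (c) { return typeof c === "string"; })
                .join("");
            return '<a href="' + escapeHtml(href) + '">' +
                escapeHtml(node.display_text) + "</a>";
        }
        if (node.type === "note") {
            // Notes may contain nested tokens, so recurse into content.
            return "<blockquote>" + toHtml(node.content) + "</blockquote>";
        }
        // Unknown types: render the inner content without a wrapper.
        return toHtml(node.content);
    }).join("");
}

const sample = [
    "Nicaragua ",
    { display_text: "NOTE", code: "s", type: "note", content: [
        "See ",
        { display_text: "Statutes", code: "s", type: "url",
          content: ["www.iccrom.org/about/statutes/"] },
        ", article 9."
    ] },
    "."
];
console.log(toHtml(sample));
```

Addresses without a scheme, such as www.iccrom.org/about/statutes/, are emitted unchanged here, so a browser would treat them as relative links; adding a scheme is a separate decision.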
