Regex for escaping markup characters in Java

I have saved messages with markup. They use only a few markup features: bold, italic, strikethrough and code. What I need now is to strip all markup from those messages, except inside code. For example:

**bold**  _italic_ ~strike~ `**code**`

would return:

bold italic strike **code**

I currently use a regex like this one for bold:

\*\*([^*]*)\*\*(?=(?:[^`\\]*(?:\\.|`(?:[^`\\]*\\.)*[^`\\]*`))*[^`]*$)

to strip the formatting from my messages, but I am having a problem with composite markup, where several formats are applied to the same string, such as bold and italic at the same time:

**_bold and italic_**

Is there any way to strip that kind of message, and is there something that would simplify the regex I am using?

You can use

replaceAll("`([^`]*)`|(\\*\\*|[_~])((?:(?!\\2).)*)\\2", "$1$3")

See the regex demo. Details:

  • `([^`]*)` - a backtick, then zero or more non-backtick chars (captured into Group 1), then a backtick
  • | - or
  • (\*\*|[_~]) - Group 2: ** , _ or ~
  • ((?:(?!\2).)*) - Group 3: any char, zero or more occurrences, as many as possible, as long as the char does not start the sequence captured into Group 2
  • \2 - the Group 2 value, that is, the same delimiter closing the span.
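
To see why the replacement "$1$3" works, it can help to print what each group holds for every match. The snippet below is only an illustrative sketch: the class name GroupInspector and the method describe are made up for this example and are not part of the original answer. For a code span only Group 1 is set, and for a markup span only Groups 2 and 3 are set. In Java, a group that did not take part in the match contributes nothing to the replacement string, so "$1$3" keeps whichever of the two is present.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class GroupInspector {
    // Same pattern as in the answer.
    static final Pattern MARKUP =
        Pattern.compile("`([^`]*)`|(\\*\\*|[_~])((?:(?!\\2).)*)\\2");

    // Returns one line per match showing the content of Groups 1-3.
    static List<String> describe(String input) {
        List<String> rows = new ArrayList<>();
        Matcher m = MARKUP.matcher(input);
        while (m.find()) {
            rows.add("g1=" + m.group(1) + " g2=" + m.group(2) + " g3=" + m.group(3));
        }
        return rows;
    }

    public static void main(String[] args) {
        describe("**bold** `_code_`").forEach(System.out::println);
        // g1=null g2=** g3=bold
        // g1=_code_ g2=null g3=null
    }
}
```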

See the Java demo:

String s = "**bold**  _italic_ ~strike~ `**code**`";
String regex = "(?s)`([^`]*)`|(\\*\\*|[_~])((?:(?!\\2).)*)\\2";
System.out.println(s.replaceAll(regex, "$1$3")); 
// => bold  italic strike **code**
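
Note that a single replaceAll pass removes only the outermost delimiters, so for the composite example from the question, **_bold and italic_**, it leaves _bold and italic_. Running the same replaceAll again is not safe, because the first pass has already removed the backticks that protected code. One possible way around this, shown here as a sketch rather than as part of the original answer, is to recurse into Group 3 only, using Matcher.replaceAll with a replacer function (available since Java 9). The class name MarkupStripper and the method strip are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class MarkupStripper {
    // Same pattern as in the answer, with (?s) so that . also matches line breaks.
    private static final Pattern MARKUP =
        Pattern.compile("(?s)`([^`]*)`|(\\*\\*|[_~])((?:(?!\\2).)*)\\2");

    static String strip(String input) {
        return MARKUP.matcher(input).replaceAll(m -> {
            if (m.group(1) != null) {
                // Code span: keep the content untouched.
                return Matcher.quoteReplacement(m.group(1));
            }
            // Markup span: strip any markup nested inside it as well.
            return Matcher.quoteReplacement(strip(m.group(3)));
        });
    }

    public static void main(String[] args) {
        System.out.println(strip("**_bold and italic_** `**code**`"));
        // => bold and italic **code**
    }
}
```

Matcher.quoteReplacement is needed because the string returned by the replacer is still interpreted as a replacement pattern, so a literal $ or \ in a message would otherwise be misread. The sketch assumes delimiters are properly nested and that a code span never contains the delimiter of an enclosing markup span.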
