Using Java Regex to match alphabetical characters not preceded by a percent sign

Question

tl;dr:

I want to take a string like: ab%cde%fg hij %klm n%op

And convert it to any of the following (all are acceptable):

'ab'%c'de'%f'g hij '%k'lm n'%o'p'
'ab'%c'de'%f'g' 'hij' %k'lm' 'n'%o'p'
'a''b'%c'd''e'%f'g' 'h''i''j' %k'l''m' 'n'%o'p'

(If an alphabetical character is not preceded by a %, it needs to be within single quotes. Extra opening and closing single quotes are acceptable.)

Use Case

I'm trying to take a string in C strftime format and convert it to work with Java's SimpleDateFormat. For the most part, this is pretty straightforward:

String format = "%y-%m-%d %H:%M:%S";

Map<String, String> replacements = new HashMap<String, String>() {{
    put("%a", "EEE");
    put("%A", "EEEE");
    put("%b", "MMM");
    put("%B", "MMMM");
    put("%c", "EEE MMM dd HH:mm:ss yyyy");
    // ... for each strftime token, create a mapping ...
}};

for ( String key : replacements.keySet() )
{
    // apply the mappings one at a time
    format = format.replaceAll(key, replacements.get(key));
}

// Then format
SimpleDateFormat df = new SimpleDateFormat(format, Locale.getDefault());
System.out.println(df.format(Calendar.getInstance().getTime()));

However, when I introduce character literals, it runs into issues. According to the strftime documentation, all character literals not preceded by a percent sign are passed along to the output string without modification. So:

Format: "%y is a great year!"
Output: "2019 is a great year!"

However, with SimpleDateFormat, all letters are treated as tokens unless surrounded by single quotes:

Format: "yyyy 'is a great year!'"
Output: "2019 is a great year!"

Format: "yyyy is a great year!"
Output: ERROR - invalid token "i"
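
The quoting behaviour can be checked with a small sketch like the one below. The class and method names are hypothetical, and a fixed date (16 May 2019) with Locale.US is assumed so that the output is reproducible. It also shows that two adjacent single quotes inside quoted text produce a literal apostrophe, which matters when quoted runs end up directly next to each other.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class QuotingDemo {
    // Formats a fixed date (16 May 2019) with the given SimpleDateFormat pattern.
    static String format(String pattern) {
        Date date = new GregorianCalendar(2019, Calendar.MAY, 16).getTime();
        return new SimpleDateFormat(pattern, Locale.US).format(date);
    }

    public static void main(String[] args) {
        // Quoted text is copied to the output unchanged.
        System.out.println(format("yyyy 'is a great year!'"));

        // Inside quoted text, '' stands for one literal apostrophe.
        System.out.println(format("'it''s' yyyy"));

        // Unquoted letters are parsed as pattern letters; 'i' is not one of them.
        try {
            format("yyyy is a great year!");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```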

Desired Output

Because strftime tokens are always a single character, it shouldn't be too difficult to fix our format string. In a worst-case scenario, the rule is "if a letter is not preceded by a % sign, wrap it in single quotes", which would lead to:

Format: "%y is a great year!"
Processed: "%y 'i''s' 'a' 'g''r''e''a''t' 'y''e''a''r'!"

This is ugly, but would behave as expected and is an acceptable answer. Ideally we would wrap all runs of alphabetical characters not preceded by a %, like so:

Format: "%y is a great year!"
Processed: "%y 'is' 'a' 'great' 'year'!"

Or, better yet, wrap all runs of non-% characters, including non-alphabetical ones:

Format: "%y is a great year!"
Processed: "%y' is a great year!'"

What I've tried

I started with a mindless regular expression that I was pretty sure wouldn't work, and it didn't:

format.replaceAll("[^%]([a-zA-Z]+)", "'$1'");
// Format:   "Literal %t Literal"
// Output:   "'iteral' %t'Literal'"
// Expected: "'Literal' %t 'Literal'"

I don't have a firm grasp on lookarounds, so I gave them a whirl but messed something up there as well:

format.replaceAll("(?!%)([a-zA-Z]+)", "'$1'");
// Format:   "Literal %t Literal"
// Output:   "'Literal' %'t' 'Literal'"
// Expected: "'Literal' %t 'Literal'"

I also considered writing a very simple lexer. Something like:

StringBuffer s = new StringBuffer();
boolean inQuote = false;
for (int i = 0; i < format.length; i++)
{
    if (format[i] == '%')
    {
        i++;
        s.append(replacements.get(format[i]);
    }
    else if (inQuote)
    {
        s.append(format[i]);
    }
    else
    {
        s.append("'");
        inQuote = true;
        s.append(format[i]);
    }
}

However, I learned that format[i] isn't valid Java syntax, and I didn't spend much time looking into how to properly get a character from a string before deciding to just post here.
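
For reference, Java strings are indexed with charAt(i) and sized with length(). The loop above could be sketched along those lines as follows. Several things here are assumptions rather than part of the original: the map is keyed by the single token character instead of "%x" strings, unmapped tokens throw an exception, apostrophes in literal text are escaped as '' (the SimpleDateFormat convention), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class StrftimeLexer {
    // Converts a strftime-style format into a SimpleDateFormat-style pattern,
    // wrapping every run of literal text in single quotes.
    static String convert(String format, Map<Character, String> replacements) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        for (int i = 0; i < format.length(); i++) {
            char c = format.charAt(i);
            if (c == '%' && i + 1 < format.length()) {
                // Close any open literal run before emitting a token.
                if (inQuote) {
                    out.append('\'');
                    inQuote = false;
                }
                i++; // move to the token character after '%'
                String mapped = replacements.get(format.charAt(i));
                if (mapped == null) {
                    throw new IllegalArgumentException(
                            "Unmapped strftime token: %" + format.charAt(i));
                }
                out.append(mapped);
            } else {
                // Open a literal run if one is not already open.
                if (!inQuote) {
                    out.append('\'');
                    inQuote = true;
                }
                // A literal apostrophe is written as '' inside quoted text.
                out.append(c == '\'' ? "''" : String.valueOf(c));
            }
        }
        if (inQuote) {
            out.append('\''); // close the trailing literal run
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative mappings only; a real table would cover every token.
        Map<Character, String> replacements = new HashMap<>();
        replacements.put('Y', "yyyy");
        replacements.put('m', "MM");
        replacements.put('d', "dd");
        System.out.println(convert("%Y is a great year!", replacements));
        System.out.println(convert("%Y-%m-%d", replacements));
    }
}
```

Because each literal run is closed only when the next token starts, quoted runs never sit directly next to each other in the result.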

I would prefer a regular expression solution so that I can write it in a single line instead of a loop like this.

Answer 1

This has been updated to work with a single regex. Additional formats can be added to test for correctness.

String[] formats = { "ab%cde%fg hij %klm n%op", "ab%c", "%d" };
for (String f : formats) {
    String parsed = f.replaceAll("(^[a-z]+|(?<=%[a-z])([a-z ]+))", "'$1'");
    System.out.println(parsed);
}

The regex covers two cases:

Put all characters [a-z ]+ that follow %[a-z] between single quotes.
Put any leading characters ^[a-z]+ at the start of the string, which the first case does not cover, between single quotes.
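
As a runnable sketch of the two cases, the class below wraps the regex from this answer and, for comparison, a negative-lookbehind variant. The variant is not part of the original answer: it quotes every run of letters whose first letter is not directly preceded by %, which also covers uppercase letters. Class and method names are hypothetical, and neither version handles %% or apostrophes in the literal text.

```java
class QuoteByRegex {
    // The regex from this answer: lowercase letters and spaces only.
    static String answerRegex(String format) {
        return format.replaceAll("(^[a-z]+|(?<=%[a-z])([a-z ]+))", "'$1'");
    }

    // Variant (an assumption, not from the answer): a run of letters may start
    // anywhere except directly after '%', so the token letter is skipped.
    static String lookbehindVariant(String format) {
        return format.replaceAll("(?<!%)([a-zA-Z]+)", "'$1'");
    }

    public static void main(String[] args) {
        System.out.println(answerRegex("ab%cde%fg hij %klm n%op"));
        System.out.println(lookbehindVariant("ab%cde%fg hij %klm n%op"));
        System.out.println(lookbehindVariant("Literal %t Literal"));
    }
}
```

The lookbehind (?<!%) inspects the character before the current position, whereas a lookahead such as (?!%) inspects the character at the current position, which is why a lookahead cannot exclude the token letter.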

Answer 2

Why not use several replaceAll calls, since you have already considered that approach?

First, add single quotes around every run of consecutive letters;

Then, wherever an opening quote directly follows a %, move that quote one character to the right;

Last, remove empty quotes.

Below is my testing code in Python. I believe it works in other languages such as Java as well.

>>> import re
>>> fmt = "Literal %t Literal"
>>> str1 = re.sub(r"([a-zA-Z]+)", r"'\g<1>'", fmt)
>>> str2 = re.sub(r"%'([a-zA-Z])", r"%\g<1>'", str1)
>>> str3 = re.sub(r"''", "", str2)
>>> str1
"'Literal' %'t' 'Literal'"
>>> str2
"'Literal' %t'' 'Literal'"
>>> str3
"'Literal' %t 'Literal'"

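A Java sketch of the three steps described in this answer (quote every run of letters, shift the quote past the token letter, drop empty quotes) might look like the following. The class and method names are hypothetical, and %% or apostrophes in the literal text are not handled.

```java
class QuoteLiterals {
    static String quoteLiterals(String format) {
        return format
                // Step 1: wrap every run of letters in single quotes.
                .replaceAll("([a-zA-Z]+)", "'$1'")
                // Step 2: move a quote that directly follows % past the token letter.
                .replaceAll("%'([a-zA-Z])", "%$1'")
                // Step 3: drop the empty quotes left behind by single-letter runs.
                .replace("''", "");
    }

    public static void main(String[] args) {
        System.out.println(quoteLiterals("Literal %t Literal"));
        System.out.println(quoteLiterals("ab%cde%fg hij %klm n%op"));
    }
}
```

The last call uses String.replace rather than replaceAll because the search text is a plain literal, not a regex.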