Regex and escaped and unescaped delimiter

question related to this

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it on semicolons with the following rules:

  1. If a semicolon is preceded by a backslash, it should not be treated as a separator (between a and b).

  2. If the backslash is itself escaped and therefore does not escape the semicolon, that semicolon should be a separator (between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.

For the example above, I want to get the following strings (double backslashes for the Java compiler):

a\;b\\
c
d

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

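import java.util.*;
import java.util.regex.*;

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // The pattern also matches the empty string (e.g. just before each
    // semicolon), so keep only the non-empty matches.
    if (!regexMatcher.group().isEmpty()) {
        matchList.add(regexMatcher.group());
    }
}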

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.

The possessive quantifier (++) is important to avoid catastrophic backtracking caused by the nested quantifiers.
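
For reference, here is a self-contained sketch of this approach run against the string from the question. The class name UnescapedMatchDemo and the helper fields are made up for illustration. Because the pattern can also match the empty string (find() reports an empty match just before each semicolon and at the end of the input), the sketch skips empty matches, which means genuinely empty fields are dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapedMatchDemo {
    // Either an escaped character (backslash + any char) or a possessive run
    // of characters that are neither semicolons nor backslashes.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    // Hypothetical helper: collects the non-empty matches of FIELD.
    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {   // skip the empty matches
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The Java literal below is the text a\;b\\;c;d
        for (String field : fields("a\\;b\\\\;c;d")) {
            System.out.println(field);
        }
    }
}
```

For the question's input this prints the three strings a\;b\\, c and d, one per line.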

I would not trust any kind of regular expression to detect those cases. I usually write a simple loop for such things. I'll sketch it in C, since it has been ages since I last touched Java ;-)

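/* Needs <stdio.h> and <string.h>; myString is a NUL-terminated char array. */
int i, len, state;
char c;

/* state 0: normal; state 1: the previous character was an unescaped backslash */
for (len = (int)strlen(myString), state = 0, i = 0; i < len; i++) {
    c = myString[i];
    if (state == 0) {
        if (c == '\\') {
            state++;                        /* the next character is escaped */
        } else if (c == ';') {
            printf("; at offset %d\n", i);  /* unescaped separator */
        }
    } else {
        state--;                            /* skip the escaped character */
    }
}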

The advantages are:

  1. you can execute semantic actions on each step.
  2. it's quite easy to port it to another language.
  3. you don't need to include the complete regex library just for this simple task, which adds to portability.
  4. it should be faster than the regular expression matcher, since it makes a single pass over the string with no backtracking.

EDIT: I have added a complete C++ example for clarification.

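#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Splits s at unescaped semicolons and removes the escaping backslashes.
// Empty fields are skipped.
std::vector<std::string> unescapeString(const char* s)
{
    std::vector<std::string> result;
    std::stringstream ss;
    bool has_chars;   // true once the current field has at least one character
    int state;        // 0 = normal, 1 = previous character was a backslash

    for (has_chars = false, state = 0;;) {
        auto c = *s++;

        if (state == 0) {
            if (!c) {
                // End of input: flush the last field, if any.
                if (has_chars) result.push_back(ss.str());
                break;
            } else if (c == '\\') {
                ++state;
            } else if (c == ';') {
                // Unescaped separator: emit the field and start a new one.
                if (has_chars) {
                    result.push_back(ss.str());
                    has_chars = false;
                    ss.str("");
                }
            } else {
                ss << c;
                has_chars = true;
            }
        } else /* if (state == 1) */ {
            if (!c) {
                // Input ends with a lone backslash: keep it literally.
                ss << '\\';
                result.push_back(ss.str());
                break;
            }

            // Escaped character: copy it without the backslash.
            ss << c;
            has_chars = true;
            --state;
        }
    }

    return result;
}

int main(int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        for (const auto& s: unescapeString(argv[i])) {
            std::cout << s << std::endl;
        }
    }
}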
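
Since the question is about Java, the same state-machine idea might look roughly like this there. This is only a sketch: the class name EscapeAwareSplitter and the method split are invented for illustration, and unlike the C++ version it keeps the backslashes and the empty fields, so that the output matches what the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    // Splits on ';' unless it is escaped by a preceding unescaped backslash.
    // Escape sequences are copied through unchanged.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;               // true right after an unescaped backslash

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                current.append(c);             // escaped character: never a separator
                escaped = false;
            } else if (c == '\\') {
                current.append(c);             // keep the backslash in the output
                escaped = true;
            } else if (c == ';') {
                parts.add(current.toString()); // unescaped semicolon ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());         // last field (may be empty)
        return parts;
    }

    public static void main(String[] args) {
        // The Java literal below is the text a\;b\\;c;d
        System.out.println(split("a\\;b\\\\;c;d"));   // prints [a\;b\\, c, d]
    }
}
```

A single boolean is enough here because only one character of lookback matters: whether the previous character was an unescaped backslash.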

I think this is the real answer. In my case I am trying to split on | and the escape character is &.

    final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
    String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
    System.out.println(Arrays.toString(res));

In this code I am using a lookbehind to handle the & escape character. Note that the lookbehind must have a bounded maximum length, which is why the repetition is written as {0,10000} rather than *.

(?<!((?:[^&]|^)(&&){0,10000}&))\\|

This means any | except those that follow ((?:[^&]|^)(&&){0,10000}&), and that part means any odd number of & characters. The part (?:[^&]|^) is important: it makes sure you count all of the & characters behind the |, back to the beginning of the string or to some other character.
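
To make the odd/even counting concrete, here is a small sketch that applies the same pattern to two short inputs. The class name AmpersandEscapeDemo and the inputs are invented for illustration.

```java
import java.util.Arrays;

class AmpersandEscapeDemo {
    // '|' is a separator unless an odd number of '&' characters precedes it.
    static final String SEPARATOR = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";

    static String[] split(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // One '&' before the first '|': that '|' is escaped, the second is not.
        System.out.println(Arrays.toString(split("a&|b|c")));
        // Two '&' before the '|': the '&' is itself escaped, so the '|' splits.
        System.out.println(Arrays.toString(split("a&&|b")));
    }
}
```

The first call should give the two parts a&|b and c; the second should give a&& and b.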

String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");

This should work.

Explanation :

// (?<!(?<!\\)\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

So you just match the semicolons not preceded by exactly one \ .
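
A quick check of this pattern, as a sketch with an invented class name (NestedLookbehindDemo). The first input is the string from the question. The second has three backslashes before the semicolon and shows that the pattern only looks two characters back, so longer runs of backslashes are not counted.

```java
import java.util.Arrays;

class NestedLookbehindDemo {
    // A semicolon is a separator unless it is preceded by a backslash that is
    // itself not preceded by another backslash.
    static final String SEPARATOR = "(?<!(?<!\\\\)\\\\);";

    static String[] split(String input) {
        return input.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // Text a\;b\\;c;d from the question: splits into a\;b\\, c and d.
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
        // Text a\\\;b : an odd number of backslashes, so the semicolon is
        // escaped by the question's rule, yet the pattern still splits here.
        System.out.println(Arrays.toString(split("a\\\\\\;b")));
    }
}
```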

EDIT :

String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");

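This is intended to take care of any odd number of \. It will of course fail if you have more than 4000000 \ characters. Note, however, that the group may repeat zero times, so a single preceding \ is already enough to satisfy the inner lookbehind; the pattern therefore behaves like the shorter one above and still splits at a semicolon preceded by three backslashes. Explanation of the edited answer: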

// (?<!(?<!\\(\\\\){0,2000000})\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

This approach assumes that your string does not contain the character '\0'. If it does, you can use some other character as the placeholder.

public static String[] split(String s) {
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}
