Regular expression to preserve double quotes, single quotes, and hyphens while splitting on whitespace

I use the Java Pattern class and specify the regex as a string.

For example, given the input I love being spider-man : "Peter Parker"

the result should list spider-man and "Peter Parker" as separate tokens. Thanks.

try (BufferedReader br = new BufferedReader(new FileReader(f))) {
    // Read the whole file into one string.
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        // readLine() strips the line terminator, so add a space to keep
        // the last word of one line apart from the first word of the next.
        sb.append(line).append(' ');
    }
    String everything = sb.toString();

    // Tokenize with Lucene's PatternTokenizer; group 0 uses the whole match as the token.
    List<String> result = new ArrayList<String>();
    Pattern pat = Pattern.compile("([\"'].*?[\"']|[^ ]+)");
    PatternTokenizer pt = new PatternTokenizer(new StringReader(everything), pat, 0);
    while (pt.incrementToken()) {
        result.add(pt.getAttribute(CharTermAttribute.class).toString());
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}

So I guess the reason "some word" is not working is that each token is itself a string. Any clues? Thank you.

If it doesn't have to be a regex and the quotes in your string are well formed (properly paired, not interleaved like " ' some data " '), then you can do it in a single pass:

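// Requires java.util.ArrayList and java.util.List.
String data = "I love being spider-man : \"Peter Parker\" or 'photo reporter'";

List<String> tokens = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
boolean inSingleQuote = false;
boolean inDoubleQuote = false;

for (char c : data.toCharArray()) {
    // Toggle the quote state; the quote character itself stays in the token.
    if (c == '\'') inSingleQuote = !inSingleQuote;
    if (c == '"') inDoubleQuote = !inDoubleQuote;
    if (c == ' ' && !inSingleQuote && !inDoubleQuote) {
        // A space outside quotes ends the current token; skip the empty
        // tokens that consecutive spaces would otherwise produce.
        if (sb.length() > 0) {
            tokens.add(sb.toString());
            sb.setLength(0);
        }
    } else {
        sb.append(c);
    }
}
// Flush the last token, if any.
if (sb.length() > 0) tokens.add(sb.toString());
System.out.println(tokens);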

Output:

[I, love, being, spider-man, :, "Peter Parker", or, 'photo reporter']

Check whether this regex is what you need:

"([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))"

This assumes that no single or double quote appears inside a quoted string.
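If that assumption is too strict, one possible variation (not part of the regex above) is to capture the opening quote and require the same character to close it with a backreference. The sketch below is only an illustration: the class name BackrefTokenizer and the sample sentence are made up, and it still assumes that quotes of the same kind are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class BackrefTokenizer {
    // Group 1 captures the opening quote; \1 requires the same character
    // to close the token, so an apostrophe inside "..." does not end it.
    private static final Pattern TOKEN = Pattern.compile("([\"']).*?\\1|[^ ]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // whole match, quotes included
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [say, "it's fine", now]
        System.out.println(tokenize("say \"it's fine\" now"));
    }
}
```

Note that group 1 now holds only the quote character, so the whole match (m.group()) is used as the token.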

There is also an assumption about delimiters: only space and : act as delimiters, so nothing is matched in "foo_bar". To allow more delimiters, such as ; . , and ?, add them to the character class in both the lookahead and the lookbehind assertions, like this:

"([\"'].*?[\"']|(?<=[ :;.,?]|^)[a-zA-Z0-9-]+(?=[ :;.,?]|$))"
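To see the effect of the delimiter set, here is a small sketch that builds the same regex from a configurable delimiter list and runs it on a few inputs. The class StrictTokenizer, the helper strictPattern, and the sample strings are assumptions made for this illustration; the delimiter string is inserted into a character class as-is, so it should not contain characters such as ] or ^ that are special there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class StrictTokenizer {
    // Builds the strict regex with the given delimiter characters used in
    // both the lookbehind and the lookahead.
    static Pattern strictPattern(String delimiters) {
        String d = "[" + delimiters + "]";
        return Pattern.compile(
            "([\"'].*?[\"']|(?<=" + d + "|^)[a-zA-Z0-9-]+(?=" + d + "|$))");
    }

    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern narrow = strictPattern(" :");     // space and colon only
        Pattern wide = strictPattern(" :;.,?");   // extended delimiter set

        System.out.println(tokenize(narrow, "foo_bar"));          // prints []
        System.out.println(tokenize(narrow, "alpha;beta gamma")); // prints [gamma]
        System.out.println(tokenize(wide, "alpha;beta gamma"));   // prints [alpha, beta, gamma]
    }
}
```

With the narrow set, alpha and beta are dropped because they touch ; which is not a recognized delimiter; the wide set picks them up.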

I have not tested it on every input, but I did test it on this one:

"    sdfsdf \" sdfs  sdfsdfs \"   \"sdfsdf\"  sdfsdf   sdfsd  dsfshj sdfsdf-sdf  'sdfsdfsdf  sd f '  "
// I used replaceAll to check the captured group
.replaceAll("([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))", "X$1Y")

And it works fine for me.

If you want more liberal matching, still under the same assumption about quoting:

"([\"'].*?[\"']|[^ ]+)"

To extract matches:

Matcher m = Pattern.compile(regex).matcher(inputString);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
    tokens.add(m.group(1));
}
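Put together as a self-contained program, the liberal regex applied to the sentence from the question could look like the sketch below. The class name LiberalTokenizer is made up for this example, and it uses plain java.util.regex rather than Lucene's PatternTokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LiberalTokenizer {
    // Either a quoted run (quotes kept) or a run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("([\"'].*?[\"']|[^ ]+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [I, love, being, spider-man, :, "Peter Parker"]
        System.out.println(tokenize("I love being spider-man : \"Peter Parker\""));
    }
}
```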
