
Regex to select semicolons that are not enclosed in double quotes

I have a string like

a;b;"aaa;;;bccc";deef

I want to split the string on the delimiter ; only when the ; is not inside double quotes. After the split, the result should be

a
b
"aaa;;;bccc"
deef

I tried using look-behind, but I'm not able to find a correct regular expression for splitting.

Regular expressions are probably not the right tool for this. If possible, use a CSV library and specify ; as the delimiter and " as the quote character; that should give you exactly the fields you are looking for.

That being said, here is one approach: it splits on a ; only when an even number of quotation marks lies between that ; and the end of the string.

;(?=(([^"]*"){2})*[^"]*$)

Example: http://www.rubular.com/r/RyLQyR8F19

This will break down if a quoted string can contain escaped quotation marks, for example a;"foo\"bar";c.
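As a rough sketch of how the lookahead pattern above might be applied, here it is with Python's re.split. One change is an assumption of this sketch rather than part of the answer: the groups are written as non-capturing (?:...), because re.split also returns the text of any capturing groups. The helper name split_unquoted is hypothetical.

```python
import re

# Same lookahead as above, but with non-capturing groups so that re.split
# returns only the fields and not the contents of the groups.
UNQUOTED_SEMICOLON = re.compile(r';(?=(?:(?:[^"]*"){2})*[^"]*$)')


def split_unquoted(text):
    """Split on semicolons followed by an even number of double quotes."""
    return UNQUOTED_SEMICOLON.split(text)


if __name__ == "__main__":
    for field in split_unquoted('a;b;"aaa;;;bccc";deef'):
        print(field)
```

Every candidate ; makes the lookahead scan to the end of the string, so this may get slow on very long inputs.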

Here is a much cleaner example using Python's csv module:

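import csv, io
# Python 3: io.StringIO wraps the string as a file-like object for csv.reader
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    # Each row is a list of fields; the surrounding quotes are already removed
    print('\n'.join(row))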
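The escaped-quote case mentioned earlier, which the regex cannot handle, is something the csv module can cope with if it is told what the escape character is. The following is only a sketch under that assumption (backslash as the escape character); the function name parse_fields is hypothetical.

```python
import csv


def parse_fields(line):
    """Parse one ;-delimited line, treating backslash as the escape character."""
    # csv.reader accepts any iterable of lines, so a one-element list is enough.
    reader = csv.reader([line], delimiter=';', quotechar='"', escapechar='\\')
    return next(reader)


if __name__ == "__main__":
    # The raw string keeps the backslash, so the input is: a;"foo\"bar";c
    print(parse_fields(r'a;"foo\"bar";c'))
```

Note that csv returns the field contents without the surrounding quotes, so if the quotes must be preserved they have to be added back afterwards.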

This is kind of ugly, but if you don't have \" inside your quoted strings (that is, no strings that look like "foo bar \"badoo\" goo"), you can split on " first. Counting the pieces from one, every even-numbered piece (odd zero-based index) is then a quoted string, and each odd-numbered piece can be split into its component parts on the ; token.

If you do have \" in your strings, first convert those into a temporary token, then convert the token back after you've performed the split.

Here's a fiddle...

http://jsfiddle.net/VW9an/

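var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth';

// Hide escaped quotes (\") behind a temporary token so they are not treated as delimiters
var strCp = str.replace(/\\"/g, "--##--");

// After splitting on ", the pieces at odd indexes were inside quotes
var parts = strCp.split('"');

var allPieces = [];
for (var i = 0; i < parts.length; i++) {
    if (i % 2 === 0) {
        // Outside quotes: split further on ;
        var innerParts = parts[i].split(';');
        for (var j = 0; j < innerParts.length; j++) {
            allPieces.push(innerParts[j]);
        }
    } else {
        // Inside quotes: keep the piece whole and put the quotes back
        allPieces.push('"' + parts[i] + '"');
    }
}

// Restore the escaped quotes
for (var a = 0; a < allPieces.length; a++) {
    allPieces[a] = allPieces[a].replace(/--##--/g, '\\"');
}

console.log(allPieces);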

Regular expressions will only get messier and break on even minor changes. You are better off using a CSV parser from whichever scripting language you have. Perl has a built-in module called Text::ParseWords (so there is nothing to download from CPAN if that is restricted) that lets you specify the delimiter, so you are not limited to , as the separator. Here is a sample snippet:

#!/usr/local/bin/perl

use strict;
use warnings;

use Text::ParseWords;

my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);

print "$_\n" for @ary;

Output (the quotes are removed because the second argument to parse_line, keep, is 0)

a
b
aaa;;;bccc
deef

Match All instead of Splitting

Answering long after the battle because no one used the way that seems the simplest to me.

Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:

"[^"]*"|[^";]+

See the matches in the Regex Demo.

  • The left side of the alternation | matches full quoted strings
  • The right side matches any chars that are neither ; nor "
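A minimal sketch of the match-all idea in Python, using re.findall with the same pattern. The name match_fields is hypothetical and only for illustration.

```python
import re

# Left alternative: a whole quoted string.
# Right alternative: a run of characters that are neither ; nor ".
FIELD = re.compile(r'"[^"]*"|[^";]+')


def match_fields(text):
    """Return every quoted string or unquoted run found in text."""
    return FIELD.findall(text)


if __name__ == "__main__":
    for field in match_fields('a;b;"aaa;;;bccc";deef'):
        print(field)
```

One difference from splitting: an empty field produces no match, so 'a;;b' gives two items rather than three.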

The technical post pages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 