Regex doesn't give me expected result

Okay, I give up - time to call upon the regex gurus for some help.

I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.

Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.

Here's my simple code:

const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") |  (("+dataLine+@"\r\n)*"+dataLine +")";

Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
    return validCSVDataPattern.IsMatch(fileContents);
}

This gives me a regex pattern of

(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) |  ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)

However, if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? The C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).

Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.

Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.

For example, this text matches the expression:

foo, bar

and therefore this text also matches:

var result = calculate(foo, bar);

You can see where this is going.

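Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect. Because the pattern contains a top-level |, wrap it in a group first, as in ^(...)$, so that both anchors apply to both alternatives rather than ^ binding only to the first and $ only to the second.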
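To make the difference concrete, here is a minimal sketch that reuses the dataWord and dataLine building blocks from the question for a single line of data. The class name AnchorDemo and the field names are made up for illustration; only Regex.IsMatch and RegexOptions.IgnoreCase come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Same building blocks as in the question.
    const string DataWord = @"[a-z0-9_\-]+";
    const string DataLine = "(" + DataWord + @"\s*,\s*)*" + DataWord;

    // Unanchored: IsMatch succeeds if ANY substring looks like a data line.
    public static readonly Regex Unanchored =
        new Regex(DataLine, RegexOptions.IgnoreCase);

    // Anchored: the whole input must be one data line.
    // The non-capturing group keeps ^ and $ attached to the entire expression.
    public static readonly Regex Anchored =
        new Regex("^(?:" + DataLine + ")$", RegexOptions.IgnoreCase);
}
```

With these definitions, AnchorDemo.Unanchored.IsMatch("var result = calculate(foo, bar);") should come back true, because the substring "var" alone satisfies the pattern, while AnchorDemo.Anchored rejects the same input and still accepts "foo, bar".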

Here is a better pattern, which looks for one or more CSV groups such as XXX, or yyy on each line:

^([\w\s_\-]*,?)+$

^ - Start of each line

( - a CSV match group start

[\w\s_\-]* - Valid characters in each CSV field: \w (letters, digits, and _, so the explicit _ is redundant), whitespace \s, and -

,? - maybe a comma

)+ - End of the csv match group, 1 to many of these expected.

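That will validate a whole file, line by line, for a basic CSV structure, provided the regex is created with RegexOptions.Multiline so that ^ and $ match at each line. It also allows for empty ,, situations.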

I came up with this regex:

^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$

Tests

asbc_- ,   khkhkjh,    lkjlkjlkj_-,     j : PASS
asbc,                                     : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j           : PASS

If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use

^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$
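The two variants can be compared side by side with a small sketch like the one below. The class name LinePatterns and the field names Strict and Lenient are hypothetical; the patterns themselves are copied from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinePatterns
{
    // Strict: every field must contain at least one character.
    public static readonly Regex Strict = new Regex(
        @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$", RegexOptions.IgnoreCase);

    // Lenient: fields may be empty, so lines such as ",,," are accepted.
    public static readonly Regex Lenient = new Regex(
        @"^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$", RegexOptions.IgnoreCase);
}
```

For example, LinePatterns.Strict.IsMatch("asbc,") is false because the trailing comma has no word after it, whereas LinePatterns.Lenient accepts "asbc,", ",,," and ",abcd,,". Note that the lenient form also accepts a completely empty line.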

Loop through all the lines to see if the file is ok:

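// Verbatim string (@"...") so that \- and \s reach the regex engine unchanged.
const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
    // Split on either Windows or Unix line endings.
    // Note: a trailing newline produces an empty last entry, which this pattern rejects.
    string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);

    foreach (var line in lines)
    {
        // Reject the file as soon as one line fails the pattern.
        if (!validCSVDataPattern.IsMatch(line))
            return false;
    }

    return true;
}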

I think this is what you're looking for:

@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"

The noteworthy changes are:

  • Added anchors (^ and $), because the regex is totally pointless without them
  • Removed the stray spaces from the pattern (they have to match literal spaces in the input, and I don't think that's what you intended)
  • Replaced the \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots)

The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)

P.S. In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.
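As a rough sketch of how this whole-file pattern could be dropped into the original check, the snippet below wraps it in a helper. The names WholeFileCheck and LooksLikeCsv are invented for the example; no RegexOptions are passed because (?in) already sets them inline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeFileCheck
{
    // Pattern from the answer above; (?in) enables IgnoreCase and ExplicitCapture inline.
    static readonly Regex CsvFile = new Regex(
        @"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$");

    // Returns true only if the entire input consists of simple CSV lines.
    public static bool LooksLikeCsv(string fileContents)
    {
        return CsvFile.IsMatch(fileContents);
    }
}
```

Under these assumptions, WholeFileCheck.LooksLikeCsv("foo, bar\r\nBaz_1 ,qux-2") should return true, while the C# line "var result = calculate(foo, bar);" is rejected because the first word is not followed by a comma, a line break, or the end of the input.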
