
Split a text file that uses multiple delimiters

I am trying to read a text file whose values are delimited by spaces, with some values additionally wrapped in double quotes, and there is no easy way to tell the two cases apart. I wanted to check whether this can be achieved with a regular expression; otherwise I will need to write a custom split.

Here is the string

"myfile-one two" "1" 3 1453454.00 -134557.63 585.0 24444.8 -999 "NULL" "" 45.60 "" 67°32'5.23455"N 54°56'65.3454"W "NULL" 6.00

The output should be

myfile-one two
1
3
1453454.00
-134557.63
585.0
24444.8
-999
NULL
45.60

67°32'5.23455"N
54°56'65.3454"W
NULL
6.00

The code below splits on the space delimiter first, but this also splits inside the double quotes, so a quoted value containing a space ends up as separate entries:

char[] space = new Char[] { ' ' };

string[] data = comp.Split(space, StringSplitOptions.RemoveEmptyEntries);

You may match either a double-quoted substring that has whitespace (or the start/end of the string) on both sides, capturing what is inside the quotes into a named group, or any run of one or more non-whitespace characters, capturing it into an identically named group:

var results = Regex.Matches(str, @"(?<!\S)""(?<o>.*?)""(?!\S)|(?<o>\S+)")
                .Cast<Match>()
                .Select(m => m.Groups["o"].Value)
                .ToList();

See the regex demo.

Pattern details

  • (?<!\S) - a whitespace or start of string is required immediately to the left of the current location
  • " - a double quotation mark
  • (?<o>.*?) - Group "o": any 0+ chars other than newline, as few as possible
  • " - a double quotation mark
  • (?!\S) - a whitespace or end of string is required immediately to the right of the current location
  • | - or
  • (?<o>\S+) - Group "o": any 1+ non-whitespace chars.

.NET allows identically named groups inside one regex pattern. Whichever alternative matches stores its value in the same group, so you can collect it with .Select(m => m.Groups["o"].Value).
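To make the behaviour concrete, here is a minimal, self-contained sketch that runs the pattern above against a shortened version of the sample line. The helper name Tokenize and the shortened input are only for illustration, and the program assumes a compiler that supports top-level statements (C# 9 / .NET 5 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper name; the pattern is the one discussed above.
static List<string> Tokenize(string line) =>
    Regex.Matches(line, @"(?<!\S)""(?<o>.*?)""(?!\S)|(?<o>\S+)")
         .Cast<Match>()
         .Select(m => m.Groups["o"].Value) // same group name for both alternatives
         .ToList();

// Shortened sample: a quoted value with a space, an empty quoted value,
// and a coordinate that contains a quote in the middle of the token.
var sample = @"""myfile-one two"" ""1"" 3 -999 """" 67°32'5.23455""N ""NULL"" 6.00";

foreach (var token in Tokenize(sample))
    Console.WriteLine($"[{token}]");
```

The brackets in the output only make the empty entry produced by "" visible. The coordinate keeps its inner quote because it does not start with a double quote, so it is matched by the \S+ alternative.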

Since a regex can cost more than plain string operations and the described scenario is quite simple, I would like to offer a short, regex-free solution that uses string members only. The regex-free approach is also arguably easier to read.

// The escaped input string
var input = @"""myfile-one two"" ""1"" 3 1453454.00 -134557.63 585.0 24444.8 -999 ""NULL"" """" 45.60 """" 67°32'5.23455""N 54°56'65.3454""W ""NULL"" 6.00 ";

List<string> cleanedInputTokens = input
  .Split(new []{' '}, StringSplitOptions.RemoveEmptyEntries)
  .Select(token => token.Trim('"'))
  .ToList();

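The algorithm first splits the input into tokens and then trims the specified leading and trailing characters from each token. Because Split(char[], StringSplitOptions) and Trim(char[]) both accept an array of characters, the pattern is easy to extend to further delimiters or wrapping characters. Note that it splits on every space, so a quoted value that itself contains a space, such as "myfile-one two", comes out as two tokens (myfile-one and two).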
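If quoted values may contain spaces, a regex-free alternative is to scan the line character by character, which is roughly the "custom split" mentioned in the question. The following is only a sketch under stated assumptions: the helper name SplitQuoted is hypothetical, a quoted value is assumed to start with a double quote at the beginning of a token and to end at the first double quote followed by whitespace or the end of the line, and the program assumes top-level statements (C# 9 / .NET 5 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits on whitespace but keeps a quoted value together.
static List<string> SplitQuoted(string line)
{
    var tokens = new List<string>();
    int i = 0;
    while (i < line.Length)
    {
        if (char.IsWhiteSpace(line[i])) { i++; continue; }

        if (line[i] == '"')
        {
            // Closing quote: the first quote followed by whitespace or end of line.
            int close = -1;
            for (int j = i + 1; j < line.Length; j++)
            {
                if (line[j] == '"' && (j + 1 == line.Length || char.IsWhiteSpace(line[j + 1])))
                {
                    close = j;
                    break;
                }
            }
            if (close >= 0)
            {
                tokens.Add(line.Substring(i + 1, close - i - 1)); // contents without quotes
                i = close + 1;
                continue;
            }
            // No closing quote found: fall through and treat it as a plain token.
        }

        // Plain token: everything up to the next whitespace character.
        int start = i;
        while (i < line.Length && !char.IsWhiteSpace(line[i])) i++;
        tokens.Add(line.Substring(start, i - start));
    }
    return tokens;
}

var input = @"""myfile-one two"" ""1"" 3 """" 67°32'5.23455""N ""NULL"" 6.00 ";
foreach (var token in SplitQuoted(input))
    Console.WriteLine($"[{token}]");
```

Unlike the Split/Trim version, this keeps myfile-one two as a single entry and still yields an empty entry for "". A token that opens with a quote but never closes it is returned unchanged, quote included.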
