TSV interpreter in Java

I am writing Java code to read and interpret a TSV file. I would like a regular expression that can split each line of the file into fields, given that:

  • Items are separated by tabs
  • Strings are surrounded by quotes
  • Numbers are not surrounded by quotes
  • Quoted strings can contain quotes, which are escaped by doubling them (i.e. "")
  • Strings can contain tabs

Sample input lines:

"aaa"    123    "bbb"    "cc"    "ddd"
"aaa"    123    "bbb"    "cc"    "    6"
"ddd"    456    "eee"    "ff"    "       ""     "
"ddd"    456    "eee"    "ff"    "    "" aaa ""   "

* (please note: the last string on each of the last three lines contains tabs)

My current regex is ("[^"]*"*|[^\t]+)+ , but it fails on the last example (it breaks the last field into smaller substrings).

Let's settle the case:

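\t(?=(?:[^"]*"[^"]*")*[^"]*$)

This matches a tab only when it is followed by an even number of quotes up to the end of the line, that is, a tab that lies outside any quoted string. Splitting on it therefore leaves the tabs inside quoted strings alone.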
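To see why the lookahead works, it can help to spell the same rule out by hand: a tab is a separator only if the number of quote characters after it, up to the end of the line, is even. The sketch below checks that rule with a plain loop and compares it with the regex on one line. The names QuoteParityDemo and isSeparator are made up for this illustration, and the code assumes the input is a single line without a trailing newline.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteParityDemo {
  // Same pattern as above, written as a Java string literal.
  static final Pattern TAB_OUTSIDE_QUOTES =
      Pattern.compile("\\t(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

  // Hand-written equivalent of the lookahead (illustrative helper):
  // the tab at index i is a separator if an even number of quotes
  // follows it on the line.
  static boolean isSeparator(String line, int i) {
    if (line.charAt(i) != '\t') return false;
    int quotes = 0;
    for (int j = i + 1; j < line.length(); j++) {
      if (line.charAt(j) == '"') quotes++;
    }
    return quotes % 2 == 0;
  }

  public static void main(String[] args) {
    // Three fields; the last one contains two tabs and two escaped quotes.
    String line = "\"ddd\"\t456\t\"\t\"\" aaa \"\"\t\"";
    Matcher m = TAB_OUTSIDE_QUOTES.matcher(line);
    while (m.find()) {
      System.out.println("regex separator at index " + m.start()
          + ", loop agrees: " + isSeparator(line, m.start()));
    }
  }
}
```

For this line the regex should report only the two tabs that sit between fields, not the two inside the quoted string. Note that the lookahead rescans the rest of the line for every tab, so very long lines may be better served by a single left-to-right scan.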

Sample code: ideone demo

import java.util.regex.Pattern;

public class Example {
  public static void main(String[] args) {
    // Tabs are written as \t so they survive copy and paste; the last field
    // of lines 2-4 contains tabs inside the quotes.
    String source = "\"aaa\"\t123\t\"bbb\"\t\"cc\"\t\"ddd\"\n"
                  + "\"aaa\"\t123\t\"bbb\"\t\"cc\"\t\"\t6\"\n"
                  + "\"ddd\"\t456\t\"eee\"\t\"ff\"\t\"\t\t\"\"\t\"\n"
                  + "\"ddd\"\t456\t\"eee\"\t\"ff\"\t\"\t\"\" aaa \"\"\t\"";
    Pattern reLines = Pattern.compile("\\n");
    // A tab followed by an even number of quotes up to the end of the line,
    // i.e. a tab that is not inside a quoted string.
    Pattern reTsv = Pattern.compile("\\t(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    for (String line : reLines.split(source)) {
      String[] parts = reTsv.split(line);
      for (int i = 0; i < parts.length; i++) {
        System.out.println("[" + i + "] = " + parts[i]);
      }
    }
  }
}

Output (tabs inside the quoted fields appear as whitespace):

[0] = "aaa"
[1] = 123
[2] = "bbb"
[3] = "cc"
[4] = "ddd"
[0] = "aaa"
[1] = 123
[2] = "bbb"
[3] = "cc"
[4] = "  6"
[0] = "ddd"
[1] = 456
[2] = "eee"
[3] = "ff"
[4] = "         ""     "
[0] = "ddd"
[1] = 456
[2] = "eee"
[3] = "ff"
[4] = " "" aaa ""   "
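As the output shows, the split leaves the surrounding quotes and the doubled "" escapes in place. If the raw values are needed, a small post-processing step can strip them. The following is a minimal sketch and not part of the original answer: FieldUnquoter and unquote are hypothetical names, and the code assumes every quoted field is well formed (it starts and ends with a quote).

```java
class FieldUnquoter {
  // Strips the surrounding quotes from a quoted field and turns each
  // doubled quote ("") back into a single quote. Unquoted fields
  // (the numbers) are returned unchanged.
  static String unquote(String field) {
    if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
      String inner = field.substring(1, field.length() - 1);
      return inner.replace("\"\"", "\"");
    }
    return field;
  }

  public static void main(String[] args) {
    // Fields as they come out of the split, still quoted and escaped.
    String[] parts = {"\"ddd\"", "456", "\"\t\"\" aaa \"\"\t\""};
    for (String part : parts) {
      System.out.println("[" + unquote(part) + "]");
    }
  }
}
```

One related detail: Pattern.split without a limit drops trailing empty strings, so a line that ends in empty fields yields fewer parts than expected. Passing a limit of -1, as in reTsv.split(line, -1), keeps them.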
