Java: Parsing Strings with extra quotes

Alternate problem title: Splitting a comma delimited list that is inside of a tab delimited list.

I'm looking for a solution that does not involve any packages other than the standard Java library. This has got to be something that has been solved before; I just don't know which keywords to use on Stack Overflow to find it!

I have a tab delimited file that I am parsing. I perform error checking on the fields after splitting the line to prevent bad data getting into my program. I pretty much have everything solved except for one field. The basic layout of the input line is:

field1<tab>field2<tab>field3<tab>field4

field3, by design, can contain:

  1. Empty string:

     field1<tab>field2<tab><tab>field4 
  2. One string, with or without blanks:

     field1<tab>field2<tab>Fred Flintstone<tab>field4 
  3. Multiple strings separated by commas:

     field1<tab>field2<tab>Fred, Barney, Wilma<tab>field4 

The line is read and split as follows:

    String entry = pq2File.readLine();
    String[] temp;
    temp = entry.split("\t", 4);

When I split the input line by "\t", my third field (temp[2]) is set as follows in each of the cases above:

  1. []
  2. [Fred Flintstone]
  3. [Fred, Barney, Wilma]

I then split field3 again by ",":

ArrayList<String> names = 
     new ArrayList<String>(Arrays.asList(temp[2].split(",")));

giving me the following values in the ArrayList names, in each of the cases above:

  1. [empty]
  2. Fred Flintstone
  3. Fred
    Barney
    Wilma

All this is handled correctly when I use a text editor to create the file, or SQL statements to pull the data out of an external, remote system to which I do not have access. The problem comes in with a user who insists on using MS Excel to create the file. In this case the line looks like this:

field1<tab>field2<tab>"Fred, Barney, Wilma"<tab>field4

When I parse the line, my variable gets the value

"Fred, Barney, Wilma"

And splitting it by "," results in:
"Fred
Barney
Wilma"

Obviously I want to get rid of the extra " marks. Am I looking for a solution that removes the " marks before I split the field? Or does it make more sense (less code) to wait until after the field is split, and then just look at the first and last items? I ask because it is possible that the line could be:

field1<tab>field2<tab>"Fred Flintstone", "Barney Rubble", "Wilma Flintstone"<tab>field4 

In this case I would expect temp[2] to become:

"Fred Flintstone", "Barney Rubble", "Wilma Flintstone"

and the resulting split of temp[2] should result in:
"Fred Flintstone"
"Barney Rubble"
"Wilma Flintstone"

which would be fine.

Edit: The design team has been consulted and has confirmed that, for ALL fields, there can be no embedded tabs within the fields.

Further, they have confirmed that within field 3, there can be no embedded commas within an item in the field.

Therefore, input such as:

field1<tab>field2<tab>"Fred Flintstone", "Barney, Wilma"<tab>field4 

should result in three entries for field3:

  • "Fred Flintstone"
  • "Barney
  • Wilma"

I am pressing them on another issue that may make this whole issue moot...

I think you want to

  • Split by comma
  • If ((first element starts with double-quote but does not end with double-quote) and (last element ends with double-quote but does not start with double-quote)) then remove those double-quotes
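
The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `QuoteAwareSplit` and the helper `splitField` are hypothetical, and trimming whitespace around each item is an extra step that the rule above does not state.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {

    // Hypothetical helper: splits one field on commas, then removes a single
    // pair of quotes that wraps the whole field, as in "Fred, Barney, Wilma".
    static List<String> splitField(String field) {
        List<String> items = new ArrayList<>();
        for (String part : field.split(",")) {
            items.add(part.trim()); // assumption: surrounding blanks are not data
        }
        // An empty field yields one empty item; treat that as "no names".
        if (items.size() == 1 && items.get(0).isEmpty()) {
            items.clear();
        }
        if (items.size() < 2) {
            return items; // zero or one item: nothing to unwrap
        }
        int last = items.size() - 1;
        String first = items.get(0);
        String end = items.get(last);
        boolean firstOpensOnly = first.startsWith("\"") && !first.endsWith("\"");
        boolean lastClosesOnly = end.endsWith("\"") && !end.startsWith("\"");
        if (firstOpensOnly && lastClosesOnly) {
            items.set(0, first.substring(1));
            items.set(last, end.substring(0, end.length() - 1));
        }
        return items;
    }

    public static void main(String[] args) {
        // prints [Fred, Barney, Wilma]
        System.out.println(splitField("\"Fred, Barney, Wilma\""));
        // prints ["Fred Flintstone", "Barney Rubble"]
        System.out.println(splitField("\"Fred Flintstone\", \"Barney Rubble\""));
    }
}
```

Individually quoted items are left untouched, which matches what the question says would be fine.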

Still, I am wondering if there can be bad data, like

field1<tab>field2<tab>"Fred Flintstone", "Barney, Wilma"<tab>field4 

Resulting in all kinds of dirty data. You might want to rigorously define the grammar instead of using examples, at which point the parsing should become trivial.

I recommend writing a dedicated parser with two levels:

  • The outer level should stop at every occurrence of TAB.
  • The inner level should stop at every occurrence of comma, and discard first character quote and last character quote.

So as not to sound too theoretical, here is my proposal:

import java.util.ArrayList;
import java.util.List;

public class CombinedStringParser
{
    private final String src;

    private final char delimitter;

    private int currentPos=0;

    public CombinedStringParser(String src, char delimitter)
    {
        super();
        this.src=src;
        this.delimitter=delimitter;
    }

    public String nextToken()
    {
        int initialPos=this.currentPos;
        int x=0;
        while (this.currentPos < this.src.length())
        {
            char c=this.src.charAt(this.currentPos++);
            if (c == this.delimitter)
            {
                x=-1;
                break;
            }
        }
        return this.src.substring(initialPos, this.currentPos + x);
    }

    public List<String> nextListOfTokens(char listDelimitter)
    {
        int initialPos=this.currentPos;
        List<String> list=new ArrayList<String>();
        while (this.currentPos < this.src.length())
        {
            char c=this.src.charAt(this.currentPos++);
            if (c == this.delimitter)
            {
                break;
            }
            else
            {
                if (c == listDelimitter)
                {
                    int p1=initialPos;
                    int p2=this.currentPos - 1;
                    if (this.src.charAt(p1) == '\"')
                    {
                        p1++;
                    }
                    if (p2 > p1 && this.src.charAt(p2 - 1) == '\"')
                    {
                        p2--;
                    }
                    list.add(this.src.substring(p1, p2));
                    initialPos=this.currentPos;
                }
            }
        }
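        // The last item ends just before the outer delimiter if the loop
        // stopped on one, otherwise at the end of the source string.
        int end=this.currentPos;
        if (end > initialPos && this.src.charAt(end - 1) == this.delimitter)
        {
            end--;
        }
        if (initialPos < end)
        {
            int p1=initialPos;
            int p2=end;
            if (this.src.charAt(p1) == '\"')
            {
                p1++;
            }
            if (p2 > p1 && this.src.charAt(p2 - 1) == '\"')
            {
                p2--;
            }
            list.add(this.src.substring(p1, p2));
        }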
        return list;
    }
}

How to use it:

CombinedStringParser parser=new CombinedStringParser(src, '\t');
String firstToken=parser.nextToken();
String secondToken=parser.nextToken();
List<String> thirdToken=parser.nextListOfTokens(',');
String fourthToken=parser.nextToken();

Apart from being effective, this solution is also efficient thanks to its specificity, because it reads each character just once.

Just remove the " characters first, then split.

temp = entry.replaceAll("\"", "").split("\t", 4);
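
For illustration, here is a minimal sketch of the whole flow built on that one-liner. The class name `StripQuotesThenSplit` and the helper `namesFromLine` are hypothetical, and the trimming and skipping of empty items are assumptions that go beyond the line above. It uses `String.replace`, which treats the quote as a literal and has the same effect here as `replaceAll`.

```java
import java.util.ArrayList;
import java.util.List;

public class StripQuotesThenSplit {

    // Hypothetical helper: drops every double quote from the line, splits it
    // into at most four tab-separated fields, and returns the names in field 3.
    static List<String> namesFromLine(String line) {
        String[] fields = line.replace("\"", "").split("\t", 4);
        List<String> names = new ArrayList<>();
        if (fields.length < 3) {
            return names; // malformed line: no third field
        }
        for (String name : fields[2].split(",")) {
            String trimmed = name.trim(); // assumption: surrounding blanks are not data
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [Fred, Barney, Wilma]
        System.out.println(namesFromLine("field1\tfield2\t\"Fred, Barney, Wilma\"\tfield4"));
        // prints []
        System.out.println(namesFromLine("field1\tfield2\t\tfield4"));
    }
}
```

Note that this removes quotes from every field, not just field3, so it only fits if a double quote is never legitimate data anywhere in the line.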
