
Java: Parsing Strings with extra quotes

Alternate problem title: Splitting a comma-delimited list that is inside a tab-delimited list.

I'm looking for a solution that does not involve any packages beyond the standard Java libraries. This must have been solved before; I just don't know which keywords to use on Stack Overflow to find it!

I have a tab-delimited file that I am parsing. I perform error checking on the fields after splitting the line to prevent bad data from getting into my program. I have pretty much everything solved except for one field. The basic layout of the input line is:

field1<tab>field2<tab>field3<tab>field4

field3, by design, can contain:

  1. Empty string:

     field1<tab>field2<tab><tab>field4
  2. One string, with or without blanks:

     field1<tab>field2<tab>Fred Flintstone<tab>field4
  3. Multiple strings separated by commas:

     field1<tab>field2<tab>Fred, Barney, Wilma<tab>field4

The line is read and split as follows:

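    // Assumes pq2File is a java.io.BufferedReader (the method is readLine, not readline)
    String entry = pq2File.readLine();
    // A limit of 4 keeps any further tabs inside the fourth field
    String[] temp = entry.split("\t", 4);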

When I split the input line by "\t", my third field (temp[2]) is set as follows in each of the cases above:

  1. []
  2. [Fred Flintstone]
  3. [Fred, Barney, Wilma]

I then split field3 again by ",":

    ArrayList<String> names =
        new ArrayList<String>(Arrays.asList(temp[2].split(",")));

giving me the following values in the ArrayList names in each of the cases above:

  1. [empty]
  2. Fred Flintstone
  3. Fred
     Barney
     Wilma
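
The two-stage split described above can be reproduced with a small self-contained sketch. The class name `SplitDemo` and the helper `namesOf` are hypothetical names introduced only for illustration, and the sketch assumes every line has at least three tab-separated fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // Splits one tab-delimited line into at most 4 fields,
    // then splits the third field (index 2) on commas.
    static List<String> namesOf(String entry) {
        String[] temp = entry.split("\t", 4);
        return new ArrayList<String>(Arrays.asList(temp[2].split(",")));
    }

    public static void main(String[] args) {
        // The three documented shapes of field3
        System.out.println(namesOf("field1\tfield2\t\tfield4"));
        System.out.println(namesOf("field1\tfield2\tFred Flintstone\tfield4"));
        System.out.println(namesOf("field1\tfield2\tFred, Barney, Wilma\tfield4"));
    }
}
```

Two details are worth knowing when running this: an empty field3 produces a list holding one empty string rather than an empty list, and the items after the first keep the blank that follows each comma unless they are trimmed.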

All this is handled correctly when I use a text editor to create the file, or SQL statements to pull the data out of an external, remote system to which I do not have access. The problem comes in with a user who insists on using MS Excel to create the file. In this case the line looks like this:

field1<tab>field2<tab>"Fred, Barney, Wilma"<tab>field4

When I parse the line, my variable gets the value

"Fred, Barney, Wilma"

And splitting it by "," results in:

"Fred
Barney
Wilma"

Obviously I want to get rid of the extra " marks. Should I remove the " marks before I split the field, or does it make more sense (less code) to wait until after the field is split and then just look at the first and last items? I ask because it is possible that the line could be:

field1<tab>field2<tab>"Fred Flintstone", "Barney Rubble", "Wilma Flintstone"<tab>field4 

In this case I would expect temp[2] to become:

"Fred Flintstone", "Barney Rubble", "Wilma Flintstone"

and the resulting split of temp[2] should result in:

"Fred Flintstone"
"Barney Rubble"
"Wilma Flintstone"

which would be fine.

Edit: The design team has been consulted and has confirmed that, for ALL fields, there can be no embedded tabs within the fields.

Further, they have confirmed that within field 3 there can be no embedded commas within an item.

Therefore, input such as:

field1<tab>field2<tab>"Fred Flintstone", "Barney, Wilma"<tab>field4 

should result in three entries for field3:

  • "Fred Flintstone"
  • "Barney
  • Wilma"
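
Since items can never contain a comma, one option is to split first and clean each item afterwards. The following is a minimal sketch of that idea, not a definitive implementation: `FieldCleaner` and `cleanItems` are hypothetical names, and it assumes the goal is to drop every double quote found at either end of an item and to trim the blank that follows each comma.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldCleaner {
    // Splits field3 on commas, trims each piece, and removes one leading
    // and one trailing double quote from it if present (assumed policy).
    static List<String> cleanItems(String field3) {
        List<String> items = new ArrayList<String>();
        for (String raw : field3.split(",")) {
            String item = raw.trim();
            if (item.startsWith("\"")) {
                item = item.substring(1);
            }
            if (item.endsWith("\"")) {
                item = item.substring(0, item.length() - 1);
            }
            items.add(item);
        }
        return items;
    }
}
```

With this policy the Excel-style value `"Fred, Barney, Wilma"` and the mixed value shown above both come out as three unquoted items. If the quotes around individually quoted items should be kept instead, the two `if` blocks would need a stricter condition.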

I am pressing them on another issue that may make this whole issue moot...

I think you want to

  • Split by comma
  • If ((first element starts with double-quote but does not end with double-quote) and (last element ends with double-quote but does not start with double-quote)) then remove those double-quotes
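
The two steps above can be sketched as follows. This is only one reading of the rule: `OuterQuoteRule` and `splitNames` are hypothetical names, and trimming each piece before testing it is an added assumption, since the blank after each comma would otherwise hide a leading quote on later items.

```java
import java.util.ArrayList;
import java.util.List;

public class OuterQuoteRule {
    // Splits on commas, then removes the outer quotes only when the first
    // piece opens a quote it does not close and the last piece closes a
    // quote it did not open.
    static List<String> splitNames(String field3) {
        List<String> names = new ArrayList<String>();
        for (String piece : field3.split(",")) {
            names.add(piece.trim()); // trimming is an assumption of this sketch
        }
        if (names.isEmpty()) {
            return names; // e.g. a field consisting only of commas
        }
        int lastIndex = names.size() - 1;
        String first = names.get(0);
        String last = names.get(lastIndex);
        boolean opensOnly = first.startsWith("\"") && !first.endsWith("\"");
        boolean closesOnly = last.endsWith("\"") && !last.startsWith("\"");
        if (opensOnly && closesOnly) {
            // Both conditions can only hold for two different elements
            names.set(0, first.substring(1));
            names.set(lastIndex, last.substring(0, last.length() - 1));
        }
        return names;
    }
}
```

Under this rule a value wrapped as a whole by Excel loses its outer quotes, while individually quoted items such as `"Fred Flintstone"` keep theirs, which matches the results the question describes as acceptable.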

Still, I am wondering if there can be bad data, like

field1<tab>field2<tab>"Fred Flintstone", "Barney, Wilma"<tab>field4 

Resulting in all kinds of dirty data. You might want to rigorously define the grammar instead of using examples, at which point the parsing should become trivial.

I recommend coding a specific parser in two levels:

  • The outer level should stop at every occurrence of TAB.
  • The inner level should stop at every occurrence of a comma, and discard a leading quote character and a trailing quote character.

So as not to sound too theoretical, here is my proposal:

import java.util.ArrayList;
import java.util.List;

public class CombinedStringParser
{
    private final String src;

    private final char delimiter;

    private int currentPos=0;

    public CombinedStringParser(String src, char delimiter)
    {
        this.src=src;
        this.delimiter=delimiter;
    }

    // Returns the text up to the next outer delimiter (or the end of the
    // input) and leaves currentPos just past that delimiter.
    public String nextToken()
    {
        int initialPos=this.currentPos;
        int end=this.src.length();
        while (this.currentPos < this.src.length())
        {
            char c=this.src.charAt(this.currentPos++);
            if (c == this.delimiter)
            {
                // The delimiter itself is not part of the token.
                end=this.currentPos - 1;
                break;
            }
        }
        return this.src.substring(initialPos, end);
    }

    // Reads one outer field and splits it on listDelimiter.
    public List<String> nextListOfTokens(char listDelimiter)
    {
        int initialPos=this.currentPos;
        int end=this.src.length();
        List<String> list=new ArrayList<String>();
        while (this.currentPos < this.src.length())
        {
            char c=this.src.charAt(this.currentPos++);
            if (c == this.delimiter)
            {
                // The field ends here, not at the end of the whole line.
                end=this.currentPos - 1;
                break;
            }
            else if (c == listDelimiter)
            {
                list.add(unquote(initialPos, this.currentPos - 1));
                initialPos=this.currentPos;
            }
        }
        // Add the last item; an empty field yields an empty list.
        if (initialPos < end)
        {
            list.add(unquote(initialPos, end));
        }
        return list;
    }

    // Returns src[from, to) without surrounding blanks and without one
    // leading and one trailing double quote.
    private String unquote(int from, int to)
    {
        String item=this.src.substring(from, to).trim();
        if (item.startsWith("\""))
        {
            item=item.substring(1);
        }
        if (item.endsWith("\""))
        {
            item=item.substring(0, item.length() - 1);
        }
        return item;
    }
}

How to use it:

CombinedStringParser parser=new CombinedStringParser(src, '\t');
String firstToken=parser.nextToken();
String secondToken=parser.nextToken();
List<String> thirdToken=parser.nextListOfTokens(',');
String fourthToken=parser.nextToken();

Apart from being effective, this solution is also efficient thanks to its specificity, because it scans each character just once.

Just remove the " characters first, and then split.

temp = entry.replaceAll("\"", "").split("\t", 4);
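
A slightly fuller sketch of this remove-then-split approach is shown below. `StripThenSplit` and `namesOf` are hypothetical names, the sketch assumes each line has at least three fields, and the trimming of each name is an addition that the answer itself does not mention.

```java
import java.util.ArrayList;
import java.util.List;

public class StripThenSplit {
    // Deletes every double quote in the line, splits on tabs,
    // then splits the third field on commas and trims each name.
    static List<String> namesOf(String entry) {
        String[] temp = entry.replace("\"", "").split("\t", 4);
        List<String> names = new ArrayList<String>();
        for (String name : temp[2].split(",")) {
            names.add(name.trim());
        }
        return names;
    }
}
```

Note the trade-off: this removes quotes from all four fields, not only field3, so it only fits if no field is ever expected to contain a legitimate " character.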
