
How to add an empty string when 2 delimiters come one after another with String.split()

I'm quite new to regex and I have to split EDI files for a loader I'm developing. If you are not familiar with the format, here is an example of 2 segments (modified to show every case, so it's not a real example):

APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'

Segments end with ', unless it is preceded by the escape character, which is the question mark: ?' must not be treated as the end of a segment, for example. + and : are the main delimiters (: separates the parts of composite data such as an address).

Splitting into segments works fine, but I have issues with the other delimiters. I would like to have a String[] with all the elements, even if they are empty, because I need to process it afterwards (insert into a DB). For the example above, the segment:

APD+EM2:0:16?'30::6+++++++DA

would transform into:

{"APD","EM2","0","16?'30","","6","","","","","","","DA"}

Currently with my code, I get an array like this:

{"APD","EM2","0","16?'30","6","DA"}

Can I please have some help with my regex? Making it match ++ and :: is beyond my skills for now. I need to remove the escaping characters as well, but I'll work on that on my own.

BTW, I need to process a lot of data (300 GB of raw text), so if what I do is bad performance-wise, don't hesitate to tell me, for example regarding splitting on both + and : at the same time.

The EDIFACT format is not discussed much around here, and the few examples I found were not working for me.

Current code:

private final String DATA_ELEMENT_DELIMITER = "(?<!\\?)\\+";
private final String DATA_COMPOSITE_ELEMENT_DELIMITER = "(?<!\\?):";

private String[] split (String segments){       
    return Stream.of(segments)
            .flatMap(Pattern.compile(DATA_ELEMENT_DELIMITER)::splitAsStream)
            .flatMap(Pattern.compile(DATA_COMPOSITE_ELEMENT_DELIMITER)::splitAsStream)
            .toArray(String[]::new);
}

EDIT: Here is the code I'm running. BTW, I'm running on Java 8; I'm not sure it makes a difference though:

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;
public class Split {

    public static void main(String[] args) {
        Split s = new Split();
        System.out.println(
                Arrays.toString(
                    s.split("APD+EM2:0:16?'30::6+++++++DA'")
                )
            );
    }
    
    
    private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
    
    private String[] split (String segments){       
        return Stream.of(segments)
                .flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
                .flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)
                .toArray(String[]::new);
    }
}

Here is the output I get:

[APD, EM2, 0, 16?'30, , 6, DA']

EDIT EDIT

After running this code in an online Java 11 compiler, the output is correct, but it is not correct on Java 8.

My first note is that for improved performance, you definitely want to compile the Patterns once and reuse the instances:

private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
// ...
.flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
.flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)

Second, as @user15244370 mentioned, running your code does produce the output you are looking for. I ran it like this:

System.out.println(
    Arrays.toString(
        split("APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'")
    )
);

and got the output:

[APD, EM2, 0, 16?'30, , 6, , , , , , , DA'APD, EM2, 0, 1630, , 6, , , , , , , DA']

Assuming there is some difference between what you have posted and what you are actually running, note that the documentation for splitAsStream mentions:

Trailing empty strings will be discarded and not encountered in the stream.
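
To see what that sentence means in practice, here is a small sketch comparing splitAsStream with Pattern.split called with a negative limit, which keeps trailing empty strings. The class name TrailingEmptyDemo, the method names and the input A+B++ are made up for illustration and are not part of your code:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the question's code.
class TrailingEmptyDemo {
    private static final Pattern PLUS = Pattern.compile("\\+");

    // splitAsStream: trailing empty strings are discarded.
    static String[] viaStream(String input) {
        return PLUS.splitAsStream(input).toArray(String[]::new);
    }

    // split with a negative limit: trailing empty strings are kept.
    static String[] viaSplit(String input) {
        return PLUS.split(input, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(viaStream("A+B++"))); // [A, B]
        System.out.println(Arrays.toString(viaSplit("A+B++")));  // [A, B, , ]
    }
}
```

So if a segment ends with delimiters, the stream version silently returns fewer elements than the array version.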


Are you doing any additional processing after the call to split? And how are you printing the array? Is it possible that the method you are using to print the String[] is removing empty strings? As far as I can tell, your implementation should function as you intend.
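
If the result really does differ between Java 8 and Java 11, one explanation I would check (an assumption on my part, not something I reproduced on your setup) is that on Java 8 splitAsStream returns an empty stream when its input is the empty string, so the empty elements produced by the + split vanish in the second flatMap; as far as I know this behaviour was changed in Java 9. A minimal sketch that avoids chaining streams altogether is to split once on either delimiter and pass a negative limit so that no empty string is dropped. The class name EdiSplit, the method splitElements and the combined pattern are mine, not taken from your code:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, shown only as an illustration.
class EdiSplit {
    // One pattern for both delimiters: a + or : that is not
    // preceded by the ? escape character.
    private static final Pattern DELIMITER = Pattern.compile("(?<!\\?)[+:]");

    static String[] splitElements(String segment) {
        // A limit of -1 keeps every empty string, including trailing ones.
        return DELIMITER.split(segment, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitElements("APD+EM2:0:16?'30::6+++++++DA")));
        // prints [APD, EM2, 0, 16?'30, , 6, , , , , , , DA]
    }
}
```

This also scans each segment once with a single precompiled Pattern instead of running two stream stages, which may help a little with the volume of data you mention, though I have not measured it.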
