Java pattern with tab characters

I have a file with lines like:

string1 (tab) string2 (tab) string3 (tab) string4

I want to get string3 from every line. All I know about the lines is that string3 is between the second and the third tab character. Is it possible to extract it with a pattern like

Pattern pat = Pattern.compile(".\t.\t.\t.");

String string3 = tempValue.split("\\t")[2];

It sounds like you just want:

for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}

(You can escape the backslash, writing "\\t", so that the regex engine itself parses backslash-t as a tab, but you don't have to. The above works fine.)
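To illustrate the point about escaping, here is a minimal sketch showing that both spellings split a line the same way. The class and method names (`TabEscapeDemo`, `splitLiteralTab`, `splitRegexEscape`) are made up for this example and do not come from the original answer.

```java
import java.util.Arrays;

class TabEscapeDemo {
    /** "\t" is turned into a real tab character by the Java compiler. */
    static String[] splitLiteralTab(String line) {
        return line.split("\t");
    }

    /** "\\t" reaches the regex engine as backslash + t, which it reads as a tab. */
    static String[] splitRegexEscape(String line) {
        return line.split("\\t");
    }

    public static void main(String[] args) {
        String line = "string1\tstring2\tstring3\tstring4";
        String[] a = splitLiteralTab(line);
        String[] b = splitRegexEscape(line);
        // Both arrays hold the same four fields.
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.equals(a, b));
    }
}
```

Either form is fine; the choice is mostly a matter of which one reads more clearly to you.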

Another alternative to the built-in String.split method using a regex is the Guava Splitter class. Probably not necessary here, but worth being aware of.

EDIT: As noted in comments, if you're going to repeatedly use the same pattern, it's more efficient to compile a single Pattern and use Pattern.split:

private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

...

String[] bits = TAB_SPLITTER.split(line);
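One detail worth knowing when checking for exactly four fields: by default both String.split and Pattern.split drop trailing empty strings, so a line whose last field is empty yields only three elements. The sketch below passes a limit of -1 to keep them. The class name `TabFieldExtractor` and the method `thirdField` are hypothetical, and returning null for malformed lines is just one possible way to handle them.

```java
import java.util.regex.Pattern;

class TabFieldExtractor {
    private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

    /**
     * Returns the third tab-separated field, or null if the line does not
     * contain exactly four fields. The -1 limit keeps trailing empty fields.
     */
    static String thirdField(String line) {
        String[] bits = TAB_SPLITTER.split(line, -1);
        return bits.length == 4 ? bits[2] : null;
    }

    public static void main(String[] args) {
        // The last field is empty here; with the default limit this line
        // would split into three elements instead of four.
        System.out.println(thirdField("string1\tstring2\tstring3\t"));
        System.out.println(thirdField("only\ttwo"));
    }
}
```

Whether an empty fourth field should count as valid depends on your data, so adjust the length check to suit.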

If you want a regex which captures the third field only and nothing else, you could use the following:

String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);

Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
  System.err.println(matcher.group(1));
}

I don't know whether this would perform any better than split("\\t") for parsing a large file.

UPDATE

I was curious to see how the simple split versus the more explicit regex would perform, so I tested three different parser implementations.

/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}

I ran each ten times against the same million-line file. Here are the average results:

  • split: 2768.8 ms
  • compiled split: 1041.5 ms
  • group regex: 1015.5 ms

The clear conclusion is that it is important to compile your pattern, rather than rely on String.split, if you are going to use it repeatedly.

The comparison between compiled split and group regex is not conclusive based on this testing, since the two averages differ by less than 3%. The regex could probably be tweaked further for performance.
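For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing loop. It is not the harness used for the numbers above: the `Parser` interface shape, the class name `ParserTiming`, and the synthetic input from `sampleLines` are all assumptions, since the original post shows neither the interface nor the test file. Results depend heavily on JVM version and warm-up, so treat any numbers as indicative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParserTiming {
    /** Assumed shape of the Parser interface; the original post does not show it. */
    interface Parser {
        String parse(String line);
    }

    private static final Pattern TAB = Pattern.compile("\\t");

    /** Uncompiled split, same regex string as the SplitParser above. */
    static final Parser SPLIT = line -> {
        String[] fields = line.split("\\t");
        return fields.length == 4 ? fields[2] : null;
    };

    /** Split with a pattern compiled once. */
    static final Parser COMPILED_SPLIT = line -> {
        String[] fields = TAB.split(line);
        return fields.length == 4 ? fields[2] : null;
    };

    /** Accumulates result lengths so the parsing work has a visible effect. */
    static long checksum;

    /** Builds synthetic four-field lines; stands in for the million-line file. */
    static List<String> sampleLines(int count) {
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            lines.add("a" + i + "\tb" + i + "\tc" + i + "\td" + i);
        }
        return lines;
    }

    /** Average wall-clock milliseconds per run over the given number of runs. */
    static double averageMillis(Parser parser, List<String> lines, int runs) {
        long totalNanos = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            for (String line : lines) {
                String field = parser.parse(line);
                if (field != null) {
                    checksum += field.length();
                }
            }
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines(1_000_000);
        System.out.println("split: " + averageMillis(SPLIT, lines, 10) + " ms");
        System.out.println("compiled split: " + averageMillis(COMPILED_SPLIT, lines, 10) + " ms");
    }
}
```

A dedicated benchmarking tool would give more trustworthy figures than a hand-rolled loop like this, but the loop is enough to see large differences.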

UPDATE

A further simple optimization is to re-use the Matcher rather than create one per loop iteration.

static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
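If the parser has to be shared between threads, one possible way to keep the Matcher re-use is to hold one Matcher per thread. This is a sketch of that idea, not part of the original answer. The class name `ThreadLocalRegexParser` is made up, and the regex drops the non-capturing groups from the version above, which should not change what it matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadLocalRegexParser {
    // Pattern is immutable and safe to share between threads.
    private static final Pattern PATTERN =
            Pattern.compile("[^\\t]*\\t[^\\t]*\\t([^\\t]*)\\t[^\\t]*");

    // One Matcher per thread, created lazily on that thread's first call.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Returns the third of exactly four tab-separated fields, or null. */
    static String parse(String line) {
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("string1\tstring2\tstring3\tstring4"));
    }
}
```

Whether the ThreadLocal lookup costs less than creating a new Matcher per call is something you would need to measure for your own workload.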
