Java pattern with tabs

Question

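I have a file like this:

string1 (tab) string2 (tab) string3 (tab) string4

I want to get string3 from every line. All I know about a line is that string3 sits between the second and third tab characters. Is it possible to do this with a pattern like the following?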

Pattern pat = Pattern.compile(".\t.\t.\t.");

Answer 1

String string3 = tempValue.split("\\t")[2];
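A minimal sketch of how this one-liner behaves, assuming `tempValue` holds a single line of the file. The class and method names are made up for illustration. Indexing `[2]` directly throws `ArrayIndexOutOfBoundsException` when a line has fewer than three fields, so this version checks the length first:

```java
final class ThirdFieldDemo {
    /** Returns the third tab-separated field, or null if the line has fewer than three fields. */
    static String thirdField(String line) {
        String[] fields = line.split("\\t");
        return fields.length > 2 ? fields[2] : null;
    }

    public static void main(String[] args) {
        System.out.println(thirdField("string1\tstring2\tstring3\tstring4"));
        System.out.println(thirdField("only\ttwo"));
    }
}
```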

Answer 2

It sounds like you just want:

// Java's enhanced for loop uses "for (Type item : collection)"
for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}

(You can escape the string so that the regex engine has to parse backslash-t as a tab, but you don't have to. The above works fine.)
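To make the escaping point concrete, here is a small sketch (the class name is hypothetical). `"\t"` hands the regex engine a literal tab character, while `"\\t"` hands it the two characters backslash and `t`, which the engine itself reads as a tab. Both should split a line the same way:

```java
import java.util.Arrays;

final class TabEscapeDemo {
    /** True if splitting on a literal tab and on the regex escape \t gives the same fields. */
    static boolean sameSplit(String line) {
        String[] viaLiteralTab = line.split("\t");    // Java compiler turns \t into a tab
        String[] viaRegexEscape = line.split("\\t");  // regex engine turns \t into a tab
        return Arrays.equals(viaLiteralTab, viaRegexEscape);
    }

    public static void main(String[] args) {
        System.out.println(sameSplit("string1\tstring2\tstring3\tstring4"));
    }
}
```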

An alternative to the built-in, regex-based String.split method is the Guava Splitter class. It is probably not necessary here, but it is worth being aware of.

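EDIT: As noted in the comments, if you are going to use the same pattern repeatedly, it is more efficient to compile a single Pattern and use Pattern.split: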

private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

...

String[] bits = TAB_SPLITTER.split(line);
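One thing to watch with either form of split: by default, trailing empty strings are dropped from the result, so a line whose last field is empty would come back with three elements and fail a `length != 4` check. Passing a negative limit keeps them. A small sketch, with a hypothetical class name:

```java
import java.util.regex.Pattern;

final class TrailingFieldDemo {
    private static final Pattern TAB = Pattern.compile("\t");

    /** Field count with the default limit: trailing empty fields are discarded. */
    static int fieldCount(String line) {
        return TAB.split(line).length;
    }

    /** Field count with a negative limit: trailing empty fields are kept. */
    static int fieldCountKeepingEmpties(String line) {
        return TAB.split(line, -1).length;
    }

    public static void main(String[] args) {
        String line = "a\tb\tc\t"; // fourth field is empty
        System.out.println(fieldCount(line));
        System.out.println(fieldCountKeepingEmpties(line));
    }
}
```

Whether an empty last field should count as a valid line depends on your data, so treat the negative limit as an option rather than a fix.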

Answer 3

If you want a regex that captures only the third field and nothing else, you could use the following:

String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);

Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
  System.err.println(matcher.group(1));
}

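I initially had no idea whether this would perform any better than split("\\t") for parsing a large file; the measurements below address that.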
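For comparison with the pattern in the question, here is a sketch (hypothetical class name) that runs both against the same input. In `".\t.\t.\t."` each `.` matches exactly one character, so with `matches()` it only accepts lines whose fields are one character long. The `[^\\t]*` form accepts fields of any length, including empty ones:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldRegexDemo {
    // The pattern from the question: one character per field.
    static final Pattern SINGLE_CHARS = Pattern.compile(".\t.\t.\t.");

    // The pattern from this answer: any run of non-tab characters per field.
    static final Pattern FIELDS =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    /** Returns the third field, or null if the line does not have exactly four fields. */
    static String third(String line) {
        Matcher m = FIELDS.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "string1\tstring2\tstring3\tstring4";
        System.out.println(SINGLE_CHARS.matcher(line).matches());
        System.out.println(third(line));
    }
}
```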

Update

I was curious how the simple split and the more explicit regex would compare, so I tested three different parser implementations.

/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}

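I ran each of these ten times against the same million-line file. Here are the average results:

Split: 2768.8 ms
Compiled split: 1041.5 ms
Group regex: 1015.5 ms

The clear conclusion is that if you are going to use a pattern repeatedly, it is important to compile it rather than rely on String.split.

Based on this test, the result for compiled split versus group regex is not conclusive. The regex could probably be tuned further for performance.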
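The answer does not show the `Parser` interface or the timing loop, so the following is only a guess at what such a harness might look like. The single-method interface, the synthetic input lines, and every name in `ParserTimingSketch` are assumptions. The lambdas stand in for `SplitParser` and `CompiledSplitParser` above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Assumed shape of the Parser interface; the answer does not show it. */
interface Parser {
    String parse(String line);
}

final class ParserTimingSketch {
    private static final Pattern TAB = Pattern.compile("\\t");

    // Lambda stand-ins for SplitParser and CompiledSplitParser.
    static final Parser SPLIT = line -> {
        String[] fields = line.split("\\t");
        return fields.length == 4 ? fields[2] : null;
    };
    static final Parser COMPILED_SPLIT = line -> {
        String[] fields = TAB.split(line);
        return fields.length == 4 ? fields[2] : null;
    };

    /** Builds synthetic four-field lines; the original test file is not available. */
    static List<String> sampleLines(int count) {
        List<String> lines = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            lines.add("a" + i + "\tb" + i + "\tc" + i + "\td" + i);
        }
        return lines;
    }

    /** Parses every line once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Parser parser, List<String> lines) {
        long start = System.nanoTime();
        int parsed = 0;
        for (String line : lines) {
            if (parser.parse(line) != null) {
                parsed++;
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (parsed != lines.size()) {
            throw new IllegalStateException("unparsed lines: " + (lines.size() - parsed));
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines(1_000_000);
        // One untimed pass per parser so JIT warm-up skews the numbers less.
        timeMillis(SPLIT, lines);
        timeMillis(COMPILED_SPLIT, lines);
        System.out.println("split: " + timeMillis(SPLIT, lines) + " ms");
        System.out.println("compiled split: " + timeMillis(COMPILED_SPLIT, lines) + " ms");
    }
}
```

A hand-rolled loop like this is sensitive to JIT compilation and garbage collection, and the gap between the two variants may differ across Java versions. A dedicated harness such as JMH would give more trustworthy numbers.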

Update

A further simple optimization is to reuse the Matcher rather than create a new one on every loop iteration.

static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
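If the parser has to be called from several threads, one possible way to keep the Matcher reuse is to give each thread its own instance through `ThreadLocal`. This is a sketch of that idea, not something from the answer, and I have not benchmarked it. The class name is hypothetical, and it uses a static method rather than the `Parser` interface so that it stands alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadLocalRegexParser {
    private static final Pattern PATTERN =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    // Each thread lazily gets its own Matcher, so no Matcher is ever shared.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Returns the third field, or null if the line does not have exactly four fields. */
    static String parse(String line) {
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("string1\tstring2\tstring3\tstring4"));
    }
}
```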

3 answers

Answer 1
6 2011-11-22 12:49:07

Answer 2
5 2011-11-22 12:49:17

Answer 3
3 (accepted) 2011-11-22 13:26:39
