Java pattern with tab characters
I have a file with lines like:
string1 (tab) string2 (tab) string3 (tab) string4
I want to get string3 from every line. All I know about the lines is that string3 is between the second and the third tab character. Is it possible to extract it with a pattern like:
Pattern pat = Pattern.compile(".\t.\t.\t.");
String string3 = tempValue.split("\\t")[2];
It sounds like you just want:
for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}
(You can escape the string so that the regex engine has to parse the backslash-t as tab, but you don't have to. The above works fine.)
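One detail worth checking with the length test above: String.split with no limit argument drops trailing empty strings, so a line whose last field is empty reports three fields rather than four. The sketch below is only an illustration (the class and method names are made up for this example) and assumes that an empty fourth field should still count as a field; passing a negative limit keeps it.

```java
class TrailingFieldsDemo {
    /** Returns the third tab-separated field, or null unless the line has exactly four fields. */
    static String thirdField(String line) {
        // A negative limit tells split to keep trailing empty strings.
        String[] bits = line.split("\t", -1);
        return bits.length == 4 ? bits[2] : null;
    }

    public static void main(String[] args) {
        String line = "a\tb\tc\t"; // the fourth field is empty
        System.out.println(line.split("\t").length);     // trailing empty field is dropped
        System.out.println(line.split("\t", -1).length); // trailing empty field is kept
        System.out.println(thirdField(line));
    }
}
```

If empty trailing fields never occur in your file, the plain split("\t") call is enough.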
Another alternative to the built-in String.split method using a regex is the Guava Splitter class. Probably not necessary here, but worth being aware of.
EDIT: As noted in comments, if you're going to repeatedly use the same pattern, it's more efficient to compile a single Pattern and use Pattern.split:
private static final Pattern TAB_SPLITTER = Pattern.compile("\t");
...
String[] bits = TAB_SPLITTER.split(line);
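Put together with the loop from earlier, that might look roughly like the following. This is a sketch, not the only way to do it: the class name, the method name, and the choice to silently skip malformed lines are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ThirdFieldExtractor {
    // Compiled once and shared; Pattern instances are immutable and safe to reuse.
    private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

    /** Collects the third field of every line that has exactly four tab-separated fields. */
    static List<String> thirdFields(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] bits = TAB_SPLITTER.split(line);
            if (bits.length != 4) {
                continue; // assumption: malformed lines are skipped rather than reported
            }
            result.add(bits[2]);
        }
        return result;
    }
}
```

In real code you would probably want to log or count the skipped lines instead of dropping them quietly.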
If you want a regex which captures the third field only and nothing else, you could use the following:
String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
    System.err.println(matcher.group(1));
}
I don't know whether this would perform any better than split("\\t") for parsing a large file.
UPDATE
I was curious to see how the simple split versus the more explicit regex would perform, so I tested three different parser implementations.
/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}
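The Parser interface and the timing code are not shown here. For anyone who wants to repeat the comparison, a minimal harness could look something like the sketch below. Everything in it is an assumption: the shape of the Parser interface is inferred from the classes above, the helper names are invented, and the input is generated in memory rather than read from a file. It also does no JIT warm-up, so treat its numbers as rough; a dedicated benchmarking tool would be more trustworthy.

```java
import java.util.ArrayList;
import java.util.List;

class ParserBenchmarkSketch {
    /** Assumed shape of the Parser interface implemented by the classes above. */
    interface Parser {
        String parse(String line);
    }

    /** Runs the parser over every line once and returns the elapsed time in nanoseconds. */
    static long timeOnePass(Parser parser, List<String> lines) {
        int parsed = 0;
        long start = System.nanoTime();
        for (String line : lines) {
            if (parser.parse(line) != null) {
                parsed++;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Touch the counter so the loop's work cannot be optimized away entirely.
        if (parsed > lines.size()) {
            throw new IllegalStateException("parsed more lines than were supplied");
        }
        return elapsed;
    }

    /** Averages timeOnePass over the given number of runs, in milliseconds. */
    static double averageMillis(Parser parser, List<String> lines, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += timeOnePass(parser, lines);
        }
        return total / (double) runs / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory input standing in for the real file.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            lines.add("a" + i + "\tb\tc" + i + "\td");
        }
        Parser splitParser = line -> {
            String[] fields = line.split("\\t");
            return fields.length == 4 ? fields[2] : null;
        };
        System.out.printf("split: %.2f ms per pass%n", averageMillis(splitParser, lines, 10));
    }
}
```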
I ran each ten times against the same million-line file. Here are the average results:
The clear conclusion is that it is important to compile your pattern, rather than rely on String.split, if you are going to use it repeatedly.
The result on compiled split versus group regex is not conclusive based on this testing. And probably the regex could be tweaked further for performance.
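As one example of the kind of tweak that could be tried, the variant below drops the non-capturing groups and uses possessive quantifiers (*+), which tell the engine not to backtrack into a field once it has been consumed. It accepts exactly the same lines as the original regex. I have not timed it, so whether it is actually faster is an open question; the class name is made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveRegexParser {
    // Same four-field shape as before; "*+" is a possessive quantifier (no backtracking).
    private static final Pattern PATTERN =
            Pattern.compile("[^\\t]*+\\t[^\\t]*+\\t([^\\t]*+)\\t[^\\t]*+");

    /** Returns the third field, or null if the line does not have exactly four fields. */
    public String parse(String line) {
        Matcher m = PATTERN.matcher(line);
        return m.matches() ? m.group(1) : null;
    }
}
```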
UPDATE
A further simple optimization is to re-use the Matcher rather than create one per loop iteration.
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
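If the parser does need to be shared between threads, one possible way to keep the re-use is to give each thread its own Matcher through a ThreadLocal. This is a sketch under that assumption (the class name is invented, and ThreadLocal.withInitial needs Java 8 or later); I have not measured how it compares with creating a new Matcher per call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadLocalRegexParser {
    private static final Pattern PATTERN =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    // Each thread lazily gets its own Matcher, so no Matcher is ever shared.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Returns the third field, or null if the line does not have exactly four fields. */
    public String parse(String line) {
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }
}
```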