Parsing text using Regex

I am trying to parse a String that contains two key components: one gives the timing options, and the other gives the position.

Here is what the text looks like:

KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif

The {iiii} is the position and the {ttt} is the timing options.

I need to separate the {ttt} and {iiii} out so I can get a full file name. For example, position 1 and time slice 1 = KB_H9Oct4GFP_20130305_p0000001t000000001z001c02.tif

So far, here is how I am parsing them:

    int startTimeSlice = 1;
    int startTile = 1;
    String regexTime = "([^{]*)\\{([t]+)\\}(.*)";
    Pattern patternTime = Pattern.compile(regexTime);       
    Matcher matcherTime = patternTime.matcher(filePattern);

    if (!matcherTime.find() || matcherTime.groupCount() != 3)
    {

        throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
    }

    String timePrefix = matcherTime.group(1);
    int tCount = matcherTime.group(2).length();
    String timeSuffix = matcherTime.group(3);

    String timeMatcher = timePrefix + "%0" + tCount + "d" + timeSuffix;


    String timeFileName = String.format(timeMatcher, startTimeSlice);

    String regex = "([^{]*)\\{([i]+)\\}(.*)";
    Pattern pattern = Pattern.compile(regex);       
    Matcher matcher = pattern.matcher(timeFileName);        



    if (!matcher.find() || matcher.groupCount() != 3)
    {
        throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
    }

    String prefix = matcher.group(1);
    int iCount = matcher.group(2).length();
    String suffix = matcher.group(3);

    String nameMatcher = prefix + "%0" + iCount + "d" + suffix;

    String fileName = String.format(nameMatcher, startTile);

Unfortunately my code is not working: it fails when checking whether the second matcher finds anything in timeFileName.

After the first regex check, timeFileName is 000000001z001c02.tif, so it is cutting off the beginning portion, including the {iiii}.

Unfortunately I cannot assume which group comes first ({iiii} or {ttt}), so I am trying to devise a solution that just handles {ttt} first and then processes {iiii}.

Also, here is another example of valid text that I am trying to parse: F_{iii}_{ttt}.tif

Steps to follow:

  • Find the string {ttt...} in the file name
  • Form a number format based on the number of "t" characters in that string
  • Find the string {iiii...} in the file name
  • Form a number format based on the number of "i" characters in that string
  • Use the String.replace() method to substitute the time and the position

Here is the code:

    import java.text.DecimalFormat;
    import java.text.NumberFormat;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
    int startTimeSlice = 1;
    int startTile = 1;

    // t+ (rather than t*) so that an empty "{}" is not treated as a placeholder
    Pattern patternTime = Pattern.compile("(\\{t+\\})");
    Matcher matcherTime = patternTime.matcher(filePattern);

    if (matcherTime.find()) {
        String timePattern = matcherTime.group(0); // {ttt}

        // Turn "{ttt}" into "{000}", then strip the braces to get "000"
        NumberFormat timingFormat = new DecimalFormat(timePattern.replaceAll("t", "0")
                .substring(1, timePattern.length() - 1));

        Pattern patternPosition = Pattern.compile("(\\{i+\\})");
        Matcher matcherPosition = patternPosition.matcher(filePattern);

        if (matcherPosition.find()) {
            String positionPattern = matcherPosition.group(0); // {iiii}

            // Turn "{iiii}" into "{0000}", then strip the braces to get "0000"
            NumberFormat positionFormat = new DecimalFormat(positionPattern
                    .replaceAll("i", "0").substring(1, positionPattern.length() - 1));

            // Substitute both placeholders in the original pattern
            System.out.println(filePattern.replace(timePattern,
                    timingFormat.format(startTimeSlice)).replace(positionPattern,
                    positionFormat.format(startTile)));
        }
    }
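If the placeholders can appear in either order, one alternative is to rewrite every placeholder in a single pass instead of searching for {ttt} and {iiii} separately. The following is only a minimal sketch of that idea, not part of the answer above: the class name PlaceholderFiller and the method fill are hypothetical, and it assumes that i and t are the only placeholder letters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderFiller {
    // Matches {iiii}, {ttt}, ...: group 2 is the letter, group 1 the whole run of that letter.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(([it])\\2*)\\}");

    // Replaces each {i...} with the zero-padded tile and each {t...} with the zero-padded time slice.
    static String fill(String filePattern, int tile, int timeSlice) {
        Matcher m = PLACEHOLDER.matcher(filePattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int width = m.group(1).length();                 // number of i's or t's
            int value = m.group(2).equals("i") ? tile : timeSlice;
            String padded = String.format("%0" + width + "d", value);
            m.appendReplacement(out, Matcher.quoteReplacement(padded));
        }
        m.appendTail(out);                                   // copy the text after the last placeholder
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fill("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif", 1, 1));
        System.out.println(fill("F_{iii}_{ttt}.tif", 1, 1)); // F_001_001.tif
    }
}
```

Because each placeholder is rewritten where it is found, the result does not depend on whether {iiii} or {ttt} comes first.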

Your first pattern looks like this:

String regexTime = "([^{]*)\\{([t]+)\\}(.*)";

This finds a string consisting of a sequence of zero or more non-{ characters, followed by {t...t}, followed by other characters.

When your input is

KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif

the first substring that matches is

iiii}t00000{ttt}z001c02.tif

The { before the i's can't match, because you told it to match only non-{ characters. The result is that when you re-form the string to do the second match, it will start with iiii} and therefore won't match {iiii} like you're trying to do.
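This behaviour can be checked directly. The snippet below is a small illustrative sketch (the class name FirstPatternDemo and the helper firstMatch are made up for this example); it runs the original pattern with find() and returns the overall match together with group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstPatternDemo {
    // Returns { whole match, group 1 } for the original time pattern.
    static String[] firstMatch(String input) {
        Matcher m = Pattern.compile("([^{]*)\\{([t]+)\\}(.*)").matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("No match in: " + input);
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = firstMatch("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif");
        System.out.println("matched: " + r[0]); // matched: iiii}t00000{ttt}z001c02.tif
        System.out.println("group 1: " + r[1]); // group 1: iiii}t00000
    }
}
```

find() skips every starting position up to and including the first {, because from any of those positions [^{]* would have to cross that { to reach {ttt}.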

When you're looking for {ttt...}, I don't see any reason to exclude { or any other character from the first part of the string. So changing the regex to

"^(.*)\\{(t+)\\}(.*)$"

may be a simple way to fix this. Note that if you want to make sure your groups include the entire beginning and the entire end of the string, you should include ^ and $ to match the beginning and end of the string, respectively; otherwise the matcher engine may decide not to include everything. In this case it won't, but it's a good habit to get into anyway, because it makes things explicit and doesn't require anyone to know the difference between "greedy" and "reluctant" matching. Alternatively, use matches() instead of find(), since matches() automatically tries to match the entire string.
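As a rough sketch of how the anchored pattern ^(.*)\\{(t+)\\}(.*)$ could be applied once for t and once for i, consider the following. The class TwoPassFormatter and the method fillOne are hypothetical names, and the sketch assumes the letter passed in is a plain character such as 't' or 'i' that needs no regex escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassFormatter {
    // Replaces one {xxx} run of the given letter with a zero-padded number.
    static String fillOne(String filePattern, char letter, int value) {
        Pattern p = Pattern.compile("^(.*)\\{(" + letter + "+)\\}(.*)$");
        Matcher m = p.matcher(filePattern);
        if (!m.matches()) { // matches() requires the whole string to match
            throw new IllegalArgumentException("Incorrect filePattern: " + filePattern);
        }
        // Pad to as many digits as there are letters inside the braces
        String padded = String.format("%0" + m.group(2).length() + "d", value);
        return m.group(1) + padded + m.group(3);
    }

    public static void main(String[] args) {
        String filePattern = "KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif";
        String withTime = fillOne(filePattern, 't', 1); // {iiii} is still present here
        System.out.println(fillOne(withTime, 'i', 1));
    }
}
```

Only the number itself goes through String.format here; the prefix and suffix are concatenated as-is, so a stray % in the file name would not be interpreted as a format specifier.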

Okay, so after a bit of testing I found a way to handle the case:

For parsing the {ttt} I can use the regex: (.*)\\{t([t]+)\\}(.*)

This means I have to increment tCount by one to account for the t I grab with \\{t

The same goes for {iii}: (.*)\\{i([i]+)\\}(.*)

Perhaps an easier way to do this (as confirmed by http://regex101.com/r/vG7kY7) is

(\{i+\}).*(\{t+\})

You don't need the [] around a single character you are matching. Keep it simple. i+ means "one or more i's", and as long as the placeholders appear in the order given, this expression will work (with the first group capturing the {iiii} placeholder and the second group capturing the {ttt} placeholder).

You may need to escape the backslash when writing it in a string...
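In Java, for instance, each backslash is doubled inside the string literal. A minimal sketch follows, with the hypothetical names CombinedPatternDemo and placeholders; it returns null when the input does not contain {i...} followed later by {t...}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedPatternDemo {
    // The regex (\{i+\}).*(\{t+\}) written as a Java string literal.
    static final Pattern BOTH = Pattern.compile("(\\{i+\\}).*(\\{t+\\})");

    // Returns { position placeholder, time placeholder }, or null if there is no match.
    static String[] placeholders(String input) {
        Matcher m = BOTH.matcher(input);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] found = placeholders("KB_H9Oct4GFP_20130305_p00{iiii}t00000{ttt}z001c02.tif");
        System.out.println(found[0] + " " + found[1]);            // {iiii} {ttt}
        System.out.println(placeholders("F_{ttt}_{iii}.tif"));    // null: t comes before i
    }
}
```

The second call shows the limitation: with the placeholders in the opposite order, this single expression finds nothing.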
