Finding the part of a String that is wrapped in delimiters

Say I have a String like this:

String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
"'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";

and I need to retrieve only the Strings between '''[ and ]'''.

Example output:

http://www.facebook.com Facebook, http://www.twitter.com Twitter, http://www.tumblr.com tumblr

I'm having difficulty doing this with regex, so I came up with this idea using recursion:

System.out.println(filter(s, "'''[",  "]'''"));
....

public static String filter(String s, String open, String close){   
  int start = s.indexOf(open);
  int end = s.indexOf(close);

  filtered = filtered + s.substring(start + open.length(), end) + ", ";
  s = s.substring(end + close.length(), s.length());

  if(s.indexOf(open) >= 0 && s.indexOf(close) >= 0)
     return filter(s, open, close);

  else
     return filtered.substring(0, filtered.length() - 2);
}

But in some cases, where the opening and closing patterns are identical (for example, retrieving the words between ''' and '''), it throws "String index out of range" because start and end hold the same value.

How can I overcome this? Is regex the only solution?

Regex is the right tool for this. Use Pattern and Matcher.

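import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String filter(String s, String open, String close){
    // Quote both delimiters so regex metacharacters such as '[' are matched literally
    Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
    Matcher m = p.matcher(s);

    StringBuilder filtered = new StringBuilder();

    while (m.find()){
        filtered.append(m.group(1)).append(", ");
    }
    if (filtered.length() == 0){
        return ""; // no match, so there is no trailing ", " to trim
    }
    return filtered.substring(0, filtered.length() - 2); //-2 because trailing ", "
}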

Pattern.quote ensures that any special characters in open and close are treated as regular ones.

m.group(1) returns the first capturing group of the most recent match found by m.find().

m.find() advances to the next substring that matches the regex and returns false once there are none left, so the while loop visits every match.
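As a quick check that this approach also covers the case from the question where open and close are the same, here is a minimal sketch. The class name QuotedDelimiters and the helper between are made up for illustration, and it returns a List rather than a joined String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedDelimiters {
    // Hypothetical helper: collects every span found between open and close.
    static List<String> between(String s, String open, String close) {
        // Pattern.quote makes '[' and ']' literal; (.*?) is a lazy capture
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(s);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the text between the delimiters
        }
        return found;
    }

    public static void main(String[] args) {
        // Different delimiters
        System.out.println(between("a '''[x y]''' b '''[z]'''", "'''[", "]'''")); // [x y, z]
        // Identical delimiters
        System.out.println(between("say '''bold''' and '''more'''", "'''", "'''")); // [bold, more]
    }
}
```

Because each match consumes its closing delimiter, the next search starts after it, so identical delimiters pair up in order.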


Non-regex Solutions:

Note: in both of these, end is assigned s.indexOf(close, start + 1), using String#indexOf(String, int) and StringBuilder#indexOf(String, int), so that no error occurs even if the open and close values are the same.
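To make the effect of the second argument concrete, here is a small sketch. The class IndexOfFromIndex and the helper firstBetween are hypothetical names, and searching from start + open.length() instead of start + 1 is my own variation, chosen so the search for close cannot begin inside the opening delimiter.

```java
class IndexOfFromIndex {
    // Hypothetical helper: the first span between open and close, or null if absent.
    static String firstBetween(String s, String open, String close) {
        int start = s.indexOf(open);
        if (start == -1) {
            return null; // no opening delimiter
        }
        // Begin looking for close only after the opening delimiter ends
        int end = s.indexOf(close, start + open.length());
        if (end == -1) {
            return null; // opening delimiter without a matching close
        }
        return s.substring(start + open.length(), end);
    }

    public static void main(String[] args) {
        String s = "'''bold''' text";
        // Without a fromIndex, both searches find the same delimiter
        System.out.println(s.indexOf("'''"));    // 0
        System.out.println(s.indexOf("'''", 3)); // 7
        System.out.println(firstBetween(s, "'''", "'''")); // bold
    }
}
```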

Recursion:

public static String filter(String s, String open, String close){
    int start = s.indexOf(open);
    int end = s.indexOf(close, start + 1);

    //I took the liberty of adding "String" and renaming your variable
    String get = s.substring(start + open.length(), end);
    s = s.substring(end + close.length());

    if (s.indexOf(open) == -1){
        return get;
    }
    return get + ", " + filter(s, open, close);
}

Rather than adding the ", " right off the bat, it is a little easier to deal with it later. Also, note that s.substring(end + close.length(), s.length()) is the same as s.substring(end + close.length()). Also, I feel that it is neater to check whether s.indexOf(...) == -1 rather than checking for >= 0.

The real problem lies in the way you treat filtered. First of all, you need to declare filtered as type String. Next, since you are doing recursion, you shouldn't concatenate to filtered. That would make the line where we first see filtered: String filtered = s.substring(start + open.length(), end) + ", ";. If you fix that line, and return filtered + filter(s, open, close) from the recursive branch so that earlier results are kept, your solution works.

Iterative:

public static String filter(String str, String open, String close){
    int open_length = open.length();
    int close_length = close.length();

    StringBuilder s = new StringBuilder(str);
    StringBuilder filtered = new StringBuilder();

    for (int start = s.indexOf(open), end = s.indexOf(close, start + 1); start != -1; 
        start = s.indexOf(open), end = s.indexOf(close, start + 1)){
        filtered.append(s.substring(start + open_length, end)).append(", ");
        s.delete(0, end + close_length);
    }

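    if (filtered.length() == 0){
        return ""; // nothing was found, so there is no trailing ", " to trim
    }
    return filtered.substring(0, filtered.length() - 2); //trailing ", "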
}

This iterative method makes use of StringBuilder, but the same can be done without it. It makes two StringBuilders: an empty one, and one that holds the value of the original String. In the for loop:

  • int start = s.indexOf(open), end = s.indexOf(close, start + 1) finds the initial indices
  • start != -1 ends the loop if s does not contain open
  • start = s.indexOf(open), end = s.indexOf(close, start + 1) finds the indices again after each iteration of the loop.

The inside of the loop appends the correct substring to filtered and removes everything up to and including the closing delimiter from the other StringBuilder.
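For the remark that the same can be done without a StringBuilder copy, one possible sketch keeps a cursor into the original String instead of deleting the consumed part. The class name CursorFilter and the use of StringJoiner (Java 8+) are my own choices, not part of the answer above.

```java
import java.util.StringJoiner;

class CursorFilter {
    // Sketch: scan the original String with a moving cursor instead of deleting from a copy.
    static String filter(String s, String open, String close) {
        StringJoiner filtered = new StringJoiner(", "); // handles the separator, no trimming needed
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start == -1) {
                break; // no further opening delimiter
            }
            int end = s.indexOf(close, start + open.length());
            if (end == -1) {
                break; // opening delimiter without a matching close
            }
            filtered.add(s.substring(start + open.length(), end));
            from = end + close.length(); // continue after the closing delimiter
        }
        return filtered.toString();
    }

    public static void main(String[] args) {
        String s = "social network such as '''[http://www.facebook.com Facebook]''' , "
                + "'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";
        System.out.println(filter(s, "'''[", "]'''"));
    }
}
```

With no matches this returns an empty String rather than throwing, which is a design choice of the sketch.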

Never mind all that code in other answers... You can do it in one line:

String[] urls = str.replaceAll("^.*?'''\\[|\\]'''(?!.*\\]''').*", "").split("\\]'''.*?'''\\[");

This first strips off the leading and trailing jetsam and then splits on a delimiter that matches everything between the targets.


This can be adapted to a flexible solution that has variable delimiters:

public static String[] extract(String str, String open, String close) {
    return str.replaceAll("^.*?(\\Q" + open + "\\E|$)|\\Q" + close + "\\E(?!.*\\Q" + close + "\\E).*", "").split("\\Q" + close + "\\E.*?\\Q" + open + "\\E");
}

This regex also caters for there being no targets by returning an array with a single blank element.
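To see both behaviors, including the no-target case, a small check might look like the following. The extract method is copied from the answer; the wrapper class ExtractDemo and the sample inputs are my own. Note that . does not match line breaks by default, so this sketch assumes single-line input.

```java
import java.util.Arrays;

class ExtractDemo {
    // Same expression as in the answer: strip the ends, then split on what lies between targets.
    static String[] extract(String str, String open, String close) {
        return str.replaceAll("^.*?(\\Q" + open + "\\E|$)|\\Q" + close + "\\E(?!.*\\Q" + close + "\\E).*", "")
                  .split("\\Q" + close + "\\E.*?\\Q" + open + "\\E");
    }

    public static void main(String[] args) {
        // Two targets: leading "a [" and trailing "] c" are stripped, then "] b [" is the split point
        System.out.println(Arrays.toString(extract("a [x] b [y] c", "[", "]"))); // [x, y]

        // No targets: everything is stripped, leaving one empty element
        String[] none = extract("abc", "[", "]");
        System.out.println(none.length);       // 1
        System.out.println(none[0].isEmpty()); // true
    }
}
```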

PS: this is the first time I can recall using the quote syntax \\Q...\\E to treat characters in the regex as literals, so I'm chuffed about that.

I would also like to claim some bragging rights for typing the whole thing on my iPhone (note that means there could be a character or two out of place, but it should be pretty close).

You can do this very easily with the string tokenizer. Simply hand the whole string to the tokenizer, then ask for each token and check whether it begins with your delimiter. If it does, extract the contents into your results collection.

The string tokenizer version will be less upped and not as ugly as the regex solution.

Here is the tokenizer version:

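import java.util.StringTokenizer;

import org.junit.Test; // assuming JUnit 4; the original does not show its imports

public class TokenizerTest {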

    @Test
    public void canExtractNamesFromTokens(){
        String openDelimiter = "'''[";
        String closeDelimiter = "]'''";
        String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
            "'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";

        StringTokenizer t = new StringTokenizer(s);

        while (t.hasMoreElements()){
            String token = t.nextToken();
            if (token.startsWith(openDelimiter)){
                String url = token.substring(openDelimiter.length());
                token = t.nextToken();
                String siteName = token.substring(0, token.length()-closeDelimiter.length());
                System.out.println(url + " " + siteName);
            }
        }
    }
}

Not sure how this could get any simpler or cleaner. It is absolutely clear what the code is doing.
