java regex get some parts of a string

Question

I'm trying to use Regex in Java for the first time. I want to get some parts of a string. The string is a little complex:

<description>
  &lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0'
  src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-
  ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' 
  alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text
</description>

I need to get the strings in the href and alt attributes. This is the code I'm using:

for (Element element : elements)
{
    //Elements children = element.children();
    Pattern pattern = Pattern.compile("a\\bhref=*(.html|.htm)>");
    String[] data = pattern.split(element.text()); ...
}

And so on. At the moment I'm trying to get only the href, without success: the result is always the whole string. Is my pattern incorrect? I added the html extension to make the match more specific, but nothing changes.

Answer 1

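import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefAltExtractor {
  public static void main(String[] args) {
    // A Java string literal cannot span lines, so the input is concatenated.
    String sourcestring = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;&lt;img border='0' "
        + "src='http://s2.glbimg.com/zzag70iNYX-QK24sUp0YXQmmXhx7yb8j2Sq2YK7tvX3A6vCwEUOFnFTBONQFT-"
        + "ni/s.glbimg.com/es/ge/f/original/2012/04/25/image.jpg' "
        + "alt='some' title='text' /&gt;&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
    // Lookbehind finds href= or alt= plus the opening quote; the match itself is everything up to the closing quote.
    Pattern re = Pattern.compile("(?<=href='|alt=')[^']*|(?<=href=\"|alt=\")[^\"]*");
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // The pattern has no capturing groups, so group 0 (the whole match) is the only one printed.
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}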

Answer 2

Your regular expression will not find anything useful to you, and as written it can never match.

The following are true in regular expressions:

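* matches zero or more occurrences of the preceding element

. matches any character (except line terminators, by default)

So your current regex is trying to locate strings that match this pattern: an a, a word boundary, the string href, zero or more = characters, then either any character followed by html or any character followed by htm, and finally a > character. Because a and h are both word characters, there is no word boundary between them, so the pattern cannot match anything, and split returns the whole input as a single element. If you want to match special characters such as . literally, you need to escape them.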
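As a rough illustration of the points above, the sketch below runs the pattern from the question against a simplified input and compares it with an escaped variant. The class name EscapeDemo, the ESCAPED pattern, and the helper methods are hypothetical names made up for this example, and the escaped pattern assumes single-quoted attributes as in the question's sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("a\\bhref=*(.html|.htm)>");

    // Hypothetical alternative: the dot is escaped and the value is captured in group 1.
    // Assumes the attribute value is wrapped in single quotes.
    static final Pattern ESCAPED = Pattern.compile("href='([^']*\\.html?)'");

    // True if the question's pattern matches anywhere in the input.
    static boolean originalFinds(String input) {
        return ORIGINAL.matcher(input).find();
    }

    // Returns the first href value ending in .htm or .html, or null if there is none.
    static String firstHref(String input) {
        Matcher m = ESCAPED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<a href='http://testlink.html' alt='some text'>";

        // No match, so split() hands back the whole input as one element.
        System.out.println(originalFinds(input));            // false
        System.out.println(ORIGINAL.split(input).length);    // 1

        // An unescaped dot accepts any character; an escaped dot only a literal dot.
        System.out.println(Pattern.matches(".html", "xhtml"));    // true
        System.out.println(Pattern.matches("\\.html", "xhtml"));  // false

        System.out.println(firstHref(input));                // http://testlink.html
    }
}
```

Note that split is the wrong tool here in any case: it returns the pieces between matches, not the matches themselves, so a Matcher with find() is what extracts the values.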

A better way to form your regex is shown in Alogomorph's example above.

Please look at the Java documentation for regular expressions for more information on what is allowed: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

There are also plenty of other tutorials and examples available on the web.

Answer 3

Do not use regular expressions for this task unless you know for certain that the text format will not change. You seem to want to parse (X|HT)ML using regexps, and that is a bad idea. I'd suggest parsing it as XML and using XPath.
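A minimal sketch of that approach, using only the XML and XPath classes that ship with the JDK, might look like the following. It assumes the escaped markup inside <description> is itself well-formed XML once decoded, which holds for the sample in the question but often not for real-world HTML. The class name DescriptionXPath, the <root> wrapper element, the helper methods, and the shortened image URL are all made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DescriptionXPath {
    // Parses a string as XML; fails with an unchecked exception if it is not well-formed.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Returns the named attribute of the first <a> element embedded in the description.
    static String attributeOfFirstLink(String descriptionXml, String attribute) {
        // First pass: the parser decodes &lt; and &gt;, so the text content is the embedded markup.
        String inner = parse(descriptionXml).getDocumentElement().getTextContent();
        // Second pass: wrap the markup in a hypothetical <root> because it has several top-level nodes.
        Document embedded = parse("<root>" + inner + "</root>");
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//a/@" + attribute, embedded);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid attribute name: " + attribute, e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the question's sample; the image URL is a shortened placeholder.
        String xml = "<description>&lt;a href='http://testlink.html' alt='some text'&gt;"
            + "&lt;img border='0' src='http://example.invalid/image.jpg' alt='some' title='text' /&gt;"
            + "&lt;/a&gt;&lt;br /&gt;some text; some text</description>";
        System.out.println(attributeOfFirstLink(xml, "href")); // http://testlink.html
        System.out.println(attributeOfFirstLink(xml, "alt"));  // some text
    }
}
```

If the embedded markup is not guaranteed to be well-formed, the second parse will throw, and a lenient HTML parser would be a safer choice for that step.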
