
Java regex to match specific url in a html string

I need to get the value between the double quotes (") of an href attribute that matches a specific pattern. I tried the regex below, but I can't figure out what's wrong with it. When the pattern occurs multiple times on the same line, I get one huge group containing text that I don't want:

href="(/namehere/nane2here/(option1|option2).*)"

I need the group between the parentheses. This pattern repeats many times in the string, and all occurrences are on the same line.

Example of a string I'm trying to get the values from:

<div>adasdsda<div>...lots of tags here... <a ... href="/name/name/option1/data1/data2"...anything here ...">src</a>...others HTML text here...<a ... href="/name/name/option2/data1"...

First of all, don't use regex on an entire HTML structure. To learn why, visit:

Instead, try parsing the HTML into an object representing the DOM, which lets us easily traverse all elements and find the ones we are interested in.

One of the easiest-to-use HTML parsers (IMO) can be found at https://jsoup.org/. Its big plus is support for CSS selector syntax for finding elements. The syntax is described at https://jsoup.org/cookbook/extracting-data/selector-syntax, where we can find:

[attr~=regex] : elements with attribute values that match the regular expression; eg
img[src~=(?i)\.(png|jpe?g)]

In short, [attr~=regex] lets us find any element whose specified attribute has a value that the regex matches, even partially.

With this your code can look something like:

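import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String yourHTML =
        "<div>" +
        "   <a href='abc/def/1'>foo</a>" +
        "   <a href='abc/fed/2'>bar</a>" +
        "   <a href='abc/ghi/3'>bam</a>" +
        "</div>";
// Parse the HTML into a DOM tree
Document doc = Jsoup.parse(yourHTML);
// Select <a> elements whose href starts with abc/def or abc/fed
Elements elementsWithHref = doc.select("a[href~=^abc/(def|fed)]");
for (Element element : elementsWithHref) {
    String href = element.attr("href");
    System.out.println(href);
}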

Output:

abc/def/1
abc/fed/2

(notice that there is no abc/ghi/3 since ^abc/(def|fed) can't be found in it)

Try "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(?:(['\"])\\s*((?:(?!\\1).)*?/namehere/nane2here/(?:option1|option2)(?:(?!\\1).)*)\\s*\\1))\\s+(?:\".*?\"|'.*?'|[^>]*?)+>"


Features:

  • finds specific href value contained in any tag
  • group 1 contains delimiter
  • group 2 contains the href value
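To make the group numbering concrete, here is a minimal sketch that runs a pattern of this shape with java.util.regex and collects group 2 from every match. The pattern string is my reading of the one in this answer (non-capturing groups and the (?!\1) lookaheads written out), so treat it as an assumption. The class name TagHrefDemo, the method hrefValues, and the sample HTML are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagHrefDemo {
    // Assumed reading of the answer's pattern.
    // Group 1 = quote delimiter, group 2 = href value.
    private static final Pattern TAG_WITH_HREF = Pattern.compile(
            "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*"
            + "(?:(['\"])\\s*((?:(?!\\1).)*?/namehere/nane2here/(?:option1|option2)(?:(?!\\1).)*)\\s*\\1))"
            + "\\s+(?:\".*?\"|'.*?'|[^>]*?)+>");

    // Returns the href value (group 2) of every tag the pattern matches.
    static List<String> hrefValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TAG_WITH_HREF.matcher(html);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical input: two matching links and one that should be skipped.
        String html = "<div><a class=\"x\" href=\"/namehere/nane2here/option1/data1/data2\">src</a>"
                + "<a href=\"/namehere/nane2here/option2/data1\">y</a>"
                + "<a href=\"/other/path\">z</a></div>";
        hrefValues(html).forEach(System.out::println);
    }
}
```

Because the href is located inside a lookahead, the attribute can appear anywhere in the tag, and the backreference \1 ensures the value ends at the same kind of quote that opened it.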

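\b matches a word boundary, so option1 will not match inside a longer word such as option10. Keep both alternatives inside one group, and use [^"]* instead of .* so the match cannot run past the closing quote:

href="(/namehere/nane2here/(\\boption1\\b|\\boption2\\b)[^"]*)"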
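If a regex is still preferred for a quick one-off, a pattern along these lines could be applied with a Matcher.find() loop. This is only a sketch: it swaps the greedy .* for [^"]*, which cannot cross the closing quote, so each match stays inside a single attribute even when many links share one line. The class name HrefValueDemo, the method hrefValues, and the sample HTML are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefValueDemo {
    // [^"]* stops at the closing quote; \b keeps option1 from matching option10.
    private static final Pattern HREF = Pattern.compile(
            "href=\"(/namehere/nane2here/(?:option1|option2)\\b[^\"]*)\"");

    // Collects group 1 (the href value) of every match in the input.
    static List<String> hrefValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical single-line input with three links.
        String html = "<div><a href=\"/namehere/nane2here/option1/data1/data2\">src</a>"
                + "<a href=\"/namehere/nane2here/option2/data1\">x</a>"
                + "<a href=\"/namehere/nane2here/option10/data\">y</a></div>";
        // Prints:
        // /namehere/nane2here/option1/data1/data2
        // /namehere/nane2here/option2/data1
        hrefValues(html).forEach(System.out::println);
    }
}
```

This assumes every href is written with double quotes and no spaces around the equals sign. For arbitrary HTML, a parser remains the safer choice.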
