
Regex match string before or after and only return one match per set

I am trying to extract certain IDs from HTML code. Part of it works, but I need help with the rest. Here is some sample HTML for the videos:

<video id="movie1" class="show_movie-camera animation_target movieBorder hasAudio movieId_750" src="/path/to/movie" style="position: absolute; z-index: 505; top: 44.5px; left: 484px; display: none;" preload="true" autoplay="true"></video>
<video id="movie2" class="clickInfo movieId_587" src="/path/to/movie" preload="true" autoplay="true"></video>
<video id="movie300" src="/path/to/movie" preload="true" autoplay="true"></video>

To get the movie IDs, I look for movieId_[ID] or movie[ID] using this regex:

.*?<object|<video.*?movie(\\d+)|movieId_(\\d+)[^>]*>?.*?

This works, but it returns both movieId_[ID] AND movie[ID] as matches, rather than just one per tag. What I want is to use movieId_[ID] when it is present and fall back to movie[ID] otherwise. This is what I use:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
int fileId = -1;
while (m.find()) {
    fileId = -1;
    if (m.group(2) != null) {
        // movieId_[ID] from the class attribute
        fileId = Integer.parseInt(m.group(2));
    } else if (m.group(1) != null) {
        // movie[ID] from the id attribute
        fileId = Integer.parseInt(m.group(1));
    }
}

This gives me 1, 750, 2, 587, 300 instead of the 750, 587, 300 that I am looking for.

Additionally, I am looking to get the matches that have the hasAudio class. Here is what I tried with no success:

.*?<object|<video.*?hasAudio.*movieId_(\\d+)|movieId_(\\d+).*hasAudio[^>]*>?.*?

Any help would be appreciated. Thanks!

For the first issue, try the pattern below:

.*?<object|<video[^>]*((?<=movieId_)\d+|(?<=movie)\d+)

To make it work, your Java code would be:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
int fileId = -1;
while (m.find()) {
    fileId = -1;
    // group(1) is null when the <object branch matched
    if (m.group(1) != null) {
        fileId = Integer.parseInt(m.group(1));
    }
}
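
As a rough check, the loop above can be wrapped in a small self-contained program and run against the sample HTML from the question. The class name `MovieIdDemo`, the method `extractIds`, and the `SAMPLE` constant are made up for this sketch. Note that the greedy `[^>]*` backtracks from the end of the tag, so the pattern picks the rightmost ID-like number; on this sample that is the `movieId_` value because the class attribute comes after `id`, and a tag with a different attribute order may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MovieIdDemo {
    // The pattern from the answer, written as a Java string literal (backslashes doubled).
    static final Pattern MOVIE_ID = Pattern.compile(
        ".*?<object|<video[^>]*((?<=movieId_)\\d+|(?<=movie)\\d+)");

    // Sample HTML from the question.
    static final String SAMPLE =
        "<video id=\"movie1\" class=\"show_movie-camera animation_target movieBorder hasAudio movieId_750\" "
      + "src=\"/path/to/movie\" style=\"position: absolute; z-index: 505; top: 44.5px; left: 484px; display: none;\" "
      + "preload=\"true\" autoplay=\"true\"></video>\n"
      + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\" preload=\"true\" autoplay=\"true\"></video>\n"
      + "<video id=\"movie300\" src=\"/path/to/movie\" preload=\"true\" autoplay=\"true\"></video>\n";

    // Collects one ID per matching <video> tag.
    static List<Integer> extractIds(String content) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = MOVIE_ID.matcher(content);
        while (m.find()) {
            // group(1) stays null when the <object branch matched
            if (m.group(1) != null) {
                ids.add(Integer.parseInt(m.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extractIds(SAMPLE)); // prints [750, 587, 300]
    }
}
```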



UPDATE FOR SECOND CONDITION

.*?<object|<video[^>]*hasAudio[^>]*((?<=movieId_)\d+|(?<=movie)\d+)



Explanation

.*?<object                 // Existing part of the regex
|                          // OR capture the movie ID as below
<video[^>]*hasAudio[^>]*   // Part of the full match; [^>]* takes any character except '>'
                           // This makes sure the match does not go beyond the tag
                           // and that hasAudio is part of the matched string
(                          // START: group 1, the movie ID
(?<=movieId_)\d+           // First try to get the ID from movieId_xxx
|                          // OR, if that fails,
(?<=movie)\d+              // fall back to getting the ID from moviexxx
)                          // END: group 1, the movie ID

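Note: because of the top-level |, the first branch .*?<object only ever matches up to <object. It captures nothing, so group 1 is null for those matches.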
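
A minimal Java sketch of the hasAudio pattern, using a shortened version of the question's sample (the style attribute is dropped) plus one made-up tag, `movie4`, in which hasAudio comes after the ID. The class and method names are hypothetical. Since the pattern requires hasAudio to appear before the ID, the `movie4` tag is skipped here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HasAudioDemo {
    // The hasAudio pattern from the answer as a Java string literal.
    static final Pattern AUDIO_MOVIE_ID = Pattern.compile(
        ".*?<object|<video[^>]*hasAudio[^>]*((?<=movieId_)\\d+|(?<=movie)\\d+)");

    static final String SAMPLE =
        "<video id=\"movie1\" class=\"movieBorder hasAudio movieId_750\" src=\"/path/to/movie\"></video>\n"
      + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\"></video>\n"
      + "<video id=\"movie300\" src=\"/path/to/movie\"></video>\n"
      // Made-up tag: hasAudio after the ID, which this pattern does not cover.
      + "<video id=\"movie4\" class=\"movieId_12 hasAudio\" src=\"/path/to/movie\"></video>\n";

    // Collects the ID of every <video> tag where hasAudio precedes the ID.
    static List<Integer> extractAudioIds(String content) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = AUDIO_MOVIE_ID.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                ids.add(Integer.parseInt(m.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extractAudioIds(SAMPLE)); // prints [750]
    }
}
```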


UPDATE 2

<object|<video[^>]*\K(?:hasAudio[^>]*\K(?:(?<=movieId_)\d+|(?<=movie)\d+)|(?:(?<=movieId_)\d+|(?<=movie)\d+)(?=[^>]*hasAudio))

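Here I added a branch for the case where hasAudio comes after the ID. Note that in this regex the full match is the movie ID; there are no capture groups.

The main feature used here is \K, which resets the start of the reported match to the current position, thereby dropping all previously consumed characters from the match. This gets around the limits on variable-length look-behind. Be aware that \K is a PCRE/Perl construct; java.util.regex does not support it, so this pattern will not compile with Pattern.compile.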
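
For Java, one way to get the same result without \K is a two-step approach: first isolate each `<video ...>` opening tag, then search inside that tag. This is a different technique from the single regex above, sketched under the assumption that the fallback ID always sits in the `id="movie[ID]"` attribute; all class, method, and constant names are hypothetical, and the `movie4` tag is made up to cover the trailing-hasAudio case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepMovieIds {
    static final Pattern VIDEO_TAG = Pattern.compile("<video\\b[^>]*>");      // one opening tag
    static final Pattern CLASS_ID = Pattern.compile("movieId_(\\d+)");        // preferred source
    static final Pattern ATTR_ID = Pattern.compile("\\bid=\"movie(\\d+)\"");  // fallback source

    static final String SAMPLE =
        "<video id=\"movie1\" class=\"movieBorder hasAudio movieId_750\" src=\"/path/to/movie\"></video>\n"
      + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\"></video>\n"
      + "<video id=\"movie300\" src=\"/path/to/movie\"></video>\n"
      + "<video id=\"movie4\" class=\"movieId_12 hasAudio\" src=\"/path/to/movie\"></video>\n";

    // Returns one ID per <video> tag; if requireAudio is true, tags without hasAudio are skipped.
    static List<Integer> extract(String content, boolean requireAudio) {
        List<Integer> ids = new ArrayList<>();
        Matcher tags = VIDEO_TAG.matcher(content);
        while (tags.find()) {
            String tag = tags.group();
            // Simplification: a plain substring test, so "hasAudioFoo" would also pass.
            if (requireAudio && !tag.contains("hasAudio")) {
                continue;
            }
            Matcher byClass = CLASS_ID.matcher(tag);
            Matcher byAttr = ATTR_ID.matcher(tag);
            if (byClass.find()) {
                ids.add(Integer.parseInt(byClass.group(1)));
            } else if (byAttr.find()) {
                ids.add(Integer.parseInt(byAttr.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, false)); // prints [750, 587, 300, 12]
        System.out.println(extract(SAMPLE, true));  // prints [750, 12]
    }
}
```

Because the preference for movieId_ is expressed in the if/else rather than in the regex, the result no longer depends on the order of attributes or class names inside the tag.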

