Javascript RegExp returning unwanted characters

I've got this string:

<AdParameters>
    <VpaidClickThrough><![CDATA[http://media.adrcdn.com/ads/exit.html]]></VpaidClickThrough>
    <VpaidClickTracking><![CDATA[]]></VpaidClickTracking> 
    <VpaidPath><![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]]></VpaidPath> 
    <VpaidDuration><![CDATA[]]></VpaidDuration>
    <VpaidId><![CDATA[e322f52bc813f05beacb6fe522a52f20]]></VpaidId>
</AdParameters>
<MediaFiles>
    <MediaFile id="0" maintainAspectRatio="false" scalable="false" delivery="progressive"  width="640" height="360" apiFramework='VPAID' type="application/x-shockwave-flash">  <![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2Fmediafile_lineair_640x360.swf?VpaidId=e322f52bc813f05beacb6fe522a52f20&VpaidPath=http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]]></MediaFile>
<MediaFiles>

And I want to extract from here all the ENCODED URLs. So I'm using this RegExp:

(http\%3A.*)\?|(http\%3A.*)\]\]

But what I get is this:

http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]]
http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2Fmediafile_lineair_640x360.swf?
http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]] 

That's close, but I don't want the trailing "]]" and "?". How do I get the URLs without those ending characters?

It's strange, because when I try my regex at http://regex101.com/r/zS0tZ8 it seems to work perfectly.

Thank you in advance.

In regex101 I believe you are looking at the captured group, but that's not all the regex returns: the match itself is whatever the whole regex matched, not only what's inside the parentheses.
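
To make the difference concrete, here is a minimal sketch that runs the question's regex against one line of the input. The variable names (sample, wholeMatch, captured) are only illustrative:

```javascript
// One line taken from the question's input; `sample` is an illustrative name.
const sample =
  "<VpaidPath><![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]]></VpaidPath>";

// The question's regex (the backslash before % is not needed, so it is dropped).
const m = /(http%3A.*)\?|(http%3A.*)\]\]/.exec(sample);

// m[0] is the whole match, which includes the trailing delimiter.
const wholeMatch = m[0];

// Only one of the two alternatives takes part in a match;
// the group of the other alternative is undefined.
const captured = m[1] || m[2];

console.log(wholeMatch); // ends with "]]"
console.log(captured);   // the same URL without the "]]"
```

Because the pattern has two alternatives with one group each, the URL ends up in group 1 or group 2 depending on which delimiter was found.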

This basically means you've got two ways of solving your issue:

  • return the first captured group. Your regex does the job all right, you just need to return the correct captured value. (BTW, there's no need to escape ]] . You can factor out the common part with (http%3A.*?)(?:\?|]]) , the (?: ) being a non-capturing group.)

  • edit your regex so that the end delimiter isn't part of the match. A lookahead could work, like http%3A.*?(?=\?|]]) (notice there's no need for parentheses anymore), but you could probably achieve the same thing with:

     http%3A[^\]?]*

    The [^ ] means "anything but what's inside the brackets". In JavaScript the ] inside the class has to be escaped, because an unescaped [^] is a complete class of its own there and matches any character.
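
A minimal sketch of the character-class approach, with the ] escaped as JavaScript requires. The input is shortened, illustrative data modeled on the question's string, and the variable names are assumptions:

```javascript
// Shortened, illustrative lines modeled on the question's input.
const input = [
  "<VpaidPath><![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2F]]></VpaidPath>",
  "<MediaFile><![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2Ffile.swf?VpaidId=abc]]></MediaFile>",
].join("\n");

// [^\]?]* consumes characters up to, but not including, the first "]" or "?".
// With the g flag, String.prototype.match returns every whole match as an array.
const urls = input.match(/http%3A[^\]?]*/g);

console.log(urls);
```

Since the delimiter is never consumed, the whole match is already the URL and no capture group is needed.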

There are a number of solutions to this, but this is what I prefer:

http%3A[\w%.]*

This matches only the characters that appear in these encoded URLs (letters, digits, underscore, % and .), so it stops on its own at ? or ] without worrying about what comes afterward.
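
As a rough sketch of how this whitelist behaves, the snippet below runs it on a shortened line from the question and on a made-up host name containing a hyphen. The names encodedUrl, text, found and truncated are illustrative, and my-host.example is a hypothetical host:

```javascript
const encodedUrl = /http%3A[\w%.]*/g;

// Shortened line modeled on the question's MediaFile entry.
const text =
  "<![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2Fmediafile_lineair_640x360.swf?VpaidId=e322]]>";

// The match stops at "?" because "?" is not in the class.
const found = text.match(encodedUrl);
console.log(found);

// Caveat: "-" is not in the class either, so a URL with a hyphen is cut short.
// "my-host.example" is a hypothetical host used only for illustration.
const truncated = "http%3A%2F%2Fmy-host.example%2Fa".match(encodedUrl);
console.log(truncated); // the match ends right before the "-"
```

If the URLs can contain other unencoded characters such as - or ~, the class would need to be extended accordingly, for example [\w%.~-].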

http%3A.*?(?=\?|]])

should do the job.

EDIT: A little explanation:

(?=regex)

...tests the regex without adding the results to the match. It's called "positive lookahead".
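
A small sketch of the lookahead in action, on a shortened line modeled on the question's MediaFile entry (the variable names are illustrative):

```javascript
// Shortened line with two encoded URLs: one ended by "?", one ended by "]]".
const line =
  "<![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2Fmediafile_lineair_640x360.swf" +
  "?VpaidPath=http%3A%2F%2Fmedia.adrcdn.com%2Fads%2F]]>";

// .*? expands one character at a time until the lookahead sees "?" or "]]".
// The lookahead only checks what follows; it adds nothing to the match.
const matches = line.match(/http%3A.*?(?=\?|]])/g);

console.log(matches);
```

Both delimiters stay out of the results, so the array holds the two bare URLs.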

I'm not sure how you used your RegExp, but this should work:

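function extractEncodedURLs(str) {
  // Group 1 captures the URL; the second group only matches the delimiter.
  var pattern = /(http%3A.*?)(\?|]])/g;

  var results = [];
  var match;
  // With the g flag, each exec() call resumes after the previous match.
  while ((match = pattern.exec(str)) !== null) {
    results.push(match[1]); // keep the captured URL, not the whole match
  }
  return results;
}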
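
In environments that support String.prototype.matchAll (ES2020), the same loop could be written more compactly. This is only a sketch of an alternative, and extractWithMatchAll is a hypothetical name, not something from the answer above:

```javascript
// Hypothetical helper; same idea as the exec() loop, with a non-capturing
// group for the delimiter so that group 1 is the only capture.
function extractWithMatchAll(str) {
  const pattern = /(http%3A.*?)(?:\?|]])/g;
  // matchAll yields one match object per hit; keep only capture group 1.
  return Array.from(str.matchAll(pattern), (m) => m[1]);
}

// One line from the question's input, used as a quick check.
const xml =
  "<VpaidPath><![CDATA[http%3A%2F%2Fmedia.adrcdn.com%2Fads%2FAdrime%2F3130343734%2F61112%2F]]></VpaidPath>";
const extracted = extractWithMatchAll(xml);

console.log(extracted);
```

matchAll requires the g flag on the regex and throws a TypeError without it.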
