
How to extract a part of the url using JavaScript and Regex

I want to extract some data from URLs that have the following format:

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&offer=bigglassesMin30_RipoP.&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image

http://www.example.com/cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p&mycracker=ch_vn_clothing_subcategory_Puma&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image&offer=bigglassesMin30_RipoP.

Basically, I want to get rid of &mycracker and its value, &ref and its value, and the domain name part, i.e. http://www.example.com

As can be seen, the useful part of the URL is interspersed with the parts I want to drop, namely &mycracker and its value and &ref and its value.

This is what I am trying:

var mapObj = {"/^(http:\/\/)?.*?\//":"","(&mycracker.+)":"","(&ref.+)":""};
var re = new RegExp(Object.keys(mapObj).join("|"),"gi");
url = url.replace(re, function(matched){
    return mapObj[matched];
});

The idea is to replace all the matching parts at once with an empty string.
But it's not working.

I understand I need to selectively remove those parts of the URL without making any assumptions about their order of appearance, but how should I go about it?

Thanks

The easiest way would be to replace them with an empty string, leaving just the bits you want.

inputStr.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g, '')

Here is a JSFiddle: http://jsfiddle.net/4L6BH/1/

The regex is pretty straightforward. There are essentially two parts joined by |: ^https?:\/\/[^\/]+\/ and &?(mycracker|ref)=[^&]*

The first part matches the scheme and host for any domain (including subdomains), plus the slash that follows. If you only deal with one domain, you could hard-code it, at the cost of flexibility. The s? makes it match both http and https.

The second part matches the parameters we don't care about so they can be scrapped. Since one of them may be the first parameter (and thus not be preceded by an &), the & is optional. The names to remove are listed as alternatives, delimited with a |. Then [^&]* scoops up the value, which is anything up to the next & or the end of the string.

Finally, the g flag makes sure every match is replaced (without it, only the first match, which would be the domain, gets replaced).

We just grab those bits, replace them with an empty string, and voilà.
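For reference, here is a small runnable sketch of the approach above, applied to the second URL from the question. The function name stripUrl and the constant STRIP_RE are only illustrative, not part of the original answer.

```javascript
// Pattern from the answer above: scheme + host + "/" OR an unwanted parameter.
const STRIP_RE = /^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g;

// Hypothetical helper: remove every match in one pass.
function stripUrl(inputStr) {
  return inputStr.replace(STRIP_RE, '');
}

const sample =
  'http://www.example.com/cooks/cooking-dress-wine/~no-order/pr' +
  '?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p' +
  '&mycracker=ch_vn_clothing_subcategory_Puma' +
  '&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a';

console.log(stripUrl(sample));
// cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p
```

Note that the pattern has no i flag, so it is case-sensitive: it would leave a parameter spelled myCracker untouched. Because the & is optional and nothing anchors the name, it would also strip the tail of a longer name that merely ends in ref (for example a hypothetical pref=1 would lose its ref=1 part).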

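When you pass a function to JavaScript's string.replace, its first argument (matched in your code) is the text that was actually matched. Your code seems to expect it to be the regular expression text that was used as a key in mapObj, so the lookup mapObj[matched] comes back undefined. Since every replacement is an empty string anyway, it could just be url.replace(re, '').

The first regex shouldn't start or end with a "/": the keys are plain strings passed to new RegExp, so those slashes are treated as literal characters rather than as delimiters.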
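A minimal sketch of how the approach from the question could look with both points applied. The names patterns and cleanUrl are made up for illustration, and the sample URL is a shortened stand-in for the ones in the question. I have also swapped the greedy .+ for [^&]*, which is my own assumption about the intent: .+ would swallow everything up to the end of the string, including parameters you want to keep.

```javascript
// Pattern sources as plain strings: no surrounding slashes. Inside a string
// passed to RegExp, "/" needs no escaping.
const patterns = [
  '^https?://[^/]+/',   // scheme and host
  '&mycracker=[^&]*',   // parameter and its value, up to the next "&"
  '&ref=[^&]*',
];
const re = new RegExp(patterns.join('|'), 'gi');

// Every match is replaced by the same empty string, so no lookup table is needed.
function cleanUrl(url) {
  return url.replace(re, '');
}

// Why the original callback failed: it receives the matched text, not the pattern.
const mapObj = { '&ref=[^&]*': '' };
const broken = 'a=1&ref=xyz'.replace(/&ref=[^&]*/g, (matched) => mapObj[matched]);
console.log(broken); // "a=1undefined", because mapObj["&ref=xyz"] is undefined

console.log(cleanUrl('http://www.example.com/biglasses/pr?sid=23426x&ref=8be2&mycracker=ch_vn&offer=big'));
// biglasses/pr?sid=23426x&offer=big
```

As in the question, these patterns require a leading &, so a mycracker or ref that comes directly after the ? would not be removed.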

I would go with @samanime's answer, but make a slight change.

Find: /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g Replace: '$1' (capture group 1, written \1 in some other regex flavors)

    ^ https?:// [^/]+      
 |       
    (?:     
         ( \? )               # (1)     
      |  &     
    )     
    (?: mycracker | ref )     
    = [^&]*      
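A quick sketch of this pattern in JavaScript, where the replacement string for capture group 1 is written '$1'. The helper name strip and the short sample URLs are made up for illustration. When the & branch matches, group 1 does not participate and '$1' expands to an empty string.

```javascript
// Domain OR ("?" captured as group 1, or "&") followed by an unwanted parameter.
const re = /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g;

// Hypothetical helper: put back the "?" when it was the separator that matched.
const strip = (url) => url.replace(re, '$1');

console.log(strip('http://www.example.com/pr?sid=1&ref=abc&mycracker=x'));
// /pr?sid=1

console.log(strip('http://www.example.com/pr?ref=abc&sid=1'));
// /pr?&sid=1
```

The second result shows the leftover "?&" when an unwanted parameter comes first, which is the situation the edit that follows tries to handle. Unlike the pattern in the earlier answer, this one does not consume the slash after the host, so the path keeps its leading "/".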

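edit
I don't know which parameters can appear in your URLs, so take this as a parsing note.
I could be way off here, but if the ? is kept as the separator between the path and the
parameter list, a couple of extra cases need handling to keep the remaining query string
well formed: an unwanted parameter directly after the ? that is followed by more parameters,
and one directly after the ? at the end of the string.
Stripping out the vars could then be done like below.
You still need to replace with capture group 1 every time.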

     #  /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g

     # Domain
     ^ https?:// [^/]+ 
  |  
     # (?)var=&
     (?:
          ( \? )               # (1)
          (?: mycracker | ref )
          = [^&]*      
          &                    # &
     )
  |  
     # ?var=(EOS)
     (?:
          \?
          (?: mycracker | ref )
          = [^&]*      
          $                    # EOS
     )
  |  
     # &var=
     (?:
          &     
          (?: mycracker | ref )
          = [^&]*      
     )
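The expanded pattern above can be tried out like this. It is only a sketch: the helper name stripVars and the short sample URLs are assumptions for illustration, and the replacement is written '$1' as JavaScript expects.

```javascript
// Same four alternatives as the layout above, on one line.
const re = /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g;

// Hypothetical helper: only the "(?)var=&" branch captures, so only it keeps the "?".
const stripVars = (url) => url.replace(re, '$1');

console.log(stripVars('http://www.example.com/pr?ref=abc&sid=1&mycracker=x'));
// /pr?sid=1

console.log(stripVars('http://www.example.com/pr?ref=abc'));
// /pr

console.log(stripVars('http://www.example.com/pr?sid=1&ref=abc&offer=z'));
// /pr?sid=1&offer=z
```

One case this does not cover: two unwanted parameters in a row right after the ?. The first match consumes the & that follows it, so the second parameter no longer has a leading ? or & to match on and is left in place.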
