How to extract a part of the URL using JavaScript and Regex

I want to extract some data from URLs that have the following format:

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&offer=bigglassesMin30_RipoP.&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image

http://www.example.com/cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p&mycracker=ch_vn_clothing_subcategory_Puma&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image&offer=bigglassesMin30_RipoP.

Basically, I want to get rid of &mycracker and its value, &ref and its value, and the domain name part, i.e. http://www.example.com

As can be seen, the useful part of the URL is interspersed with the parts I want to remove, namely &mycracker and its value and &ref and its value.

I am trying it like this:

var mapObj = {"/^(http:\/\/)?.*?\//":"","(&mycracker.+)":"","(&ref.+)":""};
var re = new RegExp(Object.keys(mapObj).join("|"),"gi");
url = url.replace(re, function(matched){
    return mapObj[matched];
});

The idea is to replace all the matching parts at once with an empty string, but it's not working.

I understand that I need to selectively remove those parts of the URL without making any assumptions about their order of appearance, but how should I go about it?

Thanks

The easiest way would be to replace them with an empty string, leaving just the bits you want.

inputStr.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g, '')

Here is a JSFiddle: http://jsfiddle.net/4L6BH/1/

The regex is pretty straightforward. There are essentially two parts joined by an alternation: ^https?:\/\/[^\/]+\/ and &?(mycracker|ref)=[^&]*

The first part matches any domain (with any sub-domains). If you are only dealing with one domain, you could restrict it to just that domain (but that would also reduce flexibility). It matches both the http and https protocols (hence the s?).

The second part matches the parameters that we don't care about and scraps them. Since they may come first in the query string (and thus not have a leading &), the & is optional. Then come the names of the parameters we want to remove, delimited with a |. Then we scoop up the value, which is everything up to the next & or the end of the string.

The last special bit: we add the g flag to make sure it replaces all matches (without it, only the first match, which would be the domain, is replaced).

We just grab those bits, replace them with an empty string, and voilà.
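As a quick check, the replacement above can be wrapped in a small function and run against one of the URLs from the question. The function name stripUrl is made up for illustration, and this is only a sketch: for instance, it would also strip the tail of a parameter whose name merely ends in "ref", such as href=.

```javascript
// Hypothetical helper wrapping the one-line replace from the answer.
function stripUrl(inputStr) {
  return inputStr.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g, '');
}

// Second sample URL from the question, split across lines for readability.
const sample = 'http://www.example.com/cooks/cooking-dress-wine/~no-order/pr' +
  '?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p' +
  '&mycracker=ch_vn_clothing_subcategory_Puma' +
  '&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a';

const cleaned = stripUrl(sample);
console.log(cleaned);
// cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p

// Same pattern without the g flag: only the first match (the domain) is removed,
// so the unwanted parameters are still present.
const withoutG = sample.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/, '');
console.log(withoutG.includes('&ref=')); // true
```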

The JavaScript string.replace function passes the text that was matched as the matched parameter. The code seems to expect it to contain the regular expression text that was used as a key in mapObj. Perhaps it should just be url.replace(re,'')

The first regex shouldn't start or end with a "/".
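To make the point about the callback concrete, here is a small sketch. The URL and the simplified patterns are made up for illustration and are not the asker's exact code. It shows that the callback receives the matched substring, so looking it up in mapObj yields undefined, which replace then inserts as the text "undefined".

```javascript
// Illustrative input, not taken from the question.
const url = 'http://www.example.com/pr?sid=1&ref=abc';

// Keys are regex source strings, as in the question's approach.
const mapObj = { '&ref=[^&]*': '' };
const re = new RegExp(Object.keys(mapObj).join('|'), 'gi');

const seen = [];
const broken = url.replace(re, (matched) => {
  seen.push(matched);     // '&ref=abc' (the matched text, not the key)
  return mapObj[matched]; // undefined, because '&ref=abc' is not a key
});
console.log(seen);   // [ '&ref=abc' ]
console.log(broken); // http://www.example.com/pr?sid=1undefined

// Since every replacement is the empty string, no lookup is needed.
// The pattern strings are passed to RegExp without surrounding slashes.
const patterns = ['^https?://[^/]+/', '&?(?:mycracker|ref)=[^&]*'];
const fixedRe = new RegExp(patterns.join('|'), 'gi');
const fixed = url.replace(fixedRe, '');
console.log(fixed); // pr?sid=1
```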

I would go with @samanime's answer, but make a slight change.

Find: /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g Replace: '\1' (in JavaScript, the replacement string is written '$1')

    ^ https?:// [^/]+      
 |       
    (?:     
         ( \? )               # (1)     
      |  &     
    )     
    (?: mycracker | ref )     
    = [^&]*      
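In JavaScript, the find/replace pair above could be sketched as follows. The function name stripParams and the sample URLs are assumptions made for illustration. Capture group 1 holds the ? only when the removed parameter came first in the query string, so '$1' puts the ? back in that case and inserts nothing otherwise.

```javascript
// Hypothetical helper: removes the domain plus the mycracker/ref parameters,
// re-inserting the "?" (capture group 1) when the removed parameter followed it.
function stripParams(url) {
  return url.replace(/^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g, '$1');
}

// Parameter in the middle: removed together with its leading "&".
const middle = stripParams('http://www.example.com/pr?sid=1&ref=abc&offer=x');
console.log(middle); // /pr?sid=1&offer=x

// Parameter first: the "?" is preserved, but the "&" that followed it stays too.
const first = stripParams('http://www.example.com/pr?ref=abc&sid=1');
console.log(first); // /pr?&sid=1
```

Note that this version keeps the leading slash of the path, because the domain part of the pattern no longer consumes it.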

Edit
I don't know which parameters can appear in the URL lines, so this is just a parsing note.
Stripping out the vars could be done as below.
I could be way off here, but if the ? is used as the domain/parameter list separator, a couple of extra conditions might apply to keep the query string well-formed.
You still need to replace with capture group 1 every time.

     #  /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g

     # Domain
     ^ https?:// [^/]+ 
  |  
     # (?)var=&
     (?:
          ( \? )               # (1)
          (?: mycracker | ref )
          = [^&]*      
          &                    # &
     )
  |  
     # ?var=(EOS)
     (?:
          \?
          (?: mycracker | ref )
          = [^&]*      
          $                    # EOS
     )
  |  
     # &var=
     (?:
          &     
          (?: mycracker | ref )
          = [^&]*      
     )
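A JavaScript sketch of the expanded pattern, again replacing with '$1'. The function name stripTracking and the sample URLs are made up for illustration, and only the three cases shown here were considered; other combinations (for example, two unwanted parameters in a row right after the ?) may need further handling.

```javascript
// Hypothetical helper using the four-alternative pattern:
//   1. the domain
//   2. "?var=...&"  -> keep the "?" (group 1), drop the parameter and its "&"
//   3. "?var=..." at end of string -> drop everything, including the "?"
//   4. "&var=..."   -> drop the parameter and its leading "&"
const trackingRe = /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g;

function stripTracking(url) {
  return url.replace(trackingRe, '$1');
}

const firstParam = stripTracking('http://www.example.com/pr?ref=abc&sid=1');
const onlyParam = stripTracking('http://www.example.com/pr?ref=abc');
const lastParam = stripTracking('http://www.example.com/pr?sid=1&ref=abc');

console.log(firstParam); // /pr?sid=1
console.log(onlyParam);  // /pr
console.log(lastParam);  // /pr?sid=1
```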
