JavaScript string split with regex

I am trying to split a string using a regular expression for links (URLs).

The regex in question is

var regex = new RegExp('(?:^(?:(?:[a-z]+:)?//)(?:\S+(?::\S*)?@)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[/?#]\S*)?$)')

If I do

regex.test("https://google.com"); // returns true

but doing

"Go to https://google.com".split(regex);
// returns ["Go to https://google.com"]

whereas I expect it to return

["Go to ", "https://google.com"]

Any idea what's going on here?

First of all, you're building your regex from a string literal, which means that you have to escape your backslashes (a backslash has a special meaning in strings; \n is the line feed character, for example):

var regex = new RegExp('(?:^(?:(?:[a-z]+:)?//)(?:\\S+(?::\\S*)?@)?(?:localhost|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:[/?#]\\S*)?$)');
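To make the escaping issue concrete, here is a minimal sketch using a deliberately tiny pattern (\d+, not the URL regex above). The variable names are only illustrative. It shows how a single backslash is silently dropped by the string literal before RegExp ever sees it:

```javascript
// In a string literal, an unrecognized escape such as '\d' collapses to plain 'd'.
const unescaped = new RegExp('\d+');  // equivalent to /d+/
const escaped = new RegExp('\\d+');   // equivalent to /\d+/

console.log(unescaped.source);      // d+
console.log(escaped.source);        // \d+
console.log(unescaped.test('123')); // false: it looks for the letter "d"
console.log(escaped.test('123'));   // true: it looks for digits
```

Checking the .source property like this is a quick way to see what pattern the engine actually received.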

Another solution would be to use a regex literal, since JavaScript provides one, but you would then have to escape the forward slashes:

var regex = /(?:^(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?@)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?$)/;

Next, your regex has to match the entire input because of the ^ and $ anchors. If you remove them (or better, replace them with word boundaries \b), you'll be able to find URLs inside a longer string:

var regex = /(?:\b(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?@)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?\b)/;
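The effect of the anchors can be seen with a much shorter pattern. The sketch below assumes a hypothetical, simplified URL pattern (https?:// followed by non-whitespace) purely for readability; it is not a substitute for the full regex above:

```javascript
// Hypothetical simplified URL patterns, for illustration only.
const anchored = /^https?:\/\/\S+$/;   // must match the whole input
const unanchored = /\bhttps?:\/\/\S+/; // may match anywhere in the input

const text = 'Go to https://google.com';

console.log(anchored.test('https://google.com')); // true
console.log(anchored.test(text));                 // false: "Go to " precedes the URL

const found = text.match(unanchored);
console.log(found[0]);    // https://google.com
console.log(found.index); // 6
```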

But the main point is that you're misunderstanding how split works. Given the string "hello world", if you split on the space you'll end up with ["hello", "world"]: the space is gone, since it was the character used to split.

In the same way, if you split by the URL regex, the output array won't contain the URLs anymore. It seems to me that a lookahead could suit your needs:

var regex = /(?=(?:\b(?:(?:[a-z]+:)?\/\/)(?:\S+(?::\S*)?@)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?\b))/;
"Go to https://google.com".split(regex) // ["Go to ", "https://google.com"]

The regex explained:

(?=(?:\b(?:(?:[a-z]+:)?//)(?:\S+(?::\S*)?@)?(?:localhost|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[/?#]\S*)?\b))


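When you split a string with a positive lookahead (?=content_of_lookahead), the split happens at every position between characters that is immediately followed by the content of the lookahead. The lookahead consumes no characters, so nothing is removed from the output.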
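The difference between a consuming split and a lookahead split is easier to see side by side. This sketch again assumes a hypothetical simplified URL pattern, and adds a third variant that is not part of the answer above: wrapping the separator in a capturing group, which makes split keep the matched text as its own array element.

```javascript
// Hypothetical simplified URL pattern, used only to keep the example readable.
const sentence = 'Go to https://google.com today';

// Consuming split: the URL is the separator, so it is dropped.
const dropped = sentence.split(/\bhttps?:\/\/\S+/);
console.log(dropped);   // ["Go to ", " today"]

// Lookahead split: zero-width match, nothing is consumed.
const lookahead = sentence.split(/(?=\bhttps?:\/\/\S+)/);
console.log(lookahead); // ["Go to ", "https://google.com today"]

// Capturing group: the matched separator is kept as a separate element.
const captured = sentence.split(/(\bhttps?:\/\/\S+)/);
console.log(captured);  // ["Go to ", "https://google.com", " today"]
```

Note that the lookahead only cuts before each URL, so any text following the URL stays attached to it. If the URL needs to be isolated on both sides, the capturing-group form may be a better fit.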

Take a look at "8 Regular Expressions You Should Know".

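To match a URL, you can use the following regex. Note that it has to be a regex literal (split treats a quoted string as a literal separator, not as a pattern), and that split would remove the URL from the output, so use match to extract it:

var regex = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w# \.-]*)*\/?$/;

"Go to https://google.com".match(regex)[0];
// returns "https://google.com"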

Live example.

Hope this helps.
