Exclude URL pattern in regex

This is my input string:

<div>http://google.com</div><span data-user-info="{\\"name\\":\\"subash\\", \\"url\\" : \\"http://userinfo.com?userid=33\\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.

In this case I need to match only the first and last occurrences of http, because those are innerText from an HTML point of view. Any http inside attribute values should be ignored. I built the following regex:

(?<!href=\"|src=\"|value=\"|href=\'|src=\'|value=\'|=)(http://|https://|ftp://|sftp://)

It works fine for the first and last occurrences, but it also matches the second occurrence of http. That link sits inside an attribute value, so it should not be matched.

FYI: I also tried a negative lookahead, but it does not seem to help. This is the version with the negative lookahead:

(?<!href=\"|src=\"|value=\"|href=\'|src=\'|value=\'|=)(http://|https://|ftp://|sftp://).*?(?!>)

Update after getting more details

Another approach is to take advantage of regex "greediness". /(http).*(http)/g will match as much text as possible, from the first to the last occurrence of "http". The example below illustrates this behavior. The (http) parts are capturing groups; replace them with your full regex. I simplified the regex to make it easier to understand.

var text = '<div>http://google.com</div><span data-user-info="{\"name\":\"subash\", \"url\" : \"http://userinfo.com?userid=33\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.';
var regex = /(http).*(http)/g;
// exec returns null when the text contains fewer than two occurrences of "http"
var match = regex.exec(text);
// match[0] is the entire matched text
var firstMatch = match[1]; // = "http"
var lastMatch = match[2]; // = "http"

This example is specific to JavaScript, but Java regexes (and many other regex engines) work the same way. The same pattern, (http).*(http), would work there too.
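
As a rough sketch of the "replace the capturing groups with your full regex" step, the greedy pattern can capture whole URLs instead of just the scheme. The urlPart sub-pattern and the variable names here are assumptions made for illustration, not part of the original answer, and the trick only helps when the first and last URLs in the text happen to be the ones you want.

```javascript
// Sample input from the question, written as a JavaScript string literal.
const html = '<div>http://google.com</div><span data-user-info="{\"name\":\"subash\", \"url\" : \"http://userinfo.com?userid=33\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.';

// Hypothetical URL sub-pattern: a scheme followed by characters that are
// not whitespace, quotes, or angle brackets.
const urlPart = '(?:https?|s?ftp)://[^\\s<>"\']+';

// Greedy ".*" between the two groups stretches from the first URL to the last one.
// The "s" flag lets "." cross line breaks as well.
const greedy = new RegExp('(' + urlPart + ').*(' + urlPart + ')', 's');

const greedyMatch = greedy.exec(html);
// exec returns null when the text contains fewer than two URLs.
const firstUrl = greedyMatch ? greedyMatch[1] : null;
const lastUrl = greedyMatch ? greedyMatch[2] : null;

console.log(firstUrl, lastUrl);
```

With the question's input, firstUrl and lastUrl should come out as http://google.com and http://byebye.com; the two URLs inside attribute values are swallowed by the greedy middle part rather than being excluded by any HTML-aware rule.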


Do you aim to match the first and the last line, or the first and the last occurrence of a string?

If the former, I would split the text into lines first, and then regex-match the first and the last line.

//Split into lines:
var lines = yourMultiLineText.split(/[\r\n]+/g);
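
A minimal sketch of the line-based variant might look like this. The input text, the schemeRegex pattern and the variable names are made up for illustration.

```javascript
// Hypothetical multi-line input; only its first and last lines are inspected.
const multiLineText = 'http://first.example\nsome text in between\r\nhttps://last.example is the end';

// Split on any run of CR/LF characters and drop empty entries
// (a leading or trailing line break would otherwise produce one).
const allLines = multiLineText.split(/[\r\n]+/).filter((line) => line.length > 0);

// Assumed scheme pattern, mirroring the alternatives listed in the question.
const schemeRegex = /(?:https?|s?ftp):\/\//;

// Take the first and the last line (or whatever is there if fewer than two).
const edgeLines = allLines.length > 1
  ? [allLines[0], allLines[allLines.length - 1]]
  : allLines.slice();

// For each of those lines, keep the matched scheme or null when nothing matches.
const edgeMatches = edgeLines.map((line) => {
  const m = line.match(schemeRegex);
  return m ? m[0] : null;
});

console.log(edgeMatches);
```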

If the latter, find all matches with your basic pattern and take the first and the last one from the array of matches, e.g.:

//Match using a simpler regex (it needs the g flag to return every match)
var matches = yourMultiLineText.match(yourRegex);
//Store the result here
var result;
//match() returns null when nothing matches, so check for that first.
//There must be at least 2 matches in total for this to make sense.
if (matches && matches.length > 1) {
   //Grab the first and the last match.
   result = [matches[0], matches[matches.length - 1]];
} else {
   result = [];
}
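
Applied to the input string from the question, the same idea could be sketched as follows. The firstAndLast helper and the urlRegex pattern are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: returns [first, last] match, or [] when there are fewer than two.
// The regex is expected to carry the g flag so that match() returns every occurrence.
function firstAndLast(text, regex) {
  const found = text.match(regex) || []; // match() yields null when nothing matches
  return found.length > 1 ? [found[0], found[found.length - 1]] : [];
}

// Sample input from the question, written as a JavaScript string literal.
const sampleText = '<div>http://google.com</div><span data-user-info="{\"name\":\"subash\", \"url\" : \"http://userinfo.com?userid=33\"}"></span><a href="https://contact.me"></a>http://byebye.com is a dummy website.';

// Assumed URL pattern: a scheme followed by characters that are not
// whitespace, quotes, or angle brackets.
const urlRegex = /(?:https?|s?ftp):\/\/[^\s<>"']+/g;

const firstLast = firstAndLast(sampleText, urlRegex);
console.log(firstLast); // first is http://google.com, last is http://byebye.com
```

The pattern finds four URLs in this input, including the two inside attribute values; the helper simply discards everything between the first and the last.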
