
Regex to find URL in string wrapped in HTML or not

There are literally hundreds of questions here on SE (and on the web in general) about this issue, and I have tried a lot of them, but I cannot find the ultimate catch-all regular expression.

Feel free to jump to the TL;DR version below.

I need to parse a string to catch all URLs.

This is what I am using now (the closest I have got to something that works):

$content = preg_replace_callback( '/((http[s]?:|www[.])[^\s]*)/i', 'my_callback', $content );

The problem is that it does not catch all URLs:

    http://designscrazed.com/personal-wordpress-blog-themes/ <-- OK
    https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template <-- OK
    www.tuicool.com/articles/rqAzU3   <-- OK
    html5up.net/overflow/   <-- NOT WORKING
    http://www.tuicool.com/articles/rqAzU3    <-- OK
    http://live.btoa.com.au/spotfinder/docs/#ByVCPlik   <-- OK
    www.designrazzi.com/2013/free-css3-html5-templates/    <-- OK
    themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/   <-- NOT WORKING
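
The two misses follow from the pattern itself: every match has to begin with `http:`, `https:` or `www.`, so a bare host name never gives the regex a place to start. A quick check, reusing the pattern above on three of the sample URLs:

```php
<?php
// Pattern copied from the preg_replace_callback() call above.
$pattern = '/((http[s]?:|www[.])[^\s]*)/i';

$samples = array(
    'http://designscrazed.com/personal-wordpress-blog-themes/',
    'www.tuicool.com/articles/rqAzU3',
    'html5up.net/overflow/',
);

$results = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample);
    echo $sample, ' => ', $results[$sample] ? 'matched' : 'no match', "\n";
}
```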

I also tried without the www part:

$content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'my_callback', $content );

and even

 $content = preg_replace_callback( '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i', 'my_callback', $content );

None of the three works for URLs wrapped in an HTML link.

For example, in a link like

 <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

it will catch the URL almost correctly, but will leave the HTML part after it:

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

producing

THIS WAS CAUGHT" target="_blank">SE</a>
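
With the first two patterns, `[^\s]*` only stops at whitespace, so the match runs up to and including the closing quote of the `href` attribute, and nothing tells the pattern that it is inside a tag. A minimal sketch of one way to stop earlier, assuming that quotes and angle brackets never belong to the URL (an assumption, since a single quote is technically allowed in a URL):

```php
<?php
$html = '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

// Original character class: runs until the next whitespace.
preg_match('/((http[s]?:|www[.])[^\s]*)/i', $html, $greedy);

// Assumed variant: also stop at double quotes, single quotes and angle brackets.
preg_match('/((http[s]?:|www[.])[^\s"\'<>]*)/i', $html, $trimmed);

echo $greedy[0], "\n";   // ends with the closing quote of href
echo $trimmed[0], "\n";  // ends with #131971
```

This only addresses the trailing markup; protocol-less URLs are still missed by both patterns.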

The TL;DR version:

I basically need a regex that cleanly catches all URLs of the following variants:

http://www.example.com
http://example.com/
http://www.example.com/seconday/somepage#hashes?parameters
http://www.example.com/seconday/
http://www.example.com/seconday
http://example.com/seconday
http://example.com/seconday/

All of the above with http, https, or without a protocol prefix (e.g. example.com/seconday).

On top of that, all of those can be wrapped in HTML, like

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank" some_attribute='somevalue' >SE</a>

EDIT I (after comments)

I write "can" because some URLs are also "free standing", where methods like DOM parsing with DOMDocument or SimpleHTMLDOM would fail because the URLs are not inside an HTML <a> tag or do not have href attributes (as in the comment: think of parsing this very page, with this question itself. How can DOM parsing catch the URLs that are inside a <code> tag?).
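
To illustrate the limitation with a small, made-up fragment (the markup below is an assumption for the example, not taken from a real page): collecting `href` attributes with DOMDocument returns the linked URL but not the free-standing one inside `<code>`.

```php
<?php
$html = '<p>Linked: <a href="http://example.com/seconday/">SE</a>. '
      . 'Free-standing: <code>http://example.com/seconday</code></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the href attribute of every <a> element.
$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);   // only the URL from the <a> tag is listed
```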

Okay, so I took a stab at this and came up with the following regex. I'm sure it's not going to catch everything, but it does seem to catch all of the URLs you've listed on this page. Here is an example:

// HERE IT IS LOOPING THROUGH AN ARRAY
$url_array = array('http://www.example.com', 'http://example.com/', 'http://www.example.com/seconday/somepage#hashes?parameters', 'http://www.example.com/seconday/', 'http://www.example.com/seconday', 'http://example.com/seconday', 'http://example.com/seconday/', 'http://designscrazed.com/personal-wordpress-blog-themes/', 'https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template', 'www.tuicool.com/articles/rqAzU3', 'html5up.net/overflow/', 'http://www.tuicool.com/articles/rqAzU3', 'http://live.btoa.com.au/spotfinder/docs/#ByVCPlik', 'www.designrazzi.com/2013/free-css3-html5-templates/', 'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/', '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>');

$extension_array = array('com', 'net', 'org', 'biz');

foreach ($url_array AS $url) {

    print '<br>'.$url;
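    // The extension list gets its own group so the dot applies to every extension, not only the first.
    if (preg_match('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url, $m)) {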
        print "<pre><font color='orange'>"; print_r($m); print "</font></pre>";
    }

}
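
One thing to watch when the extension list is spliced into a pattern with implode(): the alternation needs a group of its own, otherwise `\.` binds only to the first alternative and the remaining extensions match without a dot. A small sketch of the difference, using stripped-down patterns that are assumptions for this illustration:

```php
<?php
$extension_array = array('com', 'net', 'org', 'biz');
$alternation = implode('|', $extension_array);   // com|net|org|biz

// Ungrouped: "\." belongs to "com" only, so "net", "org" and "biz" match without a dot.
$ungrouped = '~[-A-Z0-9.]+(?:\.' . $alternation . ')~i';

// Grouped: the dot is required in front of every extension.
$grouped = '~[-A-Z0-9.]+\.(?:' . $alternation . ')~i';

$text = 'the internet is not a URL';

$ungrouped_hit = preg_match($ungrouped, $text, $m1);   // matches the word "internet"
$grouped_hit   = preg_match($grouped, $text, $m2);     // no match

var_dump($ungrouped_hit, $grouped_hit);
```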

Or here is the same thing, but using a string of text like the one you are actually working with:

$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ionsipn  http://www.example.com/seconday/somepage#hashes?parameters opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday wwwe http://example.com/seconday               http://example.com/seconday/ 00000002222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeorop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;sl2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ asdf themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l  www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

$extension_array = array('com', 'net', 'org', 'biz');

if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url_string, $m)) {
    print "<pre><font color='red'>"; print_r($m); print "</font></pre>";
}

Also, you can add the pattern modifiers 'ms' if you are searching through multiple lines instead of a single line like I have in my example.
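
As far as I can tell, the modifiers are harmless but optional for a pattern of this shape: `m` only changes what `^` and `$` match, and `s` only changes what an unescaped `.` outside a character class matches, and the pattern uses neither. A quick check, using a simplified stand-in for the pattern (an assumption for this sketch) on a three-line string:

```php
<?php
// Simplified stand-in for the pattern above (an assumption for this sketch).
$body = '(?:https?://)?(?:www\.)?[-A-Z0-9.]+\.(?:com|net|org|biz)(?:[-A-Z0-9_.#?/]+)?';

$text = "first http://example.com/seconday/\n"
      . "second www.tuicool.com/articles/rqAzU3\n"
      . "third html5up.net/overflow/";

preg_match_all('~' . $body . '~i', $text, $without);
preg_match_all('~' . $body . '~ims', $text, $with);

print_r($without[0]);
var_dump($without[0] === $with[0]);   // same matches with and without "ms"
```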

EDIT:

There is an error in the previous code: the matching line uses $url_string, but I had named the variable $urls_as_string when I set the content. If you correct the variable name, it should work as expected.

Anyway, I took the code above and modified it to work with preg_replace_callback as you had requested. This seems to work with all of the URLs that you had listed. Check it out:

// CREATE THE STRING
$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ion
sipn  http://www.example.com/seconday/somepage#hashes?parameters 



opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday ww
we http://example.com/seconday               http://example.com/seconday/ 000000
02222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeo
rop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;s
l2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ as
df themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l 
 www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';


// SET SOME DOMAIN EXTENSIONS
$extension_array = array('com', 'net', 'org', 'biz');



// CHECK TO SEE IF OUR REGEX IS WORKING ... PRINT OUT ALL OF THE MATCHES
if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', $urls_as_string, $m)) {
    print_r($m);
}



// USE PREG_REPLACE_CALLBACK TO FORMAT THE URLS
$content = preg_replace_callback( '~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', 'my_callback', $urls_as_string);



// PRINT OUT THE FINISHED STRING
print "\n\n\n\nFINAL OUTPUT: \n".$content;



// THIS FUNCTION DOES A CRAPTASTIC JOB AT FORMATTING URLS
function my_callback($m) {

    $url = $m[0];
    $url_formatted = $url;

    if (!preg_match('~^http(s)?://~', $url)) {
        $url_formatted = 'http://'.$url;
    }

    // Use the protocol-prefixed URL for href and keep the original text as the label.
    $url_formatted = '<a href="'.$url_formatted.'">'.$url.'</a>';

    return $url_formatted;

}
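
For comparison, here is a slightly more careful callback, as a sketch only: the name `linkify_callback` and the shortened pattern are assumptions for this example. It adds a scheme to protocol-less matches and escapes both the attribute value and the link text.

```php
<?php
// Hypothetical replacement for my_callback(); the name is an assumption.
function linkify_callback(array $m)
{
    $url  = $m[0];
    $href = $url;

    // Protocol-less matches such as html5up.net/overflow/ need a scheme to be clickable.
    if (!preg_match('~^https?://~i', $href)) {
        $href = 'http://' . $href;
    }

    // Escape both the attribute value and the link text.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
         . htmlspecialchars($url, ENT_QUOTES) . '</a>';
}

// Shortened pattern, used here only to drive the callback.
$pattern = '~(?:https?://)?(?:www\.)?[-A-Z0-9.]+\.(?:com|net|org|biz)(?:[-A-Z0-9_.#?/]+)?~i';
$linked  = preg_replace_callback($pattern, 'linkify_callback', 'see html5up.net/overflow/ today');

echo $linked, "\n";
```

Running a callback like this over text that already contains `<a href="...">` markup would wrap the URL inside the attribute a second time, so input of that kind probably needs to be handled separately.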

Here is a working demo of the code.

The callback function I wrote is pretty stupid, but I'm assuming you already have a function that you are going to use. This is just to demonstrate that it is doing what it's supposed to be doing. Hopefully this solution solves your problem. If not, let me know and I can work on it some more.
