Twitter style URL Regex Matching

I'm trying to achieve a very lax regex match for a chat client using PHP.

The chat client must be able to pick up both complete and incomplete URLs.

For example:

http://www.example.com or www.example.com or example.com

I have set up a preg_replace that tries to achieve this:

$find = array("/([\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/is","/(^(?!http:\/\/)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,4}(\/?\S*)?)/is");
$replace = array( "<a target=\"_blank\" href=\"http://\\1\">\\1</a>","<a target=\"_blank\" href=\"\\1\">\\1</a>");
$output = preg_replace($find, $replace, $input); 

So, the aim is to first find "complete" URLs with the protocol, then try to find "lazy" URLs that do not have the protocol.

Currently it works great for the "complete" URLs, but the "lazy" URLs do not get picked up.

Any help will be greatly appreciated.

Thanks.

I set up something similar a while ago. My thinking was: anything that starts with a protocol identifier or a "www" is a URL, plus anything that matches a domain that ends in a valid TLD (two letters, or a known gTLD) if it's followed by a path. Domains by themselves are just domains.

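$gtlds  = "com|net|org|biz|edu|gov|int|pro|xxx";
$gtlds .= "|aero|arpa|asia|coop|museum|name|travel";  // .= concatenates; += is arithmetic in PHP
#$gtlds .= "|xn--[a-z0-9]+";
$a = array(
  '/(f|ht)tps?:\/\/[^ ]+/',                             // anything with a protocol identifier
  '/(ftp|www)\.[a-z0-9.-]+(\/[^ ]*)?/',                 // "ftp." or "www." prefix, optional path
  "/([a-z0-9][a-z0-9-]*\.)+([a-z]{2}|$gtlds)\/[^ ]*/"   // valid TLD, only if a path follows
);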

Note that I'm an old-school regexp user, so this is ERE, not that fancy PREG stuff all the kids are using these days.

The absurdly long list of gTLDs is from IANA. I've updated it so it's valid as of the time of this answer, except for .XN--*. You can include the TLDs that start with .XN-- if you like, either with a pattern or by matching them directly and growing the $gtlds variable. I have never encountered any problems caused by simply ignoring their existence, so that's my strategy.

The above REs worked for my specific use. I make no claim that they'll work for every case that is not mine. (For example, they will include trailing quotes if a domain or URL is quoted. That was never something I had to deal with, so I didn't deal with it.)
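To try the three patterns against sample input, something like the following may help. It is only a sketch: the `~` delimiters, the `i` flag, the leading `\b`, the optional path on the "www" pattern, and the `containsUrl()` helper are my assumptions for illustration, not part of the answer's original patterns.

```php
<?php
// Build the gTLD alternation with ".=" (string concatenation).
$gtlds  = 'com|net|org|biz|edu|gov|int|pro|xxx';
$gtlds .= '|aero|arpa|asia|coop|museum|name|travel';

// "~" is used as the delimiter so that "/" needs no escaping.
$patterns = array(
    '~(f|ht)tps?://[^ ]+~i',                                  // explicit protocol
    '~\b(ftp|www)\.[a-z0-9.-]+(/[^ ]*)?~i',                   // "ftp." or "www." prefix
    "~\\b([a-z0-9][a-z0-9-]*\\.)+([a-z]{2}|$gtlds)/[^ ]*~i",  // valid TLD followed by a path
);

// Hypothetical helper: true if any of the patterns finds a URL in $text.
function containsUrl(string $text, array $patterns): bool
{
    foreach ($patterns as $p) {
        if (preg_match($p, $text) === 1) {
            return true;
        }
    }
    return false;
}

var_dump(containsUrl('go to http://example.com now', $patterns)); // explicit protocol
var_dump(containsUrl('go to www.example.com now', $patterns));    // "www." prefix
var_dump(containsUrl('go to example.com/path now', $patterns));   // domain plus path
var_dump(containsUrl('example.com is a domain', $patterns));      // bare domain, not linked
```

The last call is the "domains by themselves are just domains" case: without a protocol, a "www" prefix, or a path, nothing matches.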

Note that when you're doing your replacement, while you want your generated anchor's HREF attribute to be the URL that you match or generate, you probably want to leave the original text as-is for purposes of layout and display.
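One way to keep the visible text as typed while still giving the anchor a usable HREF is a callback replacement. A minimal sketch, assuming a simplified pattern (explicit http/https or a bare "www." prefix) and a hypothetical `linkify()` helper, neither of which comes from the patterns above:

```php
<?php
// Hypothetical helper: wraps matches in anchors, adding a scheme to the
// href only, so the text shown to the user stays exactly as typed.
function linkify(string $text): string
{
    // Assumed pattern: an explicit http(s) scheme, or a bare "www." prefix.
    $pattern = '~\b(?:https?://|www\.)[^\s<>"]+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $shown = $m[0];
        // Prepend a scheme only when the user did not type one.
        $href = preg_match('~^https?://~i', $shown) ? $shown : 'http://' . $shown;
        return '<a target="_blank" href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
            . htmlspecialchars($shown, ENT_QUOTES) . '</a>';
    }, $text);
}

echo linkify('see www.example.com/page for details'), "\n";
```

Only the matched piece is escaped here; whether the rest of the message needs escaping depends on how the chat client handles its input.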

And depending on how you use these, word boundaries may be helpful ... but you already know how to do those.

So after hours of toiling with it, I managed to find a relatively easy way to match both http:// URLs and www. URLs in order to turn them into anchors.

This is the final solution:

$output = preg_replace("/\b((http(s)?:\/\/)?(www\.[a-zA-Z0-9\/\\\:\?\%\.\&\;=#\-\_\!\+\~\,]*))/is","<a target=\"_blank\" href=\"http$3://$4\">$0</a>",$output);

Thanks to tamouse for the regex.
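For anyone wanting to see what that pattern does on sample input, here is a small sketch. The `linkWwwUrls()` wrapper and the sample strings are hypothetical; the pattern and replacement are copied unchanged from the line above.

```php
<?php
// Hypothetical wrapper around the pattern above, for trying it on sample input.
function linkWwwUrls(string $text): string
{
    return preg_replace(
        "/\b((http(s)?:\/\/)?(www\.[a-zA-Z0-9\/\\\:\?\%\.\&\;=#\-\_\!\+\~\,]*))/is",
        "<a target=\"_blank\" href=\"http$3://$4\">$0</a>",
        $text
    );
}

echo linkWwwUrls('visit www.example.com today'), "\n";
echo linkWwwUrls('visit https://www.example.com/a today'), "\n";
echo linkWwwUrls('visit http://example.com today'), "\n"; // no "www.", left unchanged
```

Because the pattern requires "www.", a URL such as http://example.com is not turned into an anchor; the optional "s" captured in group 3 is what carries https through to the HREF.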
