
PHP - find all hyperlinks in a post, add target and rel=nofollow attribute

I need a way to read content posted by users, find any hyperlinks it contains, turn them into anchor tags, and add target and rel=nofollow attributes to all of those links.

I have come across some regex solutions like this:

 (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

But other questions on SO about the same problem strongly recommend NOT using regex and using PHP's DOMDocument instead.

Whichever way is best, I need to add attributes like those mentioned above in order to harden all external links on the website.

You might be interested in Goutte, which lets you define your own filters and so on.

First of all, the guidelines you mentioned advise against parsing HTML with regexes. As far as I understand, what you are trying to do is parse plain text from the user and convert it into HTML. For that purpose, regexes are usually just fine.

(Note that I assume you parse the text into links yourself and aren't using an external library for that. In the latter case you'd need to fix the HTML the library outputs, and for this you should use DOMDocument to iterate over all <a> tags and add the proper attributes to them.)
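For the library case described in the note above, a minimal sketch of the DOMDocument approach could look like the following. The helper name hardenAnchors, the target value _blank, and the wrapping <div> are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): post-process HTML that a
// linkifying library has already produced, adding attributes to every <a> tag.
function hardenAnchors(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make libxml read the input as UTF-8;
    // the <div> gives the fragment a single, known wrapper element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('target', '_blank'); // assumed target value
        $a->setAttribute('rel', 'nofollow');
    }

    // Serialize only the wrapper's children, so the <html> and <body>
    // elements that DOMDocument adds are left out of the result.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo hardenAnchors('<p>See <a href="http://example.com">this</a>.</p>'), PHP_EOL;
```

Note that setAttribute overwrites any rel or target value an anchor already has; if existing values need to be preserved, they would have to be merged by hand.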

Now, you can parse it in two ways: server side or client side.

Server side

Pros:

  • It outputs ready-to-use HTML.
  • It doesn't require users to enable JavaScript.

Cons:

  • You need to add the rel="nofollow" attribute so that bots don't follow the links.

Client side

Pros:

  • You don't need to add the rel="nofollow" attribute for bots, since they don't see the links in the first place: they're generated with JavaScript, and bots usually don't run JavaScript.

Cons:

  • Creating links that way requires users to enable JavaScript.
  • Implementing something like that in JavaScript can give the impression that the site is slow, especially if there is a lot of text to parse.
  • It makes caching the parsed text difficult.

I'll focus on implementing it server-side.

Server-side implementation

So, in order to parse links from user input and add any attributes you want to them, you can use something like this:

<?php
function replaceLinks($text)
{
    $regex = '/'
      . '(?<!\S)'
      . '(((ftp|https?)?:?)\/\/|www\.)'
      . '(\S+?)'
      . '(?=$|\s|[,]|\.\W|\.$)'
      . '/m';

    return preg_replace_callback($regex, function($match)
    {
        return '<a'
          . ' target=""'
          . ' rel="nofollow"'
          . ' href="' . $match[0] . '">'
          . $match[0]
          . '</a>';
    }, $text);
}

Explanation:

  • (?<!\S) : not preceded by a non-whitespace character.
  • (((ftp|https?)?:?)\/\/|www\.) : accept ftp:// , http:// , https:// , :// , // and www. as the beginning of URLs.
  • (\S+?) : match everything that is not whitespace, in non-greedy fashion.
  • (?=$|\s|[,]|\.\W|\.$) : every URL must be followed by either the end of a line, whitespace, a comma, a dot followed by a character other than \w (this is to allow .com , .co.jp etc. to match) or a dot followed by the end of a line.
  • m flag : multiline mode, so that $ also matches at the end of each line.
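To see how these pieces interact, here is a small sketch that reuses the same pattern with preg_match_all and lists what it picks up. The sample sentence is made up for illustration.

```php
<?php
// Same pattern as in replaceLinks() above, written on one line.
$regex = '/(?<!\S)(((ftp|https?)?:?)\/\/|www\.)(\S+?)(?=$|\s|[,]|\.\W|\.$)/m';

// Made-up sample: one URL followed by a comma, one followed by a
// sentence-ending dot, and one "www." that is glued to an e-mail address.
$text = 'Docs: http://example.com/a.b, mirror at www.example.co.jp. '
      . 'Mail user@www.example.com';

preg_match_all($regex, $text, $matches);

// $matches[0] holds the full matches:
//   http://example.com/a.b   (the trailing comma is left out by the lookahead)
//   www.example.co.jp        (the trailing dot is left out as well)
// The e-mail address is skipped because "www." is preceded by "@".
print_r($matches[0]);
```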

Testing

Now, to support my claim that it works, I added a few test cases:

$tests = [];
$tests []= ['http://example.com', '<a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= ['https://example.com', '<a target="" rel="nofollow" href="https://example.com">https://example.com</a>'];
$tests []= ['ftp://example.com', '<a target="" rel="nofollow" href="ftp://example.com">ftp://example.com</a>'];
$tests []= ['://example.com', '<a target="" rel="nofollow" href="://example.com">://example.com</a>'];
$tests []= ['//example.com', '<a target="" rel="nofollow" href="//example.com">//example.com</a>'];
$tests []= ['www.example.com', '<a target="" rel="nofollow" href="www.example.com">www.example.com</a>'];
$tests []= ['user@www.example.com', 'user@www.example.com'];
$tests []= ['testhttp://example.com', 'testhttp://example.com'];
$tests []= ['example.com', 'example.com'];
$tests []= [
    'test http://example.com',
    'test <a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= [
    'multiline' . PHP_EOL . 'blah http://example.com' . PHP_EOL . 'test',
    'multiline' . PHP_EOL . 'blah <a target="" rel="nofollow" href="http://example.com">http://example.com</a>' . PHP_EOL . 'test'];
$tests []= [
    'text //example.com/slashes.php?parameters#fragment, some other text',
    'text <a target="" rel="nofollow" href="//example.com/slashes.php?parameters#fragment">//example.com/slashes.php?parameters#fragment</a>, some other text'];
$tests []= [
    'text //example.com. new sentence',
    'text <a target="" rel="nofollow" href="//example.com">//example.com</a>. new sentence'];

Each test case is composed of two parts: source input and expected output. I used the following code to determine whether the function passes the tests above:

foreach ($tests as $test)
{
    list ($source, $expected) = $test;
    $actual = replaceLinks($source);
    if ($actual != $expected)
    {
        echo 'Test ' . $source . ' failed.' . PHP_EOL;
        echo 'Expected: ' . $expected . PHP_EOL;
        echo 'Actual:   ' . $actual . PHP_EOL;
        die;
    }
}
echo 'All tests passed' . PHP_EOL;

I think this gives you an idea of how to solve the problem. Feel free to add more tests and experiment with the regex itself to make it suitable for your specific needs.
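One direction such experiments could take, since the text comes from users: escape it before turning URLs into links, and give bare www. addresses a scheme so the browser does not treat them as relative paths. The sketch below is a hypothetical variant, not the function tested above; the name linkifyEscaped, the target value _blank, and the http:// prefix are all assumptions.

```php
<?php
// Hypothetical variant of replaceLinks(): escapes the user's text first,
// then wraps URLs in anchors.
function linkifyEscaped(string $text): string
{
    // Escape first so that markup typed by the user is shown as text.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Same pattern as in replaceLinks().
    $regex = '/(?<!\S)(((ftp|https?)?:?)\/\/|www\.)(\S+?)(?=$|\s|[,]|\.\W|\.$)/m';

    return preg_replace_callback($regex, function (array $match): string {
        $url = $match[0];
        // Assumption: a bare "www." address should be linked over http://.
        $href = strpos($url, 'www.') === 0 ? 'http://' . $url : $url;
        return '<a target="_blank" rel="nofollow" href="' . $href . '">'
             . $url . '</a>';
    }, $escaped);
}

echo linkifyEscaped('see www.example.com/?a=1&b=2 <b>now</b>'), PHP_EOL;
```

Because the escaping runs before the regex, an & inside a URL ends up as &amp; in both the href and the link text, which is how it should appear in HTML anyway.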

Get the content to post using jQuery and process it before posting it to PHP.

$('#idof_content').val(
  $('#idof_content').val().replace(/\b(http(s|):\/\/|)(www\.\S+)/ig,
    "<a href='http$2://$3' target='_blank' rel='nofollow'>$3</a>"));
