Finding banned words on a page, but not inside other words

Question

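I am trying to add a banned-word filter to a web proxy. I want to search the loaded page (meta tags and content) for banned words, but only where they appear as whole words, not where they occur inside other words.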

So if I am looking for the word "cock", the word "cockerel" should not trigger the filter.

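I have just tested this code and, as expected, it works, but as you can guess it burns a lot of CPU cycles looping through everything. The page loads one moment, and the next it greys out and shows signs of taking too long to load. And all of this is on localhost, so I can only imagine what my web host would do! We will have to come up with a better solution. Any ideas? How about not having the script check the loaded page against every banned word? As soon as one banned word is found, the script would stop and echo which banned word was found and where on the page (meta tags, body content, and so on). Any code suggestions?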

This is what I have so far:

<?php

/*
ERROR HANDLING
*/

// 1). $curl is going to be data type curl resource.
$curl = curl_init();

// 2). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

// 3). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo( $curl );

if( $response['http_code'] == '200' )
    {
        //Set banned words.
        $banned_words = array("Prick","Dick","***");

        //Separate each words found on the cURL fetched page.
        $word = explode(" ", $result);

       //var_dump($word);

        for ($i = 0; $i < count($word); $i++)
        {
            foreach ($banned_words as $ban)
            {
                if (strtolower($word[$i]) == strtolower($ban))
                {
                    echo "word: $word[$i]<br />";
                    echo "Match: $ban<br>";
                }
                else
                {
                    echo "word: $word[$i]<br />";
                    echo "No Match: $ban<br>";
                }
            }
        }
    }

// 4). Close cURL resource.
curl_close($curl);

I was told to do this:

Load the page into a string. Use preg_match with "word boundaries" on the loaded string and loop through your banned words.

Q1. How do I load the page into a string? I do not know where to begin, so any sample code would be appreciated by every newbie, myself included.

Update: I have updated my code to include miknik's code. It worked fine until I added this line before the cURL calls: $banned_words = array("Prick","Dick","***");

Here is the updated code:

<?php

/*
ERROR HANDLING
*/

// 1). Set banned words.
$banned_words = array("Prick","Dick","***");

// 2). $curl is going to be data type curl resource.
$curl = curl_init();

// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

// 4). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo( $curl );

if($response['http_code'] == '200' )
     {
          $regex = '/\b';      // The beginning of the regex string syntax
          $regex .= implode('\b|\b', $banned_words);      // Joins all the banned words to the string with correct regex syntax
          $regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
          $substitute = '****';
          $cleanresult = preg_replace($regex, $substitute, $result);
          echo $cleanresult;
     }

  curl_close($curl);

  ?>

Answer 1

You already have the page contents as a string: it is in $result.

preg_match will work, but what do you want to do when you find a match? preg_replace is better suited to filtering out banned words.

There is no need to explode the string into individual words; doing so only adds a lot of CPU overhead. Work on the $result variable as it is.

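So first build a regular expression string from your array of banned words. The basic syntax for matching each word is \bXXXX\b, where XXXX is your banned word. The \b at each end means the match must sit on a word boundary, so \bcock\b will match cock and cock! but not cockerel.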

$regex = '/\b';      // The beginning of the regex string syntax
$regex .= implode('\b|\b', $banned_words);      // joins all the banned words to the string with correct regex syntax
$regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
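One caveat about the pattern above: it inserts each banned word into the regex verbatim, so an entry made of regex metacharacters, such as the "***" in the question's array, is likely to produce an invalid pattern, in which case preg_replace returns null and nothing is echoed. The following is a minimal sketch of one way to guard against that, assuming each word is escaped with preg_quote first. The helper name buildBannedRegex is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// builds one case-insensitive pattern from a list of banned words.
function buildBannedRegex(array $bannedWords): string
{
    // preg_quote() escapes regex metacharacters such as * so that an entry
    // like "***" cannot break the pattern. '/' is the delimiter used below.
    $escaped = array_map(
        function ($word) {
            return preg_quote($word, '/');
        },
        $bannedWords
    );

    // Group the alternatives so one pair of \b anchors covers all of them.
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

$regex = buildBannedRegex(array('Prick', 'Dick'));
echo preg_replace($regex, '****', 'You prick! Not a prickly pear.');
```

Note that \b only marks a boundary between a word character and a non-word character, so an entry consisting purely of symbols will rarely match even once it is escaped. Such entries probably need separate handling.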

Now you can run a single operation on $result and get back a new string with every banned word censored. Set the value that will replace each banned word:

$substitute = '****';

Then perform the replacement:

$cleanresult = preg_replace($regex, $substitute, $result);

Assuming all three words are in $banned_words and $result = 'You are a cock! You prick! You are such a dick.';

echo $cleanresult returns: You are a ****! You ****! You are such a ****.
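The question also asks about stopping at the first banned word and reporting which word was found and where. A rough sketch of that variant follows, assuming preg_match with the PREG_OFFSET_CAPTURE flag is acceptable for this purpose. preg_match stops at the first match, and the flag adds the byte offset of the match within the string. The function name findFirstBannedWord is hypothetical, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the first banned word found in $text together with its byte
// offset, or null when no banned word is present.
function findFirstBannedWord(string $text, array $bannedWords): ?array
{
    // Escape each word so regex metacharacters cannot break the pattern.
    $escaped = array_map(
        function ($word) {
            return preg_quote($word, '/');
        },
        $bannedWords
    );
    $regex = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    // preg_match() stops scanning as soon as it finds one match.
    if (preg_match($regex, $text, $matches, PREG_OFFSET_CAPTURE) === 1) {
        return array(
            'word'   => $matches[0][0], // the text that matched
            'offset' => $matches[0][1], // byte position within $text
        );
    }

    return null;
}

$hit = findFirstBannedWord('You are such a dick.', array('Prick', 'Dick'));
if ($hit !== null) {
    echo "Found '{$hit['word']}' at byte offset {$hit['offset']}";
}
```

The offset is a position in the raw HTML string. Telling a meta tag apart from body content would still require parsing the markup, which this sketch does not attempt.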

Solution 1
0 Accepted 2017-10-04 01:22:53