
Finding Banned Words On A Page And Not Within Other Words

I am trying to add a banned-words filter to a web proxy. The filter should match banned words only as whole words in a loaded page (meta tags and content), not as substrings of other words.

And so, if I am looking for the word "cock", then the word "cockerel" should not trigger the filter.

I just tested the code below and, as expected, it works, but it burns a lot of CPU. One moment the page loads; the next it goes grey and the browser signals that the page is taking too long to load. And all this on localhost, so I can imagine what my web host would do! We need a better solution. Any ideas? How about not checking the loaded page for every banned word, and instead halting as soon as one banned word is found, echoing which word was found and where on the page (meta tags, body content, etc.)? Any code suggestions?

Here is what I got so far:

<?php

/*
ERROR HANDLING
*/

// 1). $curl is going to be data type curl resource.
$curl = curl_init();

// 2). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

// 3). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo( $curl );

if( $response['http_code'] == '200' )
{
    // Set banned words.
    $banned_words = array("Prick","Dick","***");

    // Split the cURL-fetched page into words on spaces.
    $word = explode(" ", $result);

    //var_dump($word);

    // Use < rather than <= so the loop stays inside the array bounds.
    for($i = 0; $i < count($word); $i++)
    {
        foreach ($banned_words as $ban)
        {
            if (strtolower($word[$i]) == strtolower($ban))
            {
                echo "word: $word[$i]<br />";
                echo "Match: $ban<br>";
            }
            else
            {
                echo "word: $word[$i]<br />";
                echo "No Match: $ban<br>";
            }
        }
    }
}

// 4). Close cURL resource.
curl_close($curl);

I am told to do it like this:

Load the page into a string. Use preg_match with "word boundaries" on the loaded string and loop through your banned words.

Q1: How do I load the page into a string? I have no clue how to start on this, so any sample code would be appreciated by newbies like me.

UPDATE: I updated my code to include miknik's code. It was working fine until I added this line before the cURL call: $banned_words = array("Prick","Dick","***");

Here's the update:

<?php

/*
ERROR HANDLING
*/

// 1). Set banned words.
$banned_words = array("Prick","Dick","***");

// 2). $curl is going to be data type curl resource.
$curl = curl_init();

// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true );

// 4). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo( $curl );

if($response['http_code'] == '200' )
     {
          $regex = '/\b';      // The beginning of the regex string syntax
          $regex .= implode('\b|\b', $banned_words);      // Joins all the banned words to the string with correct regex syntax
          $regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
          $substitute = '****';
          $cleanresult = preg_replace($regex, $substitute, $result);
          echo $cleanresult;
     }

  curl_close($curl);

  ?>

You already have the page content as a string: it's in $result.

preg_match will work, but what do you want to do once you find a match? preg_replace is more appropriate if you want to filter out the banned words.
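If the goal is only to detect a banned word and stop, as the question suggests, preg_match fits, since it returns after the first match. Here is a minimal sketch under a few assumptions: find_first_banned_word is a hypothetical helper name, the sample text is made up, and the preg_quote call is an addition that keeps characters such as * from being read as regex syntax.

```php
<?php
// Hypothetical helper (not from the original post): returns the first banned
// word found in $text together with its byte offset, or null if none is found.
function find_first_banned_word(string $text, array $banned_words): ?array
{
    if (count($banned_words) === 0) {
        return null; // Nothing to look for.
    }

    // Escape each word so regex metacharacters are matched literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $banned_words);

    $regex = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    // preg_match stops at the first match; PREG_OFFSET_CAPTURE adds its offset.
    if (preg_match($regex, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        return ['word' => $m[0][0], 'offset' => $m[0][1]];
    }
    return null;
}

// Usage with made-up sample text: "cockerel" does not trigger the filter.
$hit = find_first_banned_word('A cockerel crowed. You prick!', ['cock', 'prick']);
if ($hit !== null) {
    echo "Banned word '{$hit['word']}' found at byte offset {$hit['offset']}\n";
}
```

The offset is a position in the raw string, so it indicates roughly where in the fetched HTML the word sits.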

There is no need to explode the string into individual words; doing so just adds a lot of CPU overhead. Process the $result variable as is.

So first off, construct a regex string from your array of banned words. A basic syntax for matching each word is \bXXXX\b, where XXXX is your banned word. The \b at each end means the match must sit at a word boundary, so \bcock\b would match cock and cock! but not cockerel.
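To see the boundary behaviour in isolation before building the full pattern, here is a small sketch. The sample strings are made up for illustration; "peacock" is added to show that a boundary is also required at the start of the word.

```php
<?php
// Made-up sample strings to probe the word-boundary pattern.
$samples = ['cock', 'cock!', 'cockerel', 'peacock'];
$results = [];

foreach ($samples as $sample) {
    // \b requires a transition between a word character and a non-word
    // character (or the start/end of the string) on both sides.
    $results[$sample] = preg_match('/\bcock\b/i', $sample) === 1;
    echo $sample . ': ' . ($results[$sample] ? 'match' : 'no match') . "\n";
}
```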

$regex = '/\b';      // The beginning of the regex string syntax
$regex .= implode('\b|\b', $banned_words);      // joins all the banned words to the string with correct regex syntax
$regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
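One caveat with joining the words directly: an entry such as "***" consists entirely of regex metacharacters, so the resulting pattern will not compile. That is a plausible reason the updated code in the question stopped working, although the question does not show the error. Here is a hedged variant that escapes each word with preg_quote first; build_banned_regex is a hypothetical helper name.

```php
<?php
// Hypothetical helper (not from the original answer): builds one
// case-insensitive pattern from the word list, escaping each entry.
function build_banned_regex(array $banned_words): string
{
    $escaped = array_map(function ($word) {
        // The second argument also escapes the / delimiter used below.
        return preg_quote($word, '/');
    }, $banned_words);

    // A single group with \b on each side is equivalent to \bA\b|\bB\b|...
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

$regex = build_banned_regex(['Prick', 'Dick', '***']);
echo $regex . "\n"; // /\b(?:Prick|Dick|\*\*\*)\b/i
```

Note that an entry made only of punctuation, like "***", will rarely match even when escaped, because \b needs a word character on one side of each boundary.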

Now you can run a single operation on $result and get a new string with all the banned words censored. First set the value to be substituted for each banned word:

$substitute = '****';

Then perform the replacement:

$cleanresult = preg_replace($regex, $substitute, $result);

Assuming $result = 'You are a cock! You prick! You are such a dick.'; and that all three words are in $banned_words,

echo $cleanresult prints You are a ****! You ****! You are such a ****.
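The question also asks where on the page a banned word was found. Here is a rough sketch of one way to report that, not a robust HTML parser. It assumes everything before the <body tag counts as the head (title and meta tags) and uses strip_tags on the rest; locate_banned_word and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: returns the section ('head' or 'body') and the first
// banned word found there, or null if the page is clean.
function locate_banned_word(string $html, array $banned_words): ?array
{
    if (count($banned_words) === 0) {
        return null;
    }
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $banned_words);
    $regex = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    // Rough split: raw markup before <body is the head; body text has tags removed.
    $pos = stripos($html, '<body');
    $sections = [
        'head' => $pos === false ? $html : substr($html, 0, $pos),
        'body' => $pos === false ? '' : strip_tags(substr($html, $pos)),
    ];

    foreach ($sections as $name => $text) {
        if (preg_match($regex, $text, $m) === 1) {
            return ['section' => $name, 'word' => $m[0]]; // Halt at the first hit.
        }
    }
    return null;
}

// Made-up page: "cockerel" in the meta tag is ignored, "prick" in the body is not.
$page = '<html><head><meta name="keywords" content="farm, cockerel"></head>'
      . '<body><p>You prick!</p></body></html>';
$found = locate_banned_word($page, ['cock', 'prick']);
if ($found !== null) {
    echo "Found '{$found['word']}' in the {$found['section']}\n";
}
```

Because the head is searched as raw markup, attribute values such as meta descriptions and keywords are covered without any extra parsing.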
