
Count phrases in a string in PHP

I'm currently grabbing an HTML page and counting the single words on the page using:

$page_content = file_get_html($url)->plaintext;

$word_array = array_count_values(str_word_count(strip_tags(strtolower($page_content)), 1));

Which works great for counting single words.

But I'm trying to count phrases of up to about 3 words.

For example:

$string = 'the best stack post';

The count would return:

the = 1
best = 1
stack = 1
post = 1

I need phrases to be pulled out of the string, so the three-word phrases from that string would be:

the best stack = 1
best stack post = 1

I hope that makes sense!

I've searched but cannot find any way to do this in PHP.

Any ideas?

What I would do is get the page content and remove the HTML tags, then explode the text on the typical sentence separator, which is the dot (.). That gives you an array of individual phrases, and you can count the words in each one:

$page_content = file_get_html($url)->plaintext;
$text = strip_tags(strtolower($page_content));

$phrases = explode(".", $text);

$count = 0;
foreach ($phrases as $phrase) {
    if (str_word_count($phrase) >= 3) {
        $count++;
    }
}
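
To see what this loop produces, it can be run on a literal string instead of a fetched page. This is only a sketch: the sample text is made up, and `$text` stands in for the stripped, lower-cased page content. As written, the loop yields the number of sentences that contain three or more words rather than a tally per phrase.

```php
<?php
// Made-up sample text standing in for the fetched page content.
$text = strtolower('The best stack post. Short one. Another fairly long sentence here.');

$count = 0;
foreach (explode('.', $text) as $phrase) {
    // With no format argument, str_word_count() returns the number of words.
    if (str_word_count($phrase) >= 3) {
        $count++;
    }
}

echo $count, PHP_EOL; // prints 2: only two of the sentences have three or more words
```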

There are two steps to this solution.

  1. A function gets all three-word phrases from a string (ignoring any full stops).
  2. The main code applies that function to every sentence (terminated by a . ).

Here's the code:

function threeWords($string) {
    // Split on non-word characters. Probably not ideal, since it will count
    // "non-hyphenated" as 2 words.
    $words = array_values(array_filter(preg_split("!\W!", $string)));
    if (count($words) < 3) { return []; }
    $phrases = [];
    for ($i = 2; $i < count($words); $i++) {
        $phrases[] = $words[$i-2]." ".$words[$i-1]." ".$words[$i];
    }
    return $phrases;
}

$page_content = file_get_html($url)->plaintext;
$text = strip_tags(strtolower($page_content));
$sentences = explode(".",$text);
$phrases = [];
foreach ($sentences as $sentence) {
   $phrases = array_merge($phrases,threeWords(trim($sentence)));
}
$count = array_count_values($phrases);
print_r($count);

// Result array: phrase => number of occurrences
$phrases = [];

// Split the string into sentences on the appropriate punctuation marks
// and loop over the sentences
foreach (preg_split('/[?.!]/', $string) as $sentence) {

    // split the sentences into words (remove any empty strings with array_filter)
    $words = array_filter(explode(' ', $sentence));

    // take the first set of three words from the sentence, then remove the first word,
    // until the sentence is gone.
    while ($words) {
        $phrase = array_slice($words, 0, 3);

        // check that the phrase is the correct length            
        if (count($phrase) == 3) {

            // convert it back to a string
            $phrase = implode(' ', $phrase);

            // increment the count for that phrase in your result
            if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
            $phrases[$phrase]++;
        }
        array_shift($words);
    }
}
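
Since the question asks for phrases of up to about three words, the same sliding-window idea can be extended to collect every phrase of one, two, or three words in a single pass. The following is a minimal sketch under some assumptions: `countPhrases` and its parameters `$minWords` and `$maxWords` are hypothetical names, not from any library, and words are split on whitespace only, so punctuation such as commas stays attached to the word next to it.

```php
<?php
// Hypothetical helper: counts every phrase of $minWords to $maxWords
// consecutive words, never crossing a sentence boundary.
function countPhrases(string $text, int $minWords = 1, int $maxWords = 3): array
{
    $counts = [];

    // Split into sentences on ".", "?" and "!".
    foreach (preg_split('/[?.!]/', strtolower($text)) as $sentence) {
        // Split on runs of whitespace and drop empty strings.
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
        $total = count($words);

        // For each starting word, take every allowed phrase length that fits.
        for ($start = 0; $start < $total; $start++) {
            for ($len = $minWords; $len <= $maxWords && $start + $len <= $total; $len++) {
                $phrase = implode(' ', array_slice($words, $start, $len));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }

    return $counts;
}

$counts = countPhrases('the best stack post');
print_r($counts);
```

For the example string from the question this produces nine entries, each with a count of 1, including "the best stack" and "best stack post". Passing `3, 3` as the last two arguments limits the result to three-word phrases only.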
