
How can I find the most used two-word combinations in a block of text?

How can I find out which two words I most often use right after each other in a block of text? In other words, is there a tool (online or offline) or some code where I can paste text and get back the frequency of my most used two-word pairs, like this:

From most used to least:

"the cat" 2.9%
"she said" 1.8%
"went to" 1.2%

Thanks

  1. Chunk up the text into two-word pairs (substr and strpos will help you):

    • Use strpos to find the index of the second space, then take the substring between the beginning and that index to get the first two-word pair. Move the start to just past the first space and repeat, so that consecutive pairs overlap.
  2. Add each pair to a map, with the pair as the key and its count as the value: set the value to 1 the first time you see the pair, and increment it if the key already exists in the map.
  3. Once you have parsed the full text, calculate each pair's percentage by dividing its value by the total number of pairs counted (the sum of all the values in the map).
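
The steps above could be sketched in PHP roughly as follows. This is only a sketch under a few assumptions: the function names countPairsWithStrpos and pairPercentages are made up for illustration, the text is lowercased so that "The cat" and "the cat" count as the same pair, punctuation is treated as a word separator, and the input is plain ASCII text.

```php
<?php
// Hypothetical helper (not part of the answer above): counts overlapping
// two-word pairs using strpos and substr.
function countPairsWithStrpos(string $text): array
{
    // Assumption: lowercase everything and turn each run of non-word
    // characters into a single space, so words are separated by one space.
    $text = trim(preg_replace('/\W+/', ' ', strtolower($text)));
    $counts = array();
    $start = 0;
    while (true) {
        $firstSpace = strpos($text, ' ', $start);
        if ($firstSpace === false) {
            break; // only one word left, so there is no further pair
        }
        // The pair ends at the second space, or at the end of the text.
        $secondSpace = strpos($text, ' ', $firstSpace + 1);
        $end = ($secondSpace === false) ? strlen($text) : $secondSpace;
        $pair = substr($text, $start, $end - $start);
        // Step 2: the pair is the key, the value is how often it was seen.
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        // Slide forward by one word so that consecutive pairs overlap.
        $start = $firstSpace + 1;
    }
    return $counts;
}

// Step 3: each pair's share of all pairs counted, as a percentage.
function pairPercentages(array $counts): array
{
    $total = array_sum($counts);
    $percentages = array();
    foreach ($counts as $pair => $count) {
        $percentages[$pair] = round($count / $total * 100, 1);
    }
    return $percentages;
}

print_r(pairPercentages(countPairsWithStrpos('The cat sat on the cat.')));
```

Sliding the start forward by one word (rather than two) is a design choice: it means "the cat sat" yields both "the cat" and "cat sat", which is usually what you want when looking for words used right after each other.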

This was fun, so I had a little go at it. It should get you started, but it should not be your final answer.

This basically groups the words into twos, indexes them in an array, increments a count each time a pair is found, and finally converts the counts into percentages :)

$data = 'In the first centuries of typesetting, quotations were distinguished merely by indicating the speaker, and this can still be seen in some editions of the Bible. During the Renaissance, quotations were distinguished by setting in a typeface contrasting with the main body text (often Italic type with roman, or the other way round). Block quotations were set this way at full size and full measure.
Quotation marks were first cut in type during the middle of the sixteenth century, and were used copiously by some printers by the seventeenth. In Baroque and Romantic-period books, they could be repeated at the beginning of every line of a long quotation. When this practice was abandoned, the empty margin remained, leaving an indented block quotation';

//Clean the data: replace every character that is not a word character with a space
$data = preg_replace("/[^\w]/"," ",$data);

$segments = explode(" ",$data);
$indexes = array();

//Start at 1 so that every segment has a previous segment to pair with
for($i=1;$i<count($segments);$i++)
{

   if(trim($segments[$i - 1]) != "" && trim($segments[$i]) != "")
   {
      $key = trim($segments[$i - 1]) . " " . trim($segments[$i]);
      if(array_key_exists($key,$indexes))
      {
          $indexes[$key]["count"]++;
      }else
      {
          $indexes[$key] = array(
              'count' => 1,
              'words' => $key
          );
      }
   }
}

//Change to the percentage of all pairs actually counted:
$total_double_words = 0;
foreach($indexes as $set)
{
    $total_double_words += $set['count'];
}
foreach($indexes as $id => $set)
{
    $indexes[$id]['percentage'] = number_format((($set['count']/ $total_double_words) * 100),2) . "%";
}

var_dump($indexes);

You can see it live here: http://codepad.org/rcwpddW8
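
To get output in the shape the question asks for (most used first, one pair per line), something like the following might work. Treat it as a sketch rather than a finished tool: rankWordPairs and its $limit parameter are hypothetical names, words are lowercased before counting, and, unlike the code above, words on either side of punctuation are still paired with each other.

```php
<?php
// Hypothetical helper: ranks two-word pairs from most used to least used.
function rankWordPairs(string $text, int $limit = 10): array
{
    // Assumption: lowercase the text and split on runs of non-word
    // characters, dropping empty pieces.
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    for ($i = 1; $i < count($words); $i++) {
        $pair = $words[$i - 1] . ' ' . $words[$i];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    arsort($counts);              // highest count first
    $total = array_sum($counts);  // total number of pairs counted
    $ranked = array();
    foreach (array_slice($counts, 0, $limit, true) as $pair => $count) {
        $ranked[] = sprintf('"%s" %.1f%%', $pair, $count / $total * 100);
    }
    return $ranked;
}

echo implode("\n", rankWordPairs('The cat sat. The cat ran to the cat.', 3)), "\n";
```

For that sample sentence the first line printed is `"the cat" 37.5%`, since "the cat" makes up 3 of the 8 pairs; the order of the remaining, equally frequent pairs is not guaranteed.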
