
Get the count of unique words from all .txt files in a directory

I have a directory of text files. I want to loop through each of the text files in the directory and get the overall count of unique words (the vocabulary size), not for each individual file, but for ALL the files together. In other words, I want the number of unique words within all the files together, and NOT the number of unique words for each individual file.

For example, I have three text files in a directory. Here are their contents:

file1.txt -> here is some text.

file2.txt -> here is more text.

file3.txt -> even more text.

So the count of unique words for this directory of text files is 6.

I have tried to use this code:

$files = glob("C:\\wamp\\dir");

$out = fopen("mergedFiles.txt", "w");

foreach($files as $file){
    $in = fopen($file, "r");
    while ($line = fread($in)){
        fwrite($out, $line);
    }
    fclose($in);
}

fclose($out);

to merge all the text files; I then planned to use array_unique() on mergedFiles.txt. However, the code is not working.

How can I get the unique word count of all the text files in the directory in the best way possible?

You can try this:

$allWords = array();

foreach (glob("*.txt") as $filename) // loop on each file
{
    $contents = file_get_contents($filename); // Get file contents
    $words = explode(' ', $contents); // Make an array with words

    if ( $words )
        $allWords = array_merge($allWords, $words); // combine global words array and file words array
}

var_dump(count(array_unique($allWords)));

EDIT: Another version, which:

  • removes trailing dots
  • collapses multiple spaces
  • still separates words when the space between the end of one sentence and the start of the next is missing

function removeDot($string) {
    return rtrim($string, '.');
}

$words = explode(' ', preg_replace('#\.([a-zA-Z])#', '. $1', preg_replace('/\s+/', ' ',$contents)));
$words = array_map("removeDot", $words);
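
The fragment above relies on $contents from the earlier loop. As a minimal sketch of how the two pieces might fit together, the following wraps the same steps in two helpers. The names extractWords() and countUniqueWordsInDir() are hypothetical, not from the original answer. It also stores words as array keys instead of merging arrays, which is a swapped-in technique to avoid holding every duplicate in memory. It assumes PHP 7.4+ for arrow functions and, like the original, is case-sensitive.

```php
<?php
// Hypothetical helper: turn one file's text into a list of words,
// following the three steps listed above.
function extractWords(string $contents): array
{
    // Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    $text = preg_replace('/\s+/', ' ', $contents);
    // Insert the missing space when a dot is directly followed by a letter.
    $text = preg_replace('#\.([a-zA-Z])#', '. $1', $text);
    // Split on spaces and strip trailing dots from each piece.
    $words = array_map(fn($w) => rtrim($w, '.'), explode(' ', trim($text)));
    // Drop empty strings (for example, from an empty file).
    return array_values(array_filter($words, fn($w) => $w !== ''));
}

// Hypothetical helper: count distinct words across all .txt files in $dir.
function countUniqueWordsInDir(string $dir): int
{
    $seen = []; // words are stored as array keys, so duplicates collapse
    foreach (glob(rtrim($dir, '/\\') . '/*.txt') ?: [] as $filename) {
        $contents = file_get_contents($filename);
        if ($contents === false) {
            continue; // skip files that cannot be read
        }
        foreach (extractWords($contents) as $word) {
            $seen[$word] = true;
        }
    }
    return count($seen);
}
```

With the directory from the question, a call might look like `echo countUniqueWordsInDir("C:\\wamp\\dir");`.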

Unless you have legitimate reasons not to simply concatenate the files and process their content as one string, use this snippet to target .txt files in a directory, join their texts, make the text lowercase, isolate words, remove duplicates, then count unique words:

Code (not fully tested on a filesystem):

echo count(
    array_unique(
        str_word_count(
            strtolower(
                implode(
                    ' ',
                    array_map(
                        'file_get_contents',
                        glob("*.txt")
                    )
                )
            ),
            1
        )
    )
);

Assuming these texts from the files:

[
    'here is some text.',
    'here is more text.',
    'even more text.'
]

The output is 6, from a unique array of:

array (
  0 => 'here',
  1 => 'is',
  2 => 'some',
  3 => 'text',
  6 => 'more',
  8 => 'even',
)

Modify the snippet as needed: perhaps use a different technique/algorithm to identify "words", or use mb_strtolower(), or don't use strtolower() at all.
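
As one possible variation along those lines, here is a sketch that swaps str_word_count() for a Unicode-aware regular expression and lowercases with mb_strtolower() when the mbstring extension is available. The helper name uniqueWordsUnicode() and the choice to treat runs of letters, digits, and apostrophes as words are assumptions made for illustration; adjust the pattern to whatever definition of "word" you need.

```php
<?php
// Hypothetical helper: list the distinct words in an array of texts,
// treating any run of Unicode letters, digits, or apostrophes as a word.
function uniqueWordsUnicode(array $texts): array
{
    $joined = implode(' ', $texts);
    // mb_strtolower() needs the mbstring extension; fall back if it is absent.
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($joined, 'UTF-8')
        : strtolower($joined);
    // The /u modifier makes the pattern and the subject UTF-8 aware.
    preg_match_all('/[\p{L}\p{N}\']+/u', $lower, $matches);
    // Re-index so the result is a plain list of unique words.
    return array_values(array_unique($matches[0]));
}

$texts = ['here is some text.', 'here is more text.', 'even more text.'];
echo count(uniqueWordsUnicode($texts)); // prints 6
```

To read the texts from disk instead, `array_map('file_get_contents', glob("*.txt"))` can be passed as the argument, as in the snippet above.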
