
PHP: how to get all files (only HTML files) in all subdirectories and index each HTML page

For a homework assignment, I have to find all the .htm and .html files in the current directory and all of its subdirectories, and index each file by counting the words that appear in it.

Here is how I would count the words in a file once I find an HTML file in a directory:

$file = '.html';
$index = indexer($file);
echo '<pre>'.print_r($index,true).'</pre>';

function indexer($file) {
    $index = array();
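    // str_replace() matches literal strings, not regex patterns, so use the
    // actual control characters rather than '/\r/', '/\n/' and '/\t/'.
    $find = array("\r", "\n", "\t", '!', ',', '.', '"', ';', ':');
    $replace = ' ';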
    $string = file_get_contents($file);
    $string = strip_tags($string);
    $string = strtolower($string);
    $string = str_replace($find, $replace, $string);
    $string = trim($string);
    $string = explode(' ', $string);
    natcasesort($string);
    $i = 0;
    foreach($string as $word) {
        $word = trim($word);
        $ignore = preg_match('/[^a-zA-Z]/', $word);
        if($ignore == 1) {
            $word = '';
        }
        if( (!empty($word)) && ($word != '') ) {
            if(!isset($index[$i]['word'])) {
                $index[$i]['word'] = $word;
                $index[$i]['count'] = 1;
            } elseif( $index[$i]['word'] == $word ) {
                $index[$i]['count'] += 1;
            } else {
                $i++;
                $index[$i]['word'] = $word;
                $index[$i]['count'] = 1;
            }
        }
    }
    unset($word);
    return($index);
}

I just need to figure out first how to find all the htm or html files in the directories and then start using the above code on each htm/html file. Any help will be appreciated, thanks!

Well, because this is a homework assignment, I won't give you the code. But I can point you in the right direction. Usually for this type of thing, people will use a recursive function, that is, a function that calls itself.

This function should do the following:

  • Count the words in all the .htm and .html files in the current directory.
  • Add these counts up and add the sum to a global variable outside the function (just use global; you could instead return the count from each call and add them up, but that is more of a pain).
  • Call the function again for every folder in the current directory (just loop through them).
  • Once you are back at the very start, read the global variable's value, reset the variable, and return that value.
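For reference, here is a rough sketch of the recursive idea. It uses the return-value variant mentioned in the list instead of a global, and it counts words because that is what the assignment asks for. The function name count_html_words is made up for illustration, and the counting is deliberately crude (strip_tags plus str_word_count), so treat it as a starting point rather than a finished indexer.

```php
<?php
// Hypothetical helper (not from the original posts): returns the total
// number of words found in all .htm/.html files under $dir.
function count_html_words($dir) {
    $total = 0;

    $entries = scandir($dir);
    if ($entries === false) {
        return 0; // directory could not be read
    }

    foreach ($entries as $entry) {
        // Skip the "." and ".." entries, otherwise the recursion never ends.
        if ($entry === '.' || $entry === '..') {
            continue;
        }

        $path = $dir . DIRECTORY_SEPARATOR . $entry;

        if (is_dir($path)) {
            // The function calls itself for every subdirectory.
            $total += count_html_words($path);
        } elseif (preg_match('/\.html?$/i', $entry)) {
            // Crude count: drop the markup, then count the remaining words.
            $total += str_word_count(strip_tags(file_get_contents($path)));
        }
    }

    return $total;
}
```

Calling count_html_words(getcwd()) would start from the current working directory. Note that is_dir() follows symbolic links, so a link that points back up the tree could make this loop forever.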

RecursiveDirectoryIterator is probably the best-suited class in PHP for this. It's flexible and fast.

Other, non-recursive methods are described in "Directory to array with PHP". In my answer to that question, I timed the different methods given in the other answers, and all of the solutions written in plain PHP code were slower than PHP's SPL classes.

Try using the glob() function.

$files = glob('*.htm*');
foreach($files as $file) {
//code here
}

Edited:

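// Renamed from readDir(): PHP function names are case-insensitive, so that
// name would collide with the built-in readdir().
function find_html_files($path) {
  $html_files = array();

  // '*' instead of '*.*', so that directories without a dot are matched too.
  $entries = glob($path . '*');
  if ($entries === false) {
    return $html_files;
  }

  foreach ($entries as $file) {
    if (is_dir($file)) {
      // Recurse into the subdirectory and collect its matches.
      $html_files = array_merge($html_files, find_html_files($file . '/'));
    } elseif (in_array(strtolower(pathinfo($file, PATHINFO_EXTENSION)), array('html', 'htm'))) {
      $html_files[] = $file;
    }
  }

  return $html_files;
}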

I just edited the answer; try this. (Note: I haven't tested the code.) Thanks

Here's an alternative using RecursiveIteratorIterator, RecursiveDirectoryIterator and pathinfo().

<?php

$dir = '/';

$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir), RecursiveIteratorIterator::CHILD_FIRST);

foreach ( $iterator as $path )
  if ( $path->isFile() && preg_match('/^html?$/i', pathinfo($path->getFilename(), PATHINFO_EXTENSION)) )
    echo $path->getPathname() . PHP_EOL;

If you need to get the current working directory, you can use getcwd() (i.e. $dir = getcwd();).

To get the length of the content, you can do a few things. You could retrieve the contents of the file using file_get_contents and use strlen to calculate the length or str_word_count to count the words. Another option could be to use $path->getSize() .
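To make the difference between these measurements concrete, here is a small sketch that collects all three for one file. The helper name measure_html_file and the array keys are made up for this example. It uses filesize(), which for a plain path string should report the same number as $path->getSize() does on the iterator item, and it strips the markup first (as the question's code does), which is my assumption about what "length of the content" should mean.

```php
<?php
// Hypothetical helper (not from the original answer): gathers the
// measurements discussed above for a single file.
function measure_html_file($path) {
    $contents = file_get_contents($path);
    if ($contents === false) {
        return null; // file could not be read
    }

    // Assumption: only the visible text matters, so remove the tags first.
    $text = strip_tags($contents);

    return array(
        'bytes_on_disk' => filesize($path),       // size of the raw file
        'text_length'   => strlen($text),         // bytes of text without tags
        'word_count'    => str_word_count($text), // number of words in the text
    );
}
```

The file size includes all the markup, so it only works as a rough proxy; for the word index, the word count is the number that matters.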

If you use an array to store the names and the sizes, you can then use a custom function and uasort to sort the array by sizes.

A more complete example:

<?php

function sort_by_size($a, $b)
{
  if ( $a['size'] == $b['size'] )
    return 0;

  return ( $a['size'] < $b['size'] ? -1 : 1 );
}

$dir = '/';
$files = array();

$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir), RecursiveIteratorIterator::CHILD_FIRST);

foreach ( $iterator as $path )
  if ( $path->isFile() && preg_match('/^html?$/i', pathinfo($path->getFilename(), PATHINFO_EXTENSION)) )
    $files[] = array(
      'name' => $path->getPathname(),
      'size' => $path->getSize()
    );

// Pass the callback name as a string; a bare sort_by_size is an undefined constant.
uasort($files, 'sort_by_size');

The $files array can then be looped through using a foreach loop. It will contain both the pathname and the size.

Do you have any restrictions on the functions/classes you can use? If not, then check out RecursiveDirectoryIterator. It lets you go through directories recursively, iterating over all the items in each one. You could then check the extension of each item and, if it matches, do your counting.

An alternative approach would be to use glob while iterating over the directories, which allows you to do a *.html search like you would with the *nix utility find.
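A rough sketch of that glob approach might look like the following. The function name glob_html_files is made up for illustration. It calls glob once per extension instead of relying on the GLOB_BRACE flag, which is not available on every platform.

```php
<?php
// Hypothetical helper (not from the original answer): collects .html and
// .htm files under $dir by running glob() in every directory.
function glob_html_files($dir) {
    $dir = rtrim($dir, '/');

    // glob() can return false on error, so fall back to an empty array.
    $found = array_merge(
        glob($dir . '/*.html') ?: array(),
        glob($dir . '/*.htm') ?: array()
    );

    // GLOB_ONLYDIR restricts the matches to directories.
    $subdirs = glob($dir . '/*', GLOB_ONLYDIR) ?: array();
    foreach ($subdirs as $subdir) {
        $found = array_merge($found, glob_html_files($subdir));
    }

    return $found;
}
```

A few caveats: the pattern is case-sensitive on most systems, so a file named INDEX.HTML would be missed; '*' does not match hidden (dot-prefixed) directories; and directory names containing glob metacharacters such as [ or ] would be read as patterns.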

As far as counting goes, you might want to take a look at str_word_count.
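For example, str_word_count with format 1 returns the words as an array, and array_count_values then turns that into a word => count index. This is only a sketch: the function name word_frequencies is made up, and the strip_tags/strtolower steps are borrowed from the question's own code.

```php
<?php
// Hypothetical helper (not from the original answer): builds a
// word => count index from a string of HTML.
function word_frequencies($html) {
    // Remove the markup and normalise case so "The" and "the" count together.
    $text = strtolower(strip_tags($html));

    // Format 1 makes str_word_count() return an array of the words found.
    $words = str_word_count($text, 1);

    // Count how many times each word occurs, then sort alphabetically.
    $counts = array_count_values($words);
    ksort($counts);

    return $counts;
}
```

One thing to watch: strip_tags removes tags without inserting spaces, so text from adjacent elements such as <p>a</p><p>b</p> gets glued into one word. Also, str_word_count treats apostrophes and hyphens as part of a word.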
