For a homework assignment, I have to find all the .htm and .html files in the current directory and all of its subdirectories, and I have to index each file individually by counting the words that appear in it.
Here is how I would count the words once I have found an HTML file in a directory:
$file = '.html';
$index = indexer($file);
echo '<pre>'.print_r($index,true).'</pre>';
function indexer($file) {
$index = array();
// str_replace() matches literal strings, not regex patterns, so use the real control characters
$find = array("\r", "\n", "\t", '!', ',', '.', '"', ';', ':');
$replace = array(' ',' ',' ',' ',' ',' ',' ',' ',' ');
$string = file_get_contents($file);
$string = strip_tags($string);
$string = strtolower($string);
$string = str_replace($find, $replace, $string);
$string = trim($string);
$string = explode(' ', $string);
natcasesort($string);
$i = 0;
foreach($string as $word) {
$word = trim($word);
$ignore = preg_match('/[^a-zA-Z]/', $word);
if($ignore == 1) {
$word = '';
}
if( (!empty($word)) && ($word != '') ) {
if(!isset($index[$i]['word'])) {
$index[$i]['word'] = $word;
$index[$i]['count'] = 1;
} elseif( $index[$i]['word'] == $word ) {
$index[$i]['count'] += 1;
} else {
$i++;
$index[$i]['word'] = $word;
$index[$i]['count'] = 1;
}
}
}
unset($word);
return($index);
}
I just need to figure out first how to find all the htm or html files in the directories and then start using the above code on each htm/html file. Any help will be appreciated, thanks!
Well, because this is a homework assignment, I won't give you the code, but I can point you in the right direction. Usually, for this type of thing, people will use a recursive function, that is, a function that calls itself.
This function should do the following:
The RecursiveDirectoryIterator is the best class in PHP to do this. It's flexible and fast.
Alternative, non-recursive methods are described in "Directory to array with PHP". In my answer to that question, I timed the different methods given in the other answers, and all of the solutions written in plain PHP code were slower than PHP's SPL classes.
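As a rough illustration of the SPL approach mentioned above, a minimal sketch might look like the following. The function name find_html_files is made up for this example, and passing FilesystemIterator::SKIP_DOTS is an assumption about what is wanted here (it skips the . and .. entries).

```php
<?php
// Hypothetical helper (not from the original answer):
// collect the paths of all .htm/.html files below $dir.
function find_html_files($dir)
{
    $found = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $info) {
        // $info is an SplFileInfo; getExtension() returns the part after the last dot.
        if ($info->isFile() && preg_match('/^html?$/i', $info->getExtension())) {
            $found[] = $info->getPathname();
        }
    }
    sort($found); // give the caller a predictable order
    return $found;
}
```

Calling find_html_files(getcwd()) would then return the matches below the current working directory, ready to be passed one by one to an indexing function.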
Try using the glob() function.
$files = glob('*.htm*');
foreach($files as $file) {
//code here
}
Edited:
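// Renamed from readDir(): PHP function names are case-insensitive,
// so readDir() would collide with the built-in readdir().
function findHtmlFiles($path) {
    $html_files = array();
    // '*' rather than '*.*', so directories without a dot in their name are matched too
    foreach (glob($path . '*') ?: array() as $file) {
        if (is_dir($file)) {
            // Recurse into the subdirectory and merge its results
            $html_files = array_merge($html_files, findHtmlFiles($file . '/'));
        } elseif (in_array(strtolower(pathinfo($file, PATHINFO_EXTENSION)), array('html', 'htm'))) {
            $html_files[] = $file;
        }
    }
    return $html_files;
}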
I just edited the answer; try this. (Note: I haven't tested the code.) Thanks.
Here's an alternative using RecursiveIteratorIterator, RecursiveDirectoryIterator and pathinfo().
<?php
$dir = '/';
$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir), RecursiveIteratorIterator::CHILD_FIRST);
foreach ( $iterator as $path )
if ( $path->isFile() && preg_match('/^html?$/i', pathinfo($path->getFilename(), PATHINFO_EXTENSION)) )
echo $path->getPathname() . PHP_EOL;
If you need to get the current working directory, you can use getcwd() (i.e. $dir = getcwd();).
To get the length of the content, you can do a few things. You could retrieve the contents of the file using file_get_contents and use strlen to calculate the length or str_word_count to count the words. Another option could be to use $path->getSize().
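To make the difference between these measurements concrete, here is a small sketch. The temporary file and its contents are assumptions made up for the illustration, and strip_tags is added so that the markup is not counted as words.

```php
<?php
// Assumption for illustration: a throwaway file with known contents.
$file = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($file, '<p>Hello world</p>');
clearstatcache(); // make sure the size below is not read from a stale stat cache

$contents = file_get_contents($file);
$chars = strlen($contents);                      // length in bytes, markup included
$words = str_word_count(strip_tags($contents));  // words in the visible text only
$info  = new SplFileInfo($file);
$bytes = $info->getSize();                       // size on disk, without reading the file

echo "$chars bytes read, $words words, $bytes bytes on disk", PHP_EOL;
unlink($file);
```

For a word index, only str_word_count measures what is actually needed; getSize() is the cheapest option when the goal is just to sort files by size.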
If you use an array to store the names and the sizes, you can then use a custom function and uasort to sort the array by size.
A more complete example:
<?php
function sort_by_size($a, $b)
{
if ( $a['size'] == $b['size'] )
return 0;
return ( $a['size'] < $b['size'] ? -1 : 1 );
}
$dir = '/';
$files = array();
$iterator = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir), RecursiveIteratorIterator::CHILD_FIRST);
foreach ( $iterator as $path )
if ( $path->isFile() && preg_match('/^html?$/i', pathinfo($path->getFilename(), PATHINFO_EXTENSION)) )
$files[] = array(
'name' => $path->getPathname(),
'size' => $path->getSize()
);
uasort($files, 'sort_by_size'); // the callback name must be a quoted string
The $files array can then be looped through using a foreach loop. Each entry contains both the pathname and the size.
Do you have any restrictions on the functions/classes you can use? If not, then check out RecursiveDirectoryIterator. It will let you go through the directories recursively, iterating over all the items in each one. You could then check the extension of each item and, if it matches, do your counting.
An alternative approach would be to use glob while iterating over the directories, which allows you to do a *.html search much like you would with the *nix utility find.
As far as counting goes, you might want to take a look at str_word_count.
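A minimal sketch of the counting step built on str_word_count could look like this. The function name count_words and the sample string are made up for the example. Note that str_word_count also treats apostrophes and hyphens as part of a word, so the result may differ slightly from the letters-only filter used in the question.

```php
<?php
// Hypothetical helper: returns array('word' => count, ...) for one HTML string.
function count_words($html)
{
    $text   = strtolower(strip_tags($html));
    $words  = str_word_count($text, 1);    // format 1: return the words as an array
    $counts = array_count_values($words);  // word => number of occurrences
    ksort($counts);                        // alphabetical, like the sorted index in the question
    return $counts;
}

print_r(count_words('<p>The cat and the hat.</p>'));
```

For a file on disk, the same function can be fed with file_get_contents($path).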