
Find folders that contain multiple matches to a regex/grep

I have a folder structure encompassing many thousands of folders. I would like to be able to find all the folders that, for example, contain multiple .txt files, or multiple .jpeg, or whatever without seeing any folders that contain only a single file of that kind.

Each folder is supposed to contain only one file of a given type, but this is not always the case, and finding the exceptions by hand is tedious.

Note that the folders may contain many other files.

If possible, I'd like to match "FILE.JPG" and "file.jpg" as both matching a query on "file" or "jpg".

What I have been doing is simply running find . -iname "*file*" and going through the output manually.

Folders contain other folders, sometimes 3 or 4 levels deep:

first/
  second/
    README.txt
    readme.TXT
    readme.txt
    foo.txt
  third/
    info.txt
  third/fourth/
    raksljdfa.txt

Should return

first/second/README.txt
first/second/readme.TXT
first/second/readme.txt
first/second/foo.txt

when searching for "txt"

and

first/second/README.txt
first/second/readme.TXT
first/second/readme.txt

when searching for "readme"

Something like this sounds like what you want:

find . -type f -print0 |
awk -v re='[.]txt$' '
BEGIN {
    RS = "\0"
    IGNORECASE = 1
}
{
    dir  = gensub("/[^/]+$","",1,$0)
    file = gensub("^.*/","",1,$0)
}
file ~ re {
    dir2files[dir][file]
}
END {
    for (dir in dir2files) {
        if ( length(dir2files[dir]) > 1 ) {
            for (file in dir2files[dir]) {
                print dir "/" file
            }
        }
    }
}'

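It's untested but should be close. It requires GNU awk (4.0 or later) for gensub(), IGNORECASE, true multi-dimensional arrays (arrays of arrays) and length(array). To search on part of the name instead of the extension, change the regex, e.g. -v re='readme'.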
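Where GNU awk is not available, the same group-by-directory idea can probably be approximated with find's -iname test and plain POSIX awk. The following is only a minimal sketch: the function name find_multi is made up for illustration, and it assumes that no file name contains a newline, because it reads newline-delimited paths rather than NUL-delimited ones.

```shell
#! /bin/bash

# find_multi ROOT PATTERN   (hypothetical helper, not part of the answer above)
# Print every regular file under ROOT whose name matches PATTERN
# (case-insensitively), but only for directories holding 2+ such files.
# Assumption: file names do not contain newline characters.
find_multi() {
    find "$1" -type f -iname "$2" -print |
    awk '
    {
        dir = $0
        sub("/[^/]*$", "", dir)   # strip the last path component
        count[dir]++              # number of matches seen in this directory
        path[NR] = $0             # remember each path and its directory
        pdir[NR] = dir
    }
    END {
        for (i = 1; i <= NR; i++)
            if (count[pdir[i]] > 1)
                print path[i]
    }'
}
```

It would be called as find_multi . '*.txt' or find_multi . '*readme*'. Unlike the gawk version it takes a glob rather than a regex, since the matching is delegated to find.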

This pure Bash code should do it (with caveats, see below):

#! /bin/bash

fileglob=$1             # E.g. '*.txt' or '*readme*'

shopt -s nullglob       # Expand to nothing if nothing matches
shopt -s dotglob        # Match files whose names start with '.'
shopt -s globstar       # '**' matches multiple directory levels
shopt -s nocaseglob     # Ignore case when matching

IFS=                    # Disable word splitting

for dir in **/ ; do
    matching_files=( "$dir"$fileglob )
    (( ${#matching_files[*]} > 1 )) && printf '%s\n' "${matching_files[@]}"
done

Supply the pattern to be matched as an argument to the program when you run it. For example:

myprog '*.txt'
myprog '*readme*'

(The quotes around the patterns are necessary to stop the shell from expanding them against files in the current directory before the program runs.)

The caveats regarding the code are:

  1. globstar was introduced with Bash 4.0. The code won't work with older Bash.
  2. Prior to Bash 4.3, globstar matches followed symlinks. This could lead to duplicate outputs, or even failures due to circular links.
  3. The **/ pattern expands to a list of all the directories in the hierarchy. This could take an excessively long time or use an excessive amount of memory if the number of directories is large (say, greater than ten thousand).
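The first two caveats depend on the Bash version, so a script could check it before choosing an approach. This is a rough sketch under stated assumptions: version_at_least is a hypothetical helper name, and it only handles version strings that start with dotted decimal numbers, as BASH_VERSION does.

```shell
#! /bin/bash

# version_at_least VERSION MAJOR MINOR   (hypothetical helper)
# Succeeds if VERSION (e.g. "5.1.16(1)-release") is at least MAJOR.MINOR.
version_at_least() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # text before the next dot
    [ "$major" -gt "$2" ] ||
        { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# BASH_VERSION is unset outside Bash, so fall back to "0.0".
if version_at_least "${BASH_VERSION:-0.0}" 4 3; then
    echo "Bash 4.3 or later: caveats 1 and 2 do not apply"
else
    echo "Older Bash or another shell: prefer a find-based variant"
fi
```

The helper uses only POSIX parameter expansion, so the check itself does not rely on any feature of the Bash version being tested.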

If your Bash is older than 4.3, or you have large numbers of directories, this code is a better option:

#! /bin/bash

fileglob=$1             # E.g. '*.txt' or '*readme*'

shopt -s nullglob       # Expand to nothing if nothing matches
shopt -s dotglob        # Match files whose names start with '.'
shopt -s nocaseglob     # Ignore case when matching

IFS=                    # Disable word splitting

find . -type d -print0 \
    |   while read -r -d '' dir ; do
            matching_files=( "$dir"/$fileglob )
            (( ${#matching_files[*]} > 1 )) \
                && printf '%s\n' "${matching_files[@]}"
        done
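One thing to be aware of is that a glob such as '*.txt' also matches directories with matching names, whereas the question asks about files. If that matters, the loop can test each match with -f. The following is a sketch only: list_multi is a made-up function name, and -f also accepts symlinks that point to regular files.

```shell
#! /bin/bash

shopt -s nullglob       # Expand to nothing if nothing matches
shopt -s dotglob        # Match files whose names start with '.'
shopt -s nocaseglob     # Ignore case when matching

# list_multi GLOB   (hypothetical helper)
# For each directory under '.', print the regular files matching GLOB,
# but only when the directory holds more than one of them.
list_multi() {
    local fileglob=$1 dir path files
    local IFS=          # Disable word splitting (globbing still happens)

    find . -type d -print0 |
    while read -r -d '' dir ; do
        files=()
        for path in "$dir"/$fileglob ; do
            # Keep regular files only; skip directories and other types
            [[ -f $path ]] && files+=( "$path" )
        done
        if (( ${#files[@]} > 1 )) ; then
            printf '%s\n' "${files[@]}"
        fi
    done
}
```

It would be called from the top of the hierarchy as list_multi '*.txt' or list_multi '*readme*'.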
