grep for two patterns independently (in different lines)

I have some directories with the following structure:

DAY1/ # Files under this directory should have DAY1 in the name.
|-- Date
|   |-- dir1 # Something wrong here, there are files with DAY2 and files with DAY1.
|   |-- dir2
|   |-- dir3
|   |-- dir4
DAY2/ # Files under this directory should all have DAY2 in the name.
|-- Date
|   |-- dir1
|   |-- dir2 # Something wrong here, there are files with DAY2, and files with DAY1.
|   |-- dir3
|   |-- dir4

In each dir there are hundreds of thousands of files with names containing DAY, for example 0.0000.DAY1.01927492. Files with DAY1 in the name should only appear under the parent directory DAY1.

Something went wrong when copying files around, and I now have a mix of DAY1 and DAY2 files in some of the dir directories.

I wrote a script to find folders that contain mixed files, so I can then look at them more closely. My script is the following:

for directory in */; do
    if ls "$directory" | grep -q DAY2; then
        if ls "$directory" | grep -q DAY1; then
            echo "mixed files in $directory"
        fi
    fi
done

The problem here is that I'm going through all the files twice, even though a single pass should be enough.

What would be a more efficient way to achieve what I want?

If I understand you correctly, you need to recursively find the files under the DAY1 directory that have DAY2 in their names, and similarly, under the DAY2 directory, the files that have DAY1 in their names.

If so, for DAY1 directory:

find DAY1/ -type f -name '*DAY2*'

This will get you the files under the DAY1 directory that have DAY2 in their names. Similarly, for the DAY2 directory:

find DAY2/ -type f -name '*DAY1*'

Both are recursive operations.
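
Not part of the original answer, but both checks can also be combined into a single find invocation; a sketch, assuming the DAY1 and DAY2 trees sit under the current directory (-path matches against the whole path find prints):

find DAY1 DAY2 -type f \( -path 'DAY1/*DAY2*' -o -path 'DAY2/*DAY1*' \)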


To get the directory names only:

find DAY1/ -type f -name '*DAY2*' -exec dirname {} +

Note that $PWD will be shown as . (a single dot).

To get uniqueness, pass the output to sort -u:

find DAY1/ -type f -name '*DAY2*' -exec dirname {} + | sort -u
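
sort -u has to buffer and sort the whole list; with hundreds of thousands of files, an awk filter that prints each directory only the first time it appears avoids the sort entirely (an alternative sketch, not from the original answer):

find DAY1/ -type f -name '*DAY2*' -exec dirname {} + | awk '!seen[$0]++'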

Given that the difference between going through the files once and going through them twice is just a factor of two, switching to an approach that goes through them only once might not actually be a win: the new approach might easily take twice as long per file.

So you'll definitely want to experiment; it's not necessarily something that you can confidently reason about.

However, I will say that in addition to going through the files twice, the ls version also sorts the files, which probably has a more-than-linear cost (unless it's doing some kind of bucket sort). Eliminating that, by writing ls --sort=none instead of just ls, will actually improve your algorithmic complexity, and is almost certain to give a tangible improvement.
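
Concretely, that change to the original loop would look something like this (a sketch; --sort=none and its short form -U are GNU ls options):

for directory in */; do
    if ls --sort=none "$directory" | grep -q DAY2; then
        if ls --sort=none "$directory" | grep -q DAY1; then
            echo "mixed files in $directory"
        fi
    fi
done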


But FWIW, here's a version that goes through the files only once, which you can try:

for directory in */; do
  find "$directory" -maxdepth 1 \( -name '*DAY1*' -or -name '*DAY2*' \) -print0 \
  | { saw_day1=
      saw_day2=
      # NUL-delimited read (matching find's -print0), with -r so
      # backslashes in filenames are not mangled:
      while IFS= read -r -d '' file ; do
        if [[ "$file" == *DAY1* ]] ; then
          saw_day1=1
        fi
        if [[ "$file" == *DAY2* ]] ; then
          saw_day2=1
        fi
        if [[ "$saw_day1" ]] && [[ "$saw_day2" ]] ; then
          echo "mixed files in $directory"
          break   # no need to read any further once both are seen
        fi
      done
    }
done
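
As a further point of comparison (not from the answer above), if you'd rather stay close to the original ls | grep pipeline, a single awk process can also track both patterns in one pass; a sketch, assuming GNU ls for -U (no sorting):

for directory in */; do
    if ls -U "$directory" | awk '/DAY1/ { a = 1 }
                                 /DAY2/ { b = 1 }
                                 a && b { exit }        # stop reading once both are seen
                                 END    { exit !(a && b) }'; then
        echo "mixed files in $directory"
    fi
done

awk's exit status drives the if: it exits 0 (success) only when both patterns appeared in the listing.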
