
Finding elements in common between two ksh or bash arrays efficiently

I am writing a Korn shell script. I have two arrays (say, arr1 and arr2 ), both containing strings, and I need to check which elements from arr1 are present (as whole strings or substrings) in arr2 . The most intuitive solution is to use nested for loops and check whether each element from arr1 can be found in arr2 (through grep ), like this:

for arr1Element in ${arr1[*]}; do
    for arr2Element in ${arr2[*]}; do
        # using grep to check if arr1Element is present in arr2Element
        echo $arr2Element | grep $arr1Element
    done
done

The issue is that arr2 has around 3000 elements, so running a nested loop takes a long time. I am wondering if there is a better way to do this in Bash.

If I were doing this in Java, I would calculate hashes for the elements of one array and then look those hashes up in the other, but I don't think Bash has any functionality for doing something like this (unless I were willing to write a hash-calculating function in Bash).

Any suggestions?

Since version 4.0 Bash has associative arrays:

$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world

You can use this in the same way you would a Java Map.

declare -A map
for el in "${arr1[@]}"; do 
    map[$el]="x"
done

for el in "${arr2[@]}"; do 
    if [ -n "${map[$el]}" ] ; then 
       echo "${el}"
    fi
done
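
For example (a minimal sketch with made-up sample arrays, not data from the question), the lookup loop prints exactly the strings that appear in both arrays:

# illustrative sample data
arr1=( alpha beta gamma )
arr2=( beta gamma delta )

declare -A map
for el in "${arr1[@]}"; do map[$el]=x; done   # index arr1 once

for el in "${arr2[@]}"; do
    # one hash lookup per arr2 element instead of a scan over arr1
    [ -n "${map[$el]}" ] && echo "$el"
done
# prints:
# beta
# gamma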

Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!

Since you're OK with using grep , and since you want to match substrings as well as full strings, one approach is to write:

printf '%s\n' "${arr2[@]}" \
  | grep -o -F "$(printf '%s\n' "${arr1[@]}")"

and let grep optimize as it sees fit.
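
As a quick illustration (sample arrays invented here, not from the question), -o makes grep print the arr1 string that matched rather than the whole arr2 element; drop -o to see the matching arr2 lines instead:

arr1=( string "hello world" )            # illustrative data
arr2=( stringbean "well, hello world!" )

printf '%s\n' "${arr2[@]}" \
  | grep -o -F "$(printf '%s\n' "${arr1[@]}")"
# prints:
# string
# hello world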

BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc) in bash with comm .

Assuming your values can't contain literal newlines, the following will emit one line for each item that appears in both arr1 and arr2:

comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
         <(printf '%s\n' "${arr2[@]}" | sort -u)

If your arrays are pre-sorted, you can remove the sort calls (which makes this extremely memory- and time-efficient with large arrays, more so than the grep -based approach).
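
For instance, if both arrays were already populated in sorted order with no duplicates (an illustrative sketch, not something the question guarantees), the call reduces to:

# assumes arr1 and arr2 are already sorted, duplicate-free,
# and contain no literal newlines
comm -12 <(printf '%s\n' "${arr1[@]}") \
         <(printf '%s\n' "${arr2[@]}")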

Here's a bash/awk idea:

# some sample arrays

$ arr1=( my first string "hello world" )
$ arr2=( my last stringbean strings "well, hello world!" )

# break array elements into separate lines

$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world

$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!

# use the 'printf' command output as input to our awk command

$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
      if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
  <( printf '%s\n' "${arr2[@]}" )

array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
  • NR==FNR : for file #1 only, store each line in the awk array 'a'
  • next : move to the next line of file #1; the rest of the awk script is skipped for file #1. Then, for each line in file #2 ...
  • for (i in a) : for each index 'i' in array 'a' ...
  • if ($0 ~ a[i]) : check whether a[i] matches (as a substring/regex) the current line ($0) from file #2, and if so ...
  • print "array1... : output info about the match

A test run using the following data:

arr1 == 3300 elements
arr2 ==  500 elements

When all arr2 elements have a substring/pattern match in arr1 (i.e., 500 matches), the total run time is ~27 seconds ... so the repetitive looping through the array takes a toll.

Obviously (?) we need to reduce the volume of repetitive actions ...

  • for an exact string match the comm solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds)
  • for a substring/pattern match I was able to get an egrep solution to run in about 5 seconds (see my other answer/post)

An egrep solution for substring/pattern matching ...

egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
         <(printf '%s\n'     "${arr2[@]}")
  • egrep -f : read the search patterns from the file given to -f , which in this case is ...
  • <(printf '.*%s.*\n' "${arr1[@]}") : convert arr1 elements into one pattern per line, adding the regex wildcard .* as prefix and suffix (see the short example below)
  • <(printf '%s\n' "${arr2[@]}") : convert arr2 elements into one string per line
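
To see what egrep is actually being fed, the first printf can be run on its own; with a small illustrative arr1 (made-up data) it emits one wrapped pattern per line:

arr1=( my string "hello world" )   # illustrative data
printf '.*%s.*\n' "${arr1[@]}"
# output:
# .*my.*
# .*string.*
# .*hello world.*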

When run against a sample data set like:

arr1 == 3300 elements
arr2 ==  500 elements

... with 500 matches, the total run time is ~5 seconds; there's still a good bit of repetitive processing going on within egrep , but not as bad as with my other ( bash/awk ) answer ... and of course not as fast as the comm solution, which eliminates the repetitive processing.
