I am writing a Korn shell script. I have two arrays (say, arr1 and arr2), both containing strings, and I need to check which elements from arr1 are present (as whole strings or substrings) in arr2. The most intuitive solution is nested for loops, checking whether each element of arr1 can be found in arr2 (through grep), like this:
for arr1Element in "${arr1[@]}"; do
    for arr2Element in "${arr2[@]}"; do
        # using grep to check if arr1Element is present in arr2Element
        echo "$arr2Element" | grep "$arr1Element"
    done
done
The issue is that arr2 has around 3000 elements, so running a nested loop takes a long time. I am wondering if there is a better way to do this in Bash.
If I were doing this in Java, I could have calculated hashes for elements in one of the arrays, and then looked for those hashes in the other array, but I don't think Bash has any functionality for doing something like this (unless I was willing to write a hash calculating function in Bash).
Any suggestions?
Since version 4.0, Bash has had associative arrays:
$ declare -A elements
$ elements[hello]=world
$ echo ${elements[hello]}
world
You can use this in the same way you would a Java Map.
declare -A map
for el in "${arr1[@]}"; do
    map[$el]="x"
done

for el in "${arr2[@]}"; do
    if [ -n "${map[$el]}" ]; then
        echo "${el}"
    fi
done
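As a quick end-to-end sketch of the idea above (the sample values and the matches variable are my own illustration, not from the question's data):

```shell
#!/usr/bin/env bash
# Hypothetical sample data: find elements of arr2 that appear verbatim in arr1
arr1=( apple banana cherry )
arr2=( banana cherry durian )

# Build the "set" from arr1 (keys are what matter, values are throwaway)
declare -A map
for el in "${arr1[@]}"; do
    map[$el]="x"
done

# Probe the set with each arr2 element -- O(1) lookup per element,
# instead of scanning all of arr1 every time
matches=()
for el in "${arr2[@]}"; do
    if [ -n "${map[$el]}" ]; then
        matches+=( "$el" )
    fi
done

printf '%s\n' "${matches[@]}"   # prints: banana, cherry (one per line)
```

Note this only covers exact (whole-string) matches, as the answer says; substrings need one of the approaches below.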
Dealing with substrings is an altogether more weighty problem, and would be a challenge in any language, short of the brute-force algorithm you're already using. You could build a binary-tree index of character sequences, but I wouldn't try that in Bash!
Since you're OK with using grep, and since you want to match substrings as well as full strings, one approach is to write:

printf '%s\n' "${arr2[@]}" \
    | grep -o -F "$(printf '%s\n' "${arr1[@]}")"

and let grep optimize as it sees fit.
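A small runnable sketch of this pipeline with made-up sample arrays (the result variable and data are my own; I drop the -o flag here so the full matching arr2 line is printed rather than just the matched fragment):

```shell
#!/usr/bin/env bash
# Hypothetical sample data: which arr2 elements contain any arr1 string?
arr1=( string "hello world" )
arr2=( stringbean strings "well, hello world!" unrelated )

# grep -F treats each line of the pattern argument as a fixed string and
# checks every input line against all patterns in a single pass.
result=$(printf '%s\n' "${arr2[@]}" \
    | grep -F "$(printf '%s\n' "${arr1[@]}")")

printf '%s\n' "$result"
# prints:
#   stringbean
#   strings
#   well, hello world!
```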
BashFAQ #36 describes doing set arithmetic (unions, disjoint sets, etc.) in bash with comm.
Assuming your values can't contain literal newlines, the following will emit a line per item in both arr1 and arr2:
comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
<(printf '%s\n' "${arr2[@]}" | sort -u)
If your arrays are pre-sorted, you can remove the sorts (which will make this extremely memory- and time-efficient with large arrays, moreso than the grep-based approach).
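A runnable sketch of the comm approach with hypothetical sample arrays (the common variable is my own name for the captured intersection):

```shell
#!/usr/bin/env bash
# Hypothetical sample data: exact-match intersection of two arrays
arr1=( my first string "hello world" )
arr2=( my last strings "hello world" )

# comm requires sorted input; -12 suppresses lines unique to either side,
# leaving only lines common to both
common=$(comm -12 <(printf '%s\n' "${arr1[@]}" | sort -u) \
                  <(printf '%s\n' "${arr2[@]}" | sort -u))

printf '%s\n' "$common"
# prints (in sort order):
#   hello world
#   my
```

Note that, like the associative-array answer, this finds exact matches only ("strings" is not reported even though "string" is a substring of it).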
Here's a bash/awk idea:
# some sample arrays
$ arr1=( my first string "hello world" )
$ arr2=( my last stringbean strings "well, hello world!" )
# break array elements into separate lines
$ printf '%s\n' "${arr1[@]}"
my
first
string
hello world
$ printf '%s\n' "${arr2[@]}"
my
last
stringbean
strings
well, hello world!
# use the 'printf' command output as input to our awk command
$ awk '
NR==FNR { a[NR]=$0 ; next }
{ for (i in a)
if ($0 ~ a[i]) print "array1 string {"a[i]"} is a substring of array2 string {"$0"}" }
' <( printf '%s\n' "${arr1[@]}" ) \
<( printf '%s\n' "${arr2[@]}" )
array1 string {my} is a substring of array2 string {my}
array1 string {string} is a substring of array2 string {stringbean}
array1 string {string} is a substring of array2 string {strings}
array1 string {hello world} is a substring of array2 string {well, hello world!}
NR==FNR : for file #1 only: store elements into awk array named 'a'
next : process next line in file #1; at this point the rest of the awk script is ignored for file #1
Then, for each line in file #2 ...
for (i in a) : for each index 'i' in array 'a' ...
if ($0 ~ a[i]) : see if a[i] is a substring of the current line ($0) from file #2, and if so ...
print "array1... : output info about the match

A test run using the following data:
arr1 == 3300 elements
arr2 == 500 elements
When all arr2 elements have a substring/pattern match in arr1 (ie, 500 matches), total time to run is ~27 seconds ... so the repetitive looping through the array takes a toll.

Obviously (?) we need to reduce the volume of repetitive actions ...
The comm solution by Charles Duffy makes sense (it runs against the same 3300/500 test set in about 0.5 seconds). I was able to get an egrep solution to run in about 5 seconds (see my other answer/post).

An egrep solution for substring/pattern matching ...
solution for substring/pattern matching ...
egrep -f <(printf '.*%s.*\n' "${arr1[@]}") \
<(printf '%s\n' "${arr2[@]}")
egrep -f : take patterns to search from the file designated by -f, which in this case is ...
<(printf '.*%s.*\n' "${arr1[@]}") : convert arr1 elements into 1 pattern per line, appending a regex wildcard (.*) as prefix and suffix
<(printf '%s\n' "${arr2[@]}") : convert arr2 elements into 1 string per line

When run against a sample data set like:
arr1 == 3300 elements
arr2 == 500 elements
... with 500 matches, total run time is ~5 seconds; there's still a good bit of repetitive processing going on with egrep, but not as bad as seen with my other answer (bash/awk) ... and of course not as fast as the comm solution, which eliminates the repetitive processing.
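A runnable sketch of this egrep approach with hypothetical sample arrays (I use grep -E, the modern spelling of egrep, and the hits variable is my own name):

```shell
#!/usr/bin/env bash
# Hypothetical sample data: which arr2 elements contain any arr1 pattern?
arr1=( string "hello world" )
arr2=( stringbean strings "well, hello world!" unrelated )

# -f reads one pattern per line from the process substitution; the .* wrappers
# mirror the answer above (grep already matches anywhere in the line, so they
# are belt-and-braces rather than strictly required)
hits=$(grep -E -f <(printf '.*%s.*\n' "${arr1[@]}") \
               <(printf '%s\n' "${arr2[@]}"))

printf '%s\n' "$hits"
# prints:
#   stringbean
#   strings
#   well, hello world!
```

Unlike the -F variant, arr1 elements are treated as regexes here, so any regex metacharacters in them (., *, [, etc.) would need escaping.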