How to get Jaccard similarity between two array columns in a table in Snowflake

Question

I'm basing my question on this previous one, which lacked sample data and desired results:

How to perform Jaccard similarity between two array columns in a table in Snowflake

I want to write a UDF in Snowflake that can compute Jaccard similarity between two arrays:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

select jaccard_sim(a, b)
from data

The desired results for the two rows above are 0.6 and 0.5: the first pair shares 3 of 5 distinct elements, the second 2 of 4.

Definition: https://en.wikipedia.org/wiki/Jaccard_index

Answer 1

I wrote a JS UDF to perform the desired computation:

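create or replace function jaccard_sim(A array, B array)
returns float
language javascript
as $$
// Deduplicate both inputs so repeated elements are counted once.
var setA = new Set(A);
var setB = new Set(B);

// Size of the union: distinct elements across both arrays.
var union = new Set([...setA, ...setB]).size;

// Size of the intersection: elements of A that also appear in B.
var intersection = [...setA].filter(x => setB.has(x)).length;

// Two empty arrays have no defined similarity; return NULL instead of NaN.
return union === 0 ? null : intersection / union;
$$;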

With this, select jaccard_sim(a, b) from data will work as expected.
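
To check the set logic outside Snowflake, the same computation can be run as plain JavaScript (for example in Node.js). This is only a sketch: jaccardSim is a hypothetical helper name, and returning null for two empty inputs is an assumption about how that edge case should be handled.

```javascript
// Standalone version of the UDF logic, for testing outside Snowflake.
// jaccardSim is a hypothetical name, not part of any Snowflake API.
function jaccardSim(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  // Distinct elements across both inputs.
  const union = new Set([...setA, ...setB]).size;
  // Distinct elements of a that also appear in b.
  const intersection = [...setA].filter((x) => setB.has(x)).length;
  // Assumption: two empty inputs yield null rather than NaN.
  return union === 0 ? null : intersection / union;
}

console.log(jaccardSim([1, 2, 3, 4], [1, 2, 3, 5])); // 0.6 (3 shared / 5 distinct)
console.log(jaccardSim([20, 30, 90], [20, 40, 90])); // 0.5 (2 shared / 4 distinct)
```

Building each Set once keeps the membership test cheap, and duplicates in either array do not change the result.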

I got the set operations for JS from https://exploringjs.com/impatient-js/ch_sets.html#union-ab .

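The UDF above solves the problem. As a bonus, this is how to use Snowflake's native approximate_similarity (also available as approximate_jaccard_index), which estimates the Jaccard index from MinHash states instead of computing it exactly: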

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)


select approximate_similarity(mh), seq, array_agg(arr)
from (
    select minhash(1023, value) mh, seq, any_value(a) arr
    from data, table(flatten(a))
    group by seq
    union all
    select minhash(1023, value) mh, seq, any_value(b) arr
    from data, table(flatten(b))
    group by seq
)
group by seq
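
For intuition about what the query above estimates, here is a minimal MinHash sketch in plain JavaScript. It is not Snowflake's implementation: hashWithSeed, minhashSignature, and estimateJaccard are hypothetical names, and the seeded hash is an arbitrary choice made for this example. The k parameter mirrors the 1023 passed to minhash in the query.

```javascript
// Illustrative MinHash estimate of the Jaccard index.
// Assumption: this seeded hash is made up for the sketch; Snowflake's
// MINHASH uses its own internal hash functions.
function hashWithSeed(value, seed) {
  // FNV-1a style pass over the value's string form, starting from a seeded state.
  let h = (2166136261 ^ seed) >>> 0;
  const s = String(value);
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  // Final bit mixing so that different seeds order the elements differently.
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// For each of k seeded hash functions, keep the smallest hash seen in the set.
function minhashSignature(values, k) {
  const distinct = [...new Set(values)];
  const signature = [];
  for (let seed = 0; seed < k; seed++) {
    let min = Infinity;
    for (const v of distinct) {
      min = Math.min(min, hashWithSeed(v, seed));
    }
    signature.push(min);
  }
  return signature;
}

// The fraction of positions where the two signatures agree estimates the Jaccard index.
function estimateJaccard(sigA, sigB) {
  let matches = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) matches++;
  }
  return matches / sigA.length;
}

const k = 1023;
const sigA = minhashSignature([1, 2, 3, 4], k);
const sigB = minhashSignature([1, 2, 3, 5], k);
// Prints an estimate of the exact similarity (0.6); it will usually not match exactly.
console.log(estimateJaccard(sigA, sigB));
```

The estimate works because, for a well-behaved hash, the element with the smallest hash in the union falls in the intersection with probability equal to the Jaccard index. For arrays this small the exact UDF is the simpler choice; signatures pay off when the sets are too large to compare element by element.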

solution1
0 2022-07-26 01:00:48
