How to get Jaccard similarity between two array columns in a table in Snowflake

I'm basing my question on this previous one, which lacked sample data and desired results:

I want to write a UDF in Snowflake that can compute Jaccard similarity between two arrays:

with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

select jaccard_sim(a, b)
from data

The desired results are 0.6 and 0.5, for the previous two examples.

Definition: https://en.wikipedia.org/wiki/Jaccard_index

I wrote a JS UDF to perform the desired computation:

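create or replace function jaccard_sim(A array, B array)
returns float
language javascript
as $$
// Deduplicate each input once (argument names are uppercase inside the JS body).
var setA = new Set(A);
var setB = new Set(B);

// Size of the union of A and B.
var union = new Set([...setA, ...setB]).size;

// Size of the intersection: elements of A that also appear in B.
var intersection = [...setA].filter(x => setB.has(x)).length;

return intersection / union;
$$;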

With this, select jaccard_sim(a, b) from data will work as expected.
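To check the logic outside Snowflake, the same computation can be sketched as a standalone JavaScript function. The name jaccardSim and the choice to return null for two empty arrays (instead of the NaN that 0/0 would give) are my own assumptions, not part of the UDF above.

```javascript
// Standalone sketch of the UDF logic, runnable in Node.js or a browser console.
// Assumption: two empty inputs return null rather than NaN (0 / 0).
function jaccardSim(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);

  // Size of the union of both sets.
  const unionSize = new Set([...setA, ...setB]).size;
  if (unionSize === 0) return null;

  // Size of the intersection: elements of A that are also in B.
  const intersectionSize = [...setA].filter(x => setB.has(x)).length;

  return intersectionSize / unionSize;
}

console.log(jaccardSim([1, 2, 3, 4], [1, 2, 3, 5])); // 0.6
console.log(jaccardSim([20, 30, 90], [20, 40, 90])); // 0.5
```

Both sample rows share all but one element on each side, so the intersections are 3 and 2 and the unions are 5 and 4, which gives 3/5 and 2/4.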

I got the set operations for JS from https://exploringjs.com/impatient-js/ch_sets.html#union-ab .


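The UDF above solves the problem. As a bonus, this is how to use Snowflake's native approximate_similarity (also available as approximate_jaccard_index), which estimates the Jaccard index from MinHash states instead of computing it exactly: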

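with data as (
    select [1,2,3,4] a, [1,2,3,5] b
    union all select [20,30,90], [20,40,90]
)

-- Each input row yields two MinHash states (one for a, one for b);
-- grouping by seq compares the two states that belong to the same row.
select approximate_similarity(mh), seq, array_agg(arr)
from (
    -- MinHash state built from the elements of a, using 1023 hash functions
    select minhash(1023, value) mh, seq, any_value(a) arr
    from data, table(flatten(a))
    group by seq
    union all
    -- MinHash state built from the elements of b
    select minhash(1023, value) mh, seq, any_value(b) arr
    from data, table(flatten(b))
    group by seq
)
group by seq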
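For intuition about what a MinHash-based estimate does, here is a rough JavaScript sketch of the general technique. It is not Snowflake's implementation: the helper hashWithSeed and the names minhashSignature and approximateSimilarity are hypothetical and chosen only for illustration, and the hash mixing is a simple ad hoc choice.

```javascript
// Illustrative MinHash sketch. NOT Snowflake's actual implementation.

// Hypothetical helper: a seeded 32-bit hash (FNV-1a style loop plus a final mix).
function hashWithSeed(value, seed) {
  let h = (2166136261 ^ seed) >>> 0;
  const s = String(value);
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  // Final bit mixing so that different seeds order the items differently.
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;
}

// Signature: for each of k seeded hash functions, keep the minimum hash over the set.
function minhashSignature(items, k) {
  const sig = new Array(k).fill(Infinity);
  for (const item of new Set(items)) {
    for (let seed = 0; seed < k; seed++) {
      const h = hashWithSeed(item, seed);
      if (h < sig[seed]) sig[seed] = h;
    }
  }
  return sig;
}

// Estimate: fraction of signature positions where both sets have the same minimum.
function approximateSimilarity(sigA, sigB) {
  let matches = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) matches++;
  }
  return matches / sigA.length;
}

const sigA = minhashSignature([1, 2, 3, 4], 1023);
const sigB = minhashSignature([1, 2, 3, 5], 1023);
// Prints an estimate of the Jaccard index; the value depends on the hash functions used.
console.log(approximateSimilarity(sigA, sigB));
```

The idea is that, for a well-behaved hash function, the element with the smallest hash in the union is also in the intersection with probability equal to the Jaccard index, so the share of matching minima approximates it. The result is an estimate, which is why the exact UDF is the better fit when the arrays are small.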
