I'm basing my question on this previous one, which lacked sample data and desired results:
I want to write a UDF in Snowflake that can compute Jaccard similarity between two arrays:
with data as (
select [1,2,3,4] a, [1,2,3,5] b
union all select [20,30,90], [20,40,90]
)
select jaccard_sim(a, b)
from data
The desired results are 0.6 and 0.5 for the two rows above.
Definition: https://en.wikipedia.org/wiki/Jaccard_index
I wrote a JS UDF to perform the desired computation:
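create or replace function jaccard_sim(A array, B array)
returns float
language javascript
as $$
  // Snowflake exposes UDF arguments to JavaScript in uppercase: A and B.
  var setA = new Set(A);
  var setB = new Set(B);  // built once, not once per filtered element
  // Size of the union: distinct elements across both arrays.
  var union = new Set([...A, ...B]).size;
  // Size of the intersection: distinct elements of A also present in B.
  var intersection = [...setA].filter(x => setB.has(x)).length;
  return intersection / union;
$$;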
With this, select jaccard_sim(a, b) from data will work as expected.
I got the set operations for JS from https://exploringjs.com/impatient-js/ch_sets.html#union-ab .
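For anyone who wants to check the set logic outside Snowflake, here is a minimal sketch in plain JavaScript (runnable in Node.js). The name jaccardSim is a hypothetical helper for illustration, not part of any Snowflake API, and returning 0 for two empty arrays is my own assumption; the UDF above does not special-case that input.

```javascript
// Plain-JavaScript sketch of the same set logic, independent of Snowflake.
// jaccardSim is a hypothetical helper name used only for this illustration.
function jaccardSim(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);

  // Count the distinct elements of A that also appear in B.
  let intersection = 0;
  for (const x of setA) {
    if (setB.has(x)) intersection++;
  }

  // For deduplicated sets: |A union B| = |A| + |B| - |A intersect B|.
  const union = setA.size + setB.size - intersection;

  // Assumption: two empty arrays give 0 instead of NaN (0 / 0).
  return union === 0 ? 0 : intersection / union;
}

console.log(jaccardSim([1, 2, 3, 4], [1, 2, 3, 5])); // 3 shared of 5 distinct
console.log(jaccardSim([20, 30, 90], [20, 40, 90])); // 2 shared of 4 distinct
```

The two calls correspond to the sample rows in the question: 3/5 and 2/4.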
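The UDF above solves the problem. As a bonus, this is how the native Snowflake approximate_similarity (also available as approximate_jaccard_index) works. It estimates the Jaccard index from MinHash states, so the result is an approximation and not necessarily the exact value: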
with data as (
select [1,2,3,4] a, [1,2,3,5] b
union all select [20,30,90], [20,40,90]
)
select approximate_similarity(mh), seq, array_agg(arr)
from (
select minhash(1023, value) mh, seq, any_value(a) arr
from data, table(flatten(a))
group by seq
union all
select minhash(1023, value) mh, seq, any_value(b) arr
from data, table(flatten(b))
group by seq
)
group by seq