SQL REDDIT - Jaccard Similarity

I am trying to implement a fancy SQL query but am having trouble with the join and the count.

I have a very long table of data:

author | group  | id
daniel | group1 | 118
adam   | group2 | 126
harry  | group1 | 221
daniel | group2 | 323
daniel | group2 | 122
daniel | group5 | 322
harry  | group1 | 222
harry  | group1 | 225

...

I want my output to look like:

author1 | author2 | intersection | union
daniel  | adam    | 2            | 3
daniel  | harry   | 2            | 11
adam    | harry   | 0            | 10

where intersection is the number of groups that author1 and author2 have in common, and union = (# of groups of author1) + (# of groups of author2) - intersection.
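To make the two definitions concrete, here is a minimal Python sketch for a single pair of authors. It uses only the groups visible in the sample rows above (the real table is truncated, so these numbers are not expected to match the illustrative output), and the variable names are made up for this example.

```python
# Distinct groups per author, taken only from the sample rows shown above.
daniel = {"group1", "group2", "group5"}
adam = {"group2"}

# intersection: number of groups both authors appear in.
intersection = len(daniel & adam)

# union: groups of author1 + groups of author2 - intersection.
union = len(daniel) + len(adam) - intersection

# Jaccard similarity is the ratio of the two counts.
jaccard = intersection / union

print(intersection, union, jaccard)
```

The same two counts per author pair are what the SQL query has to produce; the ratio can then be computed in the outer SELECT or in client code.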

I think the proper way to do this is something like

table a left join table b on a.group = b.group

but I can't figure out how to do the aggregate count.

Thanks.

"Jumping in" because 1) I still don't see any answer and 2) I saw the author's related question with the BigQuery tag.

So, theoretically, the queries below would do your task (the examples use the bigquery-samples.reddit.full table):

BigQuery Legacy SQL:

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.subr = b.subr) AS count_intersection,
  EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

BigQuery Standard SQL:

WITH subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a 
JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
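The join shape used above can be checked on a toy table before running it at scale. The sketch below is not BigQuery: it assumes Python's built-in sqlite3 module, a hypothetical table named posts holding the sample rows from the question, a column named grp (because GROUP is a reserved word), and SUM(a.grp = b.grp) as a stand-in for COUNTIF.

```python
import sqlite3

# Sample rows from the question: (author, group, id).
rows = [
    ("daniel", "group1", 118),
    ("adam", "group2", 126),
    ("harry", "group1", 221),
    ("daniel", "group2", 323),
    ("daniel", "group2", 122),
    ("daniel", "group5", 322),
    ("harry", "group1", 222),
    ("harry", "group1", 225),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (author TEXT, grp TEXT, id INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)

# Same structure as the Standard SQL query: deduplicate (author, group),
# pair every two authors, then count matching and distinct groups.
query = """
WITH subrs AS (
  SELECT DISTINCT author, grp FROM posts
)
SELECT
  a.author AS author1,
  b.author AS author2,
  SUM(a.grp = b.grp) AS count_intersection,
  COUNT(DISTINCT a.grp) + COUNT(DISTINCT b.grp) - SUM(a.grp = b.grp) AS count_union
FROM subrs AS a
JOIN subrs AS b ON a.author < b.author
GROUP BY a.author, b.author
ORDER BY a.author, b.author
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because each (author, group) pair appears once after deduplication, the number of joined rows with equal groups is exactly the size of the intersection.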

If you try to run either BigQuery query above, you will most likely get the following error:

An internal error occurred and the request could not be completed

The reason is that each of those two queries produces about a trillion rows as a result of the join (see the stats below). There are many ways to address this; the way proposed below is to tune the requirements. Do you really need to involve in the algorithm "light" authors with, let's say, just one or two subreddits? Or do you really want to find similarity between authors who have only a few comments in specific subreddits?
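A rough way to see why the join explodes: for every pair of authors it emits one row per combination of their subreddits. The Python sketch below counts those rows from a list of per-author subreddit counts. The function names and the input numbers are hypothetical illustrations, not statistics from the reddit table.

```python
from itertools import combinations


def joined_row_count(subrs_per_author):
    """Rows emitted by the author-pair join: sum of |A| * |B| over all pairs."""
    return sum(x * y for x, y in combinations(subrs_per_author, 2))


def joined_row_count_closed_form(subrs_per_author):
    """Same count without enumerating pairs: ((sum n)^2 - sum n^2) / 2."""
    total = sum(subrs_per_author)
    squares = sum(n * n for n in subrs_per_author)
    return (total * total - squares) // 2


# Three authors with 3, 1 and 1 distinct subreddits.
small = [3, 1, 1]
print(joined_row_count(small), joined_row_count_closed_form(small))

# Hypothetical: 1,000 authors with 10 subreddits each.
uniform = [10] * 1000
print(joined_row_count_closed_form(uniform))
```

The count grows roughly with the square of the number of (author, subr) rows, which is why dropping low-activity authors before the join shrinks the intermediate result so sharply.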

See below how introducing extra limits helps in executing the above queries (note: lines is the minimum number of entries per author per subr, and subrs is the minimum number of subrs per author):

[image not preserved]

Below is a version that actually produces a result without any type of failure:

Standard SQL

WITH authors AS (
  SELECT author FROM (
    SELECT author, COUNT(1) AS subrs FROM (
      SELECT author, subr, COUNT(1) AS lines 
      FROM `bigquery-samples.reddit.full` 
      GROUP BY 1, 2
      HAVING lines > 1
    ) 
    GROUP BY author
    HAVING subrs > 3
  )
),
subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  WHERE author IN (SELECT author FROM authors)
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
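The author filter in the authors CTE above can be sketched in plain Python as follows. This is only an illustration of the two thresholds (more than 1 line per subreddit, more than 3 such subreddits per author); the function name, parameter names, and test data are assumptions made for this example.

```python
from collections import Counter


def select_authors(comments, min_lines=1, min_subrs=3):
    """Return authors with more than min_subrs subreddits in which
    they have more than min_lines comments.

    comments: iterable of (author, subr) pairs, one pair per comment.
    """
    # (author, subr) -> number of comments, like GROUP BY 1, 2 with COUNT(1).
    lines = Counter(comments)
    # author -> number of subreddits passing HAVING lines > min_lines.
    subrs = Counter(
        author for (author, _subr), n in lines.items() if n > min_lines
    )
    # Authors passing HAVING subrs > min_subrs.
    return {author for author, n in subrs.items() if n > min_subrs}


comments = []
for s in ["s1", "s2", "s3", "s4"]:
    comments += [("a", s)] * 2      # four subreddits, two comments each
for s in ["s1", "s2", "s3"]:
    comments += [("b", s)] * 2      # three subreddits with two comments
comments += [("b", "s4")]           # fourth subreddit has only one comment
comments += [("c", "s1")] * 10      # many comments, but a single subreddit

kept = select_authors(comments)
print(kept)
```

Note that the subrs CTE in the query then keeps every subreddit of the selected authors, including those with a single comment; the thresholds only decide which authors take part in the join.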

In a similar way you can adjust the Legacy SQL version to make it work.

This might not be the best way, but at least it gives some hope that such tasks can be run easily within BigQuery, without resorting to other workarounds.

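CREATE OR REPLACE FUNCTION public.jaccard_similarity(IN vector anyarray)
    RETURNS double precision[]
    LANGUAGE 'plpgsql'

AS $BODY$
BEGIN
    -- One score per row of public.tbl_topic: |input INTERSECT row| / |input UNION row|.
    -- "TOPIC_VECTOR" must be an array column with the same element type as the input.
    RETURN ARRAY(
        SELECT
            -- Cast to double precision so the division is not integer division.
            (SELECT COUNT(*) FROM (SELECT unnest($1) INTERSECT SELECT unnest(t."TOPIC_VECTOR")) AS intersect_elements)::double precision
            -- NULLIF returns NULL instead of raising division by zero when both arrays are empty.
            / NULLIF((SELECT COUNT(*) FROM (SELECT unnest($1) UNION SELECT unnest(t."TOPIC_VECTOR")) AS union_elements), 0)
        FROM public.tbl_topic AS t
    );

END;
$BODY$;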

ALTER FUNCTION public.jaccard_similarity(anyarray)
    OWNER TO postgres;

COMMENT ON FUNCTION public.jaccard_similarity(anyarray)
    IS 'This function calculates the Jaccard similarity of the input vector with every vector in the database.';

One can use this function for reference. Thank you.
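For readers without a PostgreSQL instance at hand, the computation this function appears to be aiming for (one Jaccard score between the input array and each stored array) can be sketched in Python. This is my reading of the intent, not a translation of the function; the parameter topic_vectors is a hypothetical stand-in for the "TOPIC_VECTOR" column of public.tbl_topic.

```python
def jaccard_similarity(vector, topic_vectors):
    """Return one Jaccard score per stored vector.

    vector: the input array.
    topic_vectors: list of stored arrays (hypothetical stand-in for the table).
    A score is None when both arrays are empty, since the union is then empty.
    """
    query = set(vector)
    scores = []
    for stored in topic_vectors:
        stored_set = set(stored)
        union = query | stored_set
        if union:
            scores.append(len(query & stored_set) / len(union))
        else:
            scores.append(None)
    return scores


scores = jaccard_similarity([1, 2, 3], [[2, 3, 4], [1, 2, 3], [7, 8]])
print(scores)
```

Duplicates inside an array do not affect the score, because both the intersection and the union are taken over distinct elements.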
