FIRST() and LAST() for MATCH_RECOGNIZE

We are analyzing streaming Twitter data to find users who post similar (almost identical) tweets over and over. I am using MATCH_RECOGNIZE for this. It is able to find the pattern, but I am not able to get the FIRST() and LAST() values correctly. Here is a sample dataset:

[image: sample dataset]

I am using the following query:

SELECT 
  USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
    
  ONE ROW PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
    
);

This correctly identifies the users that are posting the same tweets over and over:

[image: query result]

But I am not able to get the first tweet and the last tweet in the matching sequence.

There are multiple things at play.

The first is that only one row triggers a match: SIMILAR compares each row with the previous one via LAG, so the first row of the sequence never satisfies it, and FIRST and LAST are both the second row of your data. This can be seen by changing to ALL ROWS PER MATCH:

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa')
)
SELECT 
*
FROM TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    match_number() as match_number,
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
  ALL ROWS PER MATCH
  PATTERN (SIMILAR+)
  DEFINE
    SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);

USERID  TWEETID  TWEET  MATCH_NUMBER  FIRST_TWEET  LAST_TWEET  FIRST_TWEET_ID  LAST_TWEET_ID  NUM_OF_TWEETS
elena   2        aaaa   1             aaaa         aaaa        2               2              1
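
To see why the first row is left out, it can help to compute the same LAG comparison with ordinary window functions outside MATCH_RECOGNIZE. This is only a sketch against the two-row test data above; the column aliases prev_tweet and similarity_to_prev are made up for illustration.

```sql
with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa')
)
select
    userid,
    tweetid,
    tweet,
    -- the first row of each partition has no previous row, so this is NULL there
    lag(tweet) over (partition by userid order by tweetid) as prev_tweet,
    -- same expression as the SIMILAR definition, without the > 90 test
    jarowinkler_similarity(
        tweet,
        lag(tweet) over (partition by userid order by tweetid)
    ) as similarity_to_prev
from tweets
order by userid, tweetid;
```

Row 1 has nothing to compare against, so its similarity cannot exceed 90 and it can never be classified as SIMILAR.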

If you change the pattern so that it catches the first row (using LEAD) as well as the following rows (using LAG):

  ALL ROWS PER MATCH
  PATTERN (SIMILAR_before SIMILAR_after+)
  DEFINE
    SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
    SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90

you now match both the first and the later rows:

USERID  TWEETID  TWEET  MATCH_NUMBER  FIRST_TWEET  LAST_TWEET  FIRST_TWEET_ID  LAST_TWEET_ID  NUM_OF_TWEETS
elena   1        aaa    1             aaa          aaa         1               1              1
elena   2        aaaa   1             aaa          aaaa        1               2              2

Now, if we expand our test a little with four rows of data:

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('elena', 3, 'aaa'),
    ('elena', 4, 'aaaa')
)

USERID  TWEETID  TWEET  MATCH_NUMBER  FIRST_TWEET  LAST_TWEET  FIRST_TWEET_ID  LAST_TWEET_ID  NUM_OF_TWEETS
elena   1        aaa    1             aaa          aaa         1               1              1
elena   2        aaaa   1             aaa          aaaa        1               2              2
elena   3        aaa    1             aaa          aaa         1               3              3
elena   4        aaaa   1             aaa          aaaa        1               4              4

We see that rows are not registered twice, even though the middle rows satisfy both definitions.

But we also see that the first ID is correct for all rows, while the last ID is evaluated only up to the current matched row, not over the whole match as you are hoping.
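
If you want the end-of-match value on every row while staying with ALL ROWS PER MATCH, one option is to widen the running measure with an ordinary window function in the outer SELECT. The following is a minimal sketch against the four-row test data above, not a definitive solution; the aliases match_no, match_last_tweet_id and match_last_tweet are made up for illustration.

```sql
with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('elena', 3, 'aaa'),
    ('elena', 4, 'aaaa')
)
select
    userid,
    tweetid,
    tweet,
    match_no,
    first_tweet_id,
    -- the running LAST() grows row by row; take its maximum over the whole match
    max(last_tweet_id) over (partition by userid, match_no) as match_last_tweet_id,
    -- for the tweet text, pick the value on the final row of the match
    last_value(last_tweet) over (
        partition by userid, match_no
        order by tweetid
        rows between unbounded preceding and unbounded following
    ) as match_last_tweet
from tweets
match_recognize(
  partition by userid
  order by tweetid asc
  measures
    match_number() as match_no,
    first(tweetid) as first_tweet_id,
    last(tweetid) as last_tweet_id,
    last(tweet) as last_tweet
  all rows per match
  pattern (similar_before similar_after+)
  define
    similar_before as jarowinkler_similarity(tweet, lead(tweet)) > 90,
    similar_after as jarowinkler_similarity(tweet, lag(tweet)) > 90
);
```

MAX is enough for the ID only because the match is ordered by TWEETID ascending, so the largest running value is also the last one. The SQL standard additionally defines a FINAL keyword for measures; check your platform's documentation before relying on it.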

If we flip back to ONE ROW PER MATCH, however, we do get the results we are expecting.

with tweets(userid, tweetid, tweet) as (
    select * from values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('scott', 3, 'aaaa'),
    ('eva', 4, 'bbbb'),
    ('eva', 5, 'bbbbb'),
    ('amy', 4, 'eeee'),
    ('amy', 5, 'zzzz')
)
SELECT 
 USERID
  , NUM_OF_TWEETS
  , FIRST_TWEET
  , LAST_TWEET
  , FIRST_TWEET_ID
  , LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
  PARTITION BY USERID
  ORDER BY TWEETID ASC
  MEASURES
    match_number() as match_number,
    FIRST(TWEET) AS FIRST_TWEET,
    LAST(TWEET) AS LAST_TWEET,
    FIRST(TWEETID) AS FIRST_TWEET_ID,
    LAST(TWEETID) AS LAST_TWEET_ID,
    COUNT(*) AS NUM_OF_TWEETS
  ONE ROW PER MATCH
  PATTERN (SIMILAR_before SIMILAR_after+)
  DEFINE
    SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
    SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);

USERID  NUM_OF_TWEETS  FIRST_TWEET  LAST_TWEET  FIRST_TWEET_ID  LAST_TWEET_ID
elena   2              aaa          aaaa        1               2
eva     2              bbbb         bbbbb       4               5

Naturally, I was working on this while Simeon was attacking the same problem. I ran into similar issues and noted that the logic is applied to the window frame, so you need to account for certain functions only working from the matched row, which means you would miss the first row, among other things.

I took an old-school approach, nesting views to address the problem incrementally.

Both solve the problem at hand - and while I like the use of MATCH_RECOGNIZE in the provided answer (it's more elegant as a single query), it may be difficult for others to understand.

--
-- create test table
--

create
or replace table tweets (
    userid varchar,
    tweetid integer,
    tweet varchar
);
--
-- create test data
--
insert into
    tweets (userid, tweetid, tweet)
values
    ('elena', 1, 'aaa'),
    ('elena', 2, 'aaaa'),
    ('scott', 3, 'aaaa'),
    ('eva', 4, 'bbbb'),
    ('eva', 5, 'bbbbb'),
    ('amy', 4, 'eeee'),
    ('amy', 5, 'zzzz');
--
-- Baseline view showing matching tweets by user
--
    CREATE
    OR REPLACE VIEW MATCHES AS (
        SELECT
            T1.USERID,
            T1.TWEETID AS TWEETID,
            T2.TWEETID AS MATCHING_TWEETID
        FROM
            TWEETS T1,
            TWEETS T2
        WHERE
            T1.USERID = T2.USERID
            AND JAROWINKLER_SIMILARITY(T1.TWEET, T2.TWEET) > 90
    );


--
-- create a view of non-repeating tweets
--
create or replace view single_tweets as (
        select
            userid,
            tweetid,
            count(*) as num_tweets
        from
            matches
        group by
            userid,
            tweetid
        having
            count(*) = 1
    );
    
    select * from single_tweets;
--
-- Create a view of only repeating tweets by tweetid
--
create
    or replace view repeating_tweets as (
        select
            userid,
            tweetid,
            matching_tweetid
        from
            matches
        where
            (userid, tweetid) not in (
                select
                    userid, tweetid
                from
                    single_tweets
            )
            and (userid,tweetid) not in (
                select
                    userid, tweetid
                from
                    matches
                where
                    matching_tweetid < tweetid
            )
        order by
            tweetid,
            matching_tweetid
    );


--
-- only report repeating tweets
--
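select
    t.userid,
    -- min_by/max_by pick the tweet text with the lowest/highest tweetid;
    -- plain min/max on the text would sort alphabetically instead
    min_by(t.tweet, t.tweetid) as FIRST_TWEET,
    max_by(t.tweet, t.tweetid) as LAST_TWEET,
    min(t.tweetid) as FIRST_TWEETID,
    max(t.tweetid) as LAST_TWEETID,
    count(rt.matching_tweetid) as num_tweets
from
    tweets t,
    repeating_tweets rt
where
    t.userid = rt.userid
    and t.tweetid = rt.matching_tweetid
group by
    t.userid,
    rt.tweetid;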

Results:

USERID  FIRST_TWEET LAST_TWEET  FIRST_TWEETID   LAST_TWEETID    NUM_TWEETS
eva     bbbb        bbbbb       4               5               2
elena   aaa         aaaa        1               2               2
