FIRST() and LAST() for MATCH_RECOGNIZE
We are analyzing streaming Twitter data to find users who post similar (almost identical) tweets over and over. I am using MATCH_RECOGNIZE for this. It finds the pattern, but I am not able to get the FIRST() and LAST() values correctly. Here is a sample dataset:
I am using the following query:
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM SCRATCH.SAQIB_ALI.TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
This correctly identifies the users who are posting the same tweets over and over:
But I am not able to get the first tweet and the last tweet in the matching sequence.
There are multiple things at play.
The first is that only one row triggers a match: the first row has no previous row, so LAG(TWEET) is NULL and SIMILAR cannot be satisfied there. FIRST and LAST therefore both refer to the second row of your data. This can be seen by changing to ALL ROWS PER MATCH:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa')
)
SELECT
*
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ALL ROWS PER MATCH
PATTERN (SIMILAR+)
DEFINE
SIMILAR AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 2 | aaaa | 1 | aaaa | aaaa | 2 | 2 | 1 |
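One way to see why only the second row can match is to evaluate the DEFINE expression outside MATCH_RECOGNIZE, with LAG written as an ordinary window function. This is a minimal diagnostic sketch on the same two-row sample; the column aliases PREV_TWEET and SIMILARITY_TO_PREV are made up for illustration.

```sql
-- Diagnostic sketch: show what LAG(TWEET) returns for each row.
-- PREV_TWEET and SIMILARITY_TO_PREV are illustrative aliases only.
with tweets(userid, tweetid, tweet) as (
    select * from values
        ('elena', 1, 'aaa'),
        ('elena', 2, 'aaaa')
)
select
    userid,
    tweetid,
    tweet,
    -- NULL on the first row of each user, because there is no earlier row
    lag(tweet) over (partition by userid order by tweetid) as prev_tweet,
    -- NULL whenever prev_tweet is NULL, so "> 90" cannot be true on that row
    jarowinkler_similarity(
        tweet,
        lag(tweet) over (partition by userid order by tweetid)
    ) as similarity_to_prev
from tweets
order by userid, tweetid;
```

The row with TWEETID 1 has no previous tweet to compare against, so a pattern defined only in terms of LAG can never start on it.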
If you change the pattern so that it also catches the first row, by using LEAD as well as LAG:
ALL ROWS PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
you now match both the first and the later rows:
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 1 | aaa | 1 | aaa | aaa | 1 | 1 | 1 |
elena | 2 | aaaa | 1 | aaa | aaaa | 1 | 2 | 2 |
Now if we expand the test a little with four rows of data:
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('elena', 3, 'aaa'),
('elena', 4, 'aaaa')
)
USERID | TWEETID | TWEET | MATCH_NUMBER | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID | NUM_OF_TWEETS |
---|---|---|---|---|---|---|---|---|
elena | 1 | aaa | 1 | aaa | aaa | 1 | 1 | 1 |
elena | 2 | aaaa | 1 | aaa | aaaa | 1 | 2 | 2 |
elena | 3 | aaa | 1 | aaa | aaa | 1 | 3 | 3 |
elena | 4 | aaaa | 1 | aaa | aaaa | 1 | 4 | 4 |
We see that the rows are not double-registered: all four fall into a single match.
But we also see that the first ID is correct on every row, while the last ID is evaluated only up to the current matched row, so it is not the last row of the whole match as you were hoping.
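If the per-row output of ALL ROWS PER MATCH is still wanted, one possible workaround is to take only MATCH_NUMBER() from MATCH_RECOGNIZE and compute the match-wide values with ordinary window functions over the result. This is a sketch that was not part of the original answer: the CTE name MATCHED is made up, and it assumes TWEETID is unique within a user so that ordering by it is unambiguous.

```sql
-- Sketch: match-wide first/last values on every row of an ALL ROWS PER MATCH result.
-- Assumes TWEETID is unique per USERID; MATCHED is an illustrative CTE name.
with tweets(userid, tweetid, tweet) as (
    select * from values
        ('elena', 1, 'aaa'),
        ('elena', 2, 'aaaa'),
        ('elena', 3, 'aaa'),
        ('elena', 4, 'aaaa')
), matched as (
    select *
    from tweets
    match_recognize(
        partition by userid
        order by tweetid asc
        measures
            match_number() as match_number
        all rows per match
        pattern (similar_before similar_after+)
        define
            similar_before as jarowinkler_similarity(tweet, lead(tweet)) > 90,
            similar_after as jarowinkler_similarity(tweet, lag(tweet)) > 90
    )
)
select
    userid,
    tweetid,
    tweet,
    match_number,
    -- explicit full frame, so every row sees the whole match
    first_value(tweet) over (
        partition by userid, match_number order by tweetid
        rows between unbounded preceding and unbounded following
    ) as first_tweet,
    last_value(tweet) over (
        partition by userid, match_number order by tweetid
        rows between unbounded preceding and unbounded following
    ) as last_tweet,
    min(tweetid) over (partition by userid, match_number) as first_tweet_id,
    max(tweetid) over (partition by userid, match_number) as last_tweet_id,
    count(*) over (partition by userid, match_number) as num_of_tweets
from matched
order by userid, tweetid;
```

Partitioning by both USERID and MATCH_NUMBER matters because match numbers restart within each partition, so MATCH_NUMBER alone does not identify a match.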
If we flip back to ONE ROW PER MATCH, however, we do get the results we are expecting.
with tweets(userid, tweetid, tweet) as (
select * from values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz')
)
SELECT
USERID
, NUM_OF_TWEETS
, FIRST_TWEET
, LAST_TWEET
, FIRST_TWEET_ID
, LAST_TWEET_ID
FROM TWEETS
MATCH_RECOGNIZE(
PARTITION BY USERID
ORDER BY TWEETID ASC
MEASURES
match_number() as match_number,
FIRST(TWEET) AS FIRST_TWEET,
LAST(TWEET) AS LAST_TWEET,
FIRST(TWEETID) AS FIRST_TWEET_ID,
LAST(TWEETID) AS LAST_TWEET_ID,
COUNT(*) AS NUM_OF_TWEETS
ONE ROW PER MATCH
PATTERN (SIMILAR_before SIMILAR_after+)
DEFINE
SIMILAR_before AS JAROWINKLER_SIMILARITY(TWEET, LEAD(TWEET)) > 90,
SIMILAR_after AS JAROWINKLER_SIMILARITY(TWEET, LAG(TWEET)) > 90
);
USERID | NUM_OF_TWEETS | FIRST_TWEET | LAST_TWEET | FIRST_TWEET_ID | LAST_TWEET_ID |
---|---|---|---|---|---|
elena | 2 | aaa | aaaa | 1 | 2 |
eva | 2 | bbbb | bbbbb | 4 | 5 |
Naturally, I was working on this while Simeon was attacking the same problem. I ran into similar issues and noticed that the logic is applied to the window frame, so you need to account for certain functions working only from the matched row, which means you miss the first row, and so on.
I took an old-school approach, nesting views to address the problem incrementally.
Both solve the problem at hand - and while I like the use of MATCH_RECOGNIZE in the provided answer (it's more elegant as a single query), it may be difficult for others to understand.
--
-- create test table
--
create
or replace table tweets (
userid varchar,
tweetid integer,
tweet varchar
);
--
-- create test data
--
insert into
tweets (userid, tweetid, tweet)
values
('elena', 1, 'aaa'),
('elena', 2, 'aaaa'),
('scott', 3, 'aaaa'),
('eva', 4, 'bbbb'),
('eva', 5, 'bbbbb'),
('amy', 4, 'eeee'),
('amy', 5, 'zzzz');
--
-- Baseline view showing matching tweets by user
--
CREATE
OR REPLACE VIEW MATCHES AS (
SELECT
T1.USERID,
T1.TWEETID AS TWEETID,
T2.TWEETID AS MATCHING_TWEETID
FROM
TWEETS T1,
TWEETS T2
WHERE
T1.USERID = T2.USERID
AND JAROWINKLER_SIMILARITY(T1.TWEET, T2.TWEET) > 90
);
--
-- create a view of non-repeating tweets
--
create or replace view single_tweets as (
select
userid,
tweetid,
count(*) as num_tweets
from
matches
group by
userid,
tweetid
having
count(*) = 1
);
select * from single_tweets;
--
-- Create a view of only repeating tweets by tweetid
--
create
or replace view repeating_tweets as (
select
userid,
tweetid,
matching_tweetid
from
matches
where
(userid, tweetid) not in (
select
userid, tweetid
from
single_tweets
)
and (userid,tweetid) not in (
select
userid, tweetid
from
matches
where
matching_tweetid < tweetid
)
order by
tweetid,
matching_tweetid
);
--
-- only report repeating tweets
--
select
t.userid,
-- pick the tweet text by tweetid order rather than alphabetical order
min_by(t.tweet, t.tweetid) as FIRST_TWEET,
max_by(t.tweet, t.tweetid) as LAST_TWEET,
min(t.tweetid) as FIRST_TWEETID,
max(t.tweetid) as LAST_TWEETID,
count(rt.matching_tweetid) as num_tweets
from
tweets t,
repeating_tweets rt
where
t.userid = rt.userid
and t.tweetid = rt.matching_tweetid
group by
t.userid,
rt.tweetid;
Results:
USERID | FIRST_TWEET | LAST_TWEET | FIRST_TWEETID | LAST_TWEETID | NUM_TWEETS |
---|---|---|---|---|---|
eva | bbbb | bbbbb | 4 | 5 | 2 |
elena | aaa | aaaa | 1 | 2 | 2 |