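Finding top 10 trending tweets in Hive
I am trying to find the top 10 trending tweets in Hive based on retweet_count, i.e. the tweet with the highest retweet_count ranks first, and so on.
Here are the details of the election table: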
id bigint from deserializer
created_at string from deserializer
source string from deserializer
favorited boolean from deserializer
retweeted_status struct<text:string,user:struct<screen_name:string,name:string>,retweet_count:int> from deserializer
entities struct<urls:array<struct<expanded_url:string>>,user_mentions:array<struct<screen_name:string,name:string>>,hashtags:array<struct<text:string>>> from deserializer
text string from deserializer
user struct<screen_name:string,name:string,friends_count:int,followers_count:int,statuses_count:int,verified:boolean,utc_offset:int,time_zone:string,location:string> from deserializer
in_reply_to_screen_name string from deserializer
My query:
select text
from election
where retweeted_status.retweet_count IN
(select retweeted_status.retweet_count as zz
from election
order by zz desc
limit 10);
It gives me the same tweet 10 times (TWEET-ABC, TWEET-ABC, TWEET-ABC, ... TWEET-ABC).
So I broke up the nested query and ran the inner query on its own:
select retweeted_status.retweet_count as zz
from election
order by zz desc
limit 10
It returns 10 distinct values (1210, 1209, 1208, 1207, 1206, .... 1201).
After that, I ran the outer query with those values:
select text
from election
where retweeted_status.retweet_count
IN (1210,1209,1208,1207,1206,....1201 );
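The result is again the same tweet 10 times (TWEET-ABC, TWEET-ABC, TWEET-ABC, ... TWEET-ABC).
What is wrong with my query logic?
Instead of using the count, you should use the id. If 100 tweets share the same count, the LIMIT 10 in the subquery does not help: the outer query matches on the count and returns all 100 records.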
select text
from election
where id IN
(select id as zz
from election
order by retweeted_status.retweet_count desc
limit 10);
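The point about tied counts can be checked on a toy table. The following is a minimal sketch, not Hive itself: it uses Python's built-in sqlite3 module, and the flat `election` table with a plain `retweet_count` column is a hypothetical stand-in for the `retweeted_status.retweet_count` struct field. The row values are made up for illustration.

```python
import sqlite3

# Hypothetical flat stand-in for the Hive table: retweet_count is a plain
# column here instead of the retweeted_status.retweet_count struct field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE election (id INTEGER, text TEXT, retweet_count INTEGER)")

# Five tweets tied at the highest count, plus five with lower counts.
rows = [(i, f"tweet-{i}", 500) for i in range(1, 6)]
rows += [(i, f"tweet-{i}", 100 - i) for i in range(6, 11)]
conn.executemany("INSERT INTO election VALUES (?, ?, ?)", rows)

# Filtering on the count: the subquery yields 500 three times,
# and the outer query matches every row whose count is 500.
by_count = conn.execute("""
    SELECT text FROM election
    WHERE retweet_count IN (SELECT retweet_count FROM election
                            ORDER BY retweet_count DESC LIMIT 3)
""").fetchall()

# Filtering on the id: the subquery yields three ids,
# so the outer query returns exactly three rows.
by_id = conn.execute("""
    SELECT text FROM election
    WHERE id IN (SELECT id FROM election
                 ORDER BY retweet_count DESC LIMIT 3)
""").fetchall()

print(len(by_count))  # 5
print(len(by_id))     # 3
```

With ties at the top, the count-based filter returns more rows than the LIMIT suggests, while the id-based filter returns exactly as many rows as the subquery selected, provided the ids are unique.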
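But I am still not sure why you get the wrong results.
Edit (after my comment):
If my comment is correct, the same id appears ten times. In that case, change the subquery to: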
(select distinct id as zz
from election
order by retweeted_status.retweet_count desc
limit 10);
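One possible explanation for the repeated text, offered as an assumption that the question does not confirm, is that the ten rows are separate captures of the same tweet being retweeted, so the text is identical while the count climbs from 1201 to 1210. Under that assumption, grouping by text and keeping the highest count per tweet gives one row per tweet. The sketch below uses Python's built-in sqlite3 module with a hypothetical flat `election` table and made-up rows; in Hive the same idea would presumably group on the relevant text field instead.

```python
import sqlite3

# Hypothetical flat stand-in for the Hive table (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE election (id INTEGER, text TEXT, retweet_count INTEGER)")

# Assumed data: ten captures of one tweet with counts 1201..1210,
# plus two other tweets with lower counts.
rows = [(i, "TWEET-ABC", 1200 + i) for i in range(1, 11)]
rows += [(11, "TWEET-DEF", 900), (12, "TWEET-GHI", 300)]
conn.executemany("INSERT INTO election VALUES (?, ?, ?)", rows)

# The original approach: all ten top counts belong to the same tweet,
# so the same text comes back ten times.
naive = conn.execute("""
    SELECT text FROM election
    WHERE retweet_count IN (SELECT retweet_count FROM election
                            ORDER BY retweet_count DESC LIMIT 10)
""").fetchall()

# One row per distinct text, ranked by its highest observed count.
top = conn.execute("""
    SELECT text, MAX(retweet_count) AS max_retweets
    FROM election
    GROUP BY text
    ORDER BY max_retweets DESC
    LIMIT 10
""").fetchall()

print(naive)
print(top)
```

If the rows turn out to be exact duplicates with the same id instead, the distinct-id approach above is the more direct fix.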