简体   繁体   中英

HiveQL query for data marked as table column names

I work in HDP 2.6.5 platformon using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from: https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:

CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING, 
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT, 
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT, 
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT) 
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);

I would like to write a query returning which genre of movies was watched most often by women and which by men? But the problem for me is the structure of the movies table where the movie genre is located:

1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

The last 19 fields are the genres, a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally movies can be in several genres at once. The gender is represented in 'users' table as 'M' or 'F' char. The required tables can be easily joined, but how to return and group the genres which are the columns names?

SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id) 
INNER JOIN users u ON (u.user_id = r.user_id);

Awful data model. You should have a table with one row per movie and genre.

To solve this problem, I would suggest unpivoting to aggregate:

select mg.*
from (select m.movie_id, u.gender, count(*) as cnt,
             rank() over (partition by gender order by count(*) desc) as seqnum
      from ((select movie_id, 'action' from movies where action = 1) union all
            (select movie_id, 'adventure' from movies where adventure = 1) union all
            . . .
           ) m join
           ratings r
           on r.movie_id = m.movie_id join
           users u
           on r.user_id = u.user_id
       group by m.movie_id, u.gender
      ) mg
where seqnum = 1;

Make an array of genre columns placed in the order corresponding to genre_id, explode array and join by position in array with genre table. Like this(not tested):

   select s.title, s.genre, s.gender, s.rating, s.cnt
    from
    (select s.title, s.gender, s.rating, s.cnt, s.genre,
           rank() over (partition by s.gender order by s.cnt desc) as rnk
     from
       (
        select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender) cnt
          from    
            (select m.movie_id, m.title, e.id+1 as genre_id
               from movies m
                  lateral view 
                  posexplode (array(--place columns in a positions corresponding their genre_id
                              unknown, action, adventure, animation, childrens, 
                              comedy, crime, documentary, drama, fantasy, 
                              noir, horror, musical, mystery, romance, 
                              sci_fi, thriller, war, western
                                   )
                             )e as id, val
              where e.val=1
             ) m
             INNER JOIN ratings r ON (m.movie_id = r.movie_id) 
             INNER JOIN users   u ON (u.user_id = r.user_id)
             INNER JOIN genre   g ON (g.genre_id = m.genre_id)
         ) s
    ) s where rnk = 1

          

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM