
How to write a SQL query for multiple Inner Join?

A sample record:

    Row(user_id='KxGeqg5ccByhaZfQRI4Nnw', gender='male', year='2015', month='September', day='20',
        hour='16', weekday='Sunday', reviewClass='place love back', business_id='S75Lf-Q3bCCckQ3w7mSN2g',
        business_name='Notorious Burgers', city='Scottsdale',
        categories='Nightlife, American (New), Burgers, Comfort Food, Cocktail Bars, Restaurants, Food, Bars, American (Traditional)',
        user_funny='1', review_sentiment='Positive', friend_id='my4q3Sy6Ei45V58N2l8VGw')

This table has more than 100 million records. My SQL query should do the following:

For a particular user visiting a specific business, select the most frequent review_sentiment among that user's friends (friend_id) and the most frequent gender among those friends.

Every friend_id is itself a user_id in the same table.

Example Scenario:

  • One user
  • Has Visited 4 Businesses
  • Has 10 friends
  • 5 of these friends have visited Businesses 1 and 2, the other 5 have visited only Business 3, and none have visited Business 4
  • The first 5 friends have more positive than negative sentiments for B1 and more negative than positive for B2; the other 5 are all negative for B3

I want the following output for this:

**user_id | business_id | friend_common_sentiment | mostCommonGender | .... otherCols**

user_id_1 | business_id_1 | positive | male | .... otherCols
user_id_1 | business_id_2 | negative | female | .... otherCols
user_id_1 | business_id_3 | negative | female | .... otherCols

Here's a simple query I wrote for this in PySpark:

SELECT user_id, gender, year, month, day, hour, weekday, reviewClass, business_id, business_name, city, 
categories, user_funny, review_sentiment FROM events1 GROUP BY user_id, friend_id, business_id ORDER BY 
COUNT(review_sentiment) DESC LIMIT 1

This query does not give the expected result, and I'm not sure how exactly to fit an INNER JOIN into it.

Man, does that data structure make things hard. But let's break it down into steps:

  1. You need to self join to get the data for friends
  2. Once you have the data for friends, perform aggregate functions to get counts of each possible value, grouping by the user and the business
  3. Wrap the above in a sub-query in order to choose between the values based on their counts.

I'm just going to call your table "tags". Sadly, just like in real life, we can't assume everyone has friends, and since you didn't say to exclude the forever-alone crowd, we need a left join to keep users without friends. The join would be as follows:

From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
    and friends.business_id = user.business_id
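To see what this self join returns, here is a minimal sketch that runs it on a handful of made-up rows. It uses SQLite through Python's `sqlite3` module purely as a stand-in for Spark SQL, and the table is cut down to the five columns the join needs. All the ids and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tags (user_id text, gender text, business_id text, "
    "review_sentiment text, friend_id text)"
)
rows = [
    # u1 reviewed b1 and has two friends, f1 and f2 (one row per friend)
    ("u1", "male", "b1", "Positive", "f1"),
    ("u1", "male", "b1", "Positive", "f2"),
    # f1 also reviewed b1
    ("f1", "female", "b1", "Negative", "u1"),
    # f2 reviewed a different business, so f2 will not match on b1
    ("f2", "male", "b2", "Positive", "u1"),
    # u2 has no friends at all (friend_id is NULL)
    ("u2", "female", "b1", "Positive", None),
]
conn.executemany("insert into tags values (?, ?, ?, ?, ?)", rows)

joined = conn.execute("""
    select user.user_id, user.friend_id, friends.gender, friends.review_sentiment
    from tags as user
    left outer join tags as friends on user.friend_id = friends.user_id
        and friends.business_id = user.business_id
    order by user.user_id, user.friend_id
""").fetchall()

for row in joined:
    print(row)
```

The rows for u1 show both outcomes: the f1 row carries f1's gender and sentiment, while the f2 row has NULLs because f2 never reviewed b1. The friendless u2 is kept too, also with NULLs, which is the point of using a left join here.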

Next you have to figure out what the most common gender and review sentiment are for a given user and business combination. This is where the data structure really kicks us in the butt. We could do this in one step with some clever window functions, but I want this answer to be easily understood, so I'm going to use a sub-query and case statements. For the sake of simplicity I'm assuming only two gender values; if your data has more, follow the same pattern with one count per value.

select user.user_id, user.business_id
-- gender is lowercase in the sample record ('male'), and string comparison is case-sensitive
, sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
, sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
  and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id
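One thing to watch with these counts: if the table really holds one row per (review, friend_id) pair, as the sample record suggests, then a friend's single review shows up once for each of that friend's own friends, and the join counts it that many times. The sketch below shows the effect on made-up data and one possible fix, joining against a de-duplicated view of the friends side. It uses SQLite through Python's `sqlite3` as a stand-in for Spark SQL. The ids, the `friends_source` placeholder, and the choice of `select distinct` columns are my assumptions, not something from your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tags (user_id text, gender text, business_id text, "
    "review_sentiment text, friend_id text)"
)
conn.executemany("insert into tags values (?, ?, ?, ?, ?)", [
    # u1 reviewed b1 and has three friends
    ("u1", "male", "b1", "Positive", "f1"),
    ("u1", "male", "b1", "Positive", "f2"),
    ("u1", "male", "b1", "Positive", "f3"),
    # f1 has three friends, so f1's single review of b1 appears three times
    ("f1", "female", "b1", "Negative", "u1"),
    ("f1", "female", "b1", "Negative", "x1"),
    ("f1", "female", "b1", "Negative", "x2"),
    # f2 and f3 have one friend each, so their reviews appear once
    ("f2", "male", "b1", "Positive", "u1"),
    ("f3", "female", "b1", "Positive", "u1"),
])

# Same aggregation as above, with the friends side left as a placeholder.
# The filter is on user_id here only to keep the toy output to one row.
COUNTS = """
    select user.user_id, user.business_id
    , sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
    , sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
    from tags as user
    left outer join {friends_source} as friends on user.friend_id = friends.user_id
        and friends.business_id = user.business_id
    where user.user_id = 'u1'
    group by user.user_id, user.business_id
"""

# Joining against the raw table: f1's review is counted three times.
raw = conn.execute(COUNTS.format(friends_source="tags")).fetchone()

# Joining against one row per (friend, business, gender, sentiment).
deduped = conn.execute(COUNTS.format(
    friends_source="(select distinct user_id, business_id, gender, review_sentiment from tags)"
)).fetchone()

print("raw:    ", raw)
print("deduped:", deduped)
```

In this toy data the raw join reports more negative than positive friends for u1, while the de-duplicated join reports the opposite, so the duplication can flip the majority. Note that `select distinct` also collapses repeat reviews with the same sentiment by the same friend into a single vote, which may or may not be what you want.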

Now we just have to grab data from the sub-query and make some decisions. You may want to add some extra branches, for instance for users with no friends, or for friends evenly split on gender or sentiment. That's the same pattern as below, just with extra values to choose from.

select user_id
, business_id
, case when MaleFriends > FemaleFriends then 'male' else 'female' end as MostCommonGender
, case when FriendsPositive > FriendsNegative then 'Positive' else 'Negative' end as MostCommonSentiment
from (
    select user.user_id, user.business_id
    -- gender is lowercase in the sample record ('male'), and string comparison is case-sensitive
    , sum(case when friends.gender = 'male' then 1 else 0 end) as MaleFriends
    , sum(case when friends.gender = 'female' then 1 else 0 end) as FemaleFriends
    , sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
    , sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
    from tags as user
    left outer join tags as friends on user.friend_id = friends.user_id
        and friends.business_id = user.business_id
    where user.business_id = <<your business id here>>
    group by user.user_id, user.business_id
) as a
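As a rough sketch of the extra branches for ties and for users with no matching friends, the snippet below applies longer case expressions to a small stand-in table of already-computed counts. In the real query these would come from the sub-query instead. It runs on SQLite through Python's `sqlite3` rather than Spark, and the `counts` table, the ids, and the 'Tie' and 'No friends' labels are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the output of the aggregating sub-query
conn.execute(
    "create table counts (user_id text, business_id text, MaleFriends int, "
    "FemaleFriends int, FriendsPositive int, FriendsNegative int)"
)
conn.executemany("insert into counts values (?, ?, ?, ?, ?, ?)", [
    ("u1", "b1", 3, 1, 2, 2),  # clear gender majority, sentiment tie
    ("u2", "b1", 0, 0, 0, 0),  # no friends reviewed this business
    ("u3", "b1", 1, 4, 1, 4),  # clear majority for both
])

result = conn.execute("""
    select user_id, business_id
    , case when MaleFriends + FemaleFriends = 0 then 'No friends'
           when MaleFriends > FemaleFriends then 'male'
           when MaleFriends < FemaleFriends then 'female'
           else 'Tie' end as MostCommonGender
    , case when FriendsPositive + FriendsNegative = 0 then 'No friends'
           when FriendsPositive > FriendsNegative then 'Positive'
           when FriendsPositive < FriendsNegative then 'Negative'
           else 'Tie' end as MostCommonSentiment
    from counts
    order by user_id
""").fetchall()

for row in result:
    print(row)
```

The zero check comes first in each case expression so that a user with no matching friends is not reported as a tie.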

This gives you the steps to follow, and hopefully a clear explanation on how they work. Good luck!


 