I'm using Databricks Community Edition. I created a temporary view:
%python
df.createOrReplaceTempView("athlete_events_csv")
The query I'm writing:
with medal_count_by_country as
(SELECT NOC, Year, count(*) as medal_count, row_number() over( partition by NOC order by Year) as year_count
FROM athlete_events_csv
WHERE Medal in ('Gold', 'Silver', 'Bronze')
GROUP BY NOC, Year)
SELECT m1.NOC, m1.Year, m1.medal_count, m1.year_count, m2.year_count, ((m1.medal_count - m2.medal_count)/m1.medal_count)*100 as percentage_increase
FROM medal_count_by_country m1 left join medal_count_by_country m2
ON m1.NOC = m2.NOC AND m1.Year = m2.Year and m1.year_count-1 = m2.year_count
Can anyone please explain why m2.year_count is showing as NULL?
I have a dataset of athlete event details broken down by country, year, etc. I'm trying to get the year-over-year (YOY) percentage increase in medal winners.
This looks like data from the Summer Olympics? They're only held every 4 years, so you probably need your join to be m1.year_count-4 = m2.year_count
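The condition in the join clause is wrong in this case. The part that causes the issue is m1.Year = m2.Year.
You are trying to join the data frames by year and then by a row_number ordered by the same Year column. Within one NOC, two rows with the same Year always have the same year_count, so m1.year_count-1 = m2.year_count can never be true at the same time as m1.Year = m2.Year. That is why you get only null values: there is nothing to return from the right side of the left join.
Remove that part and you will receive proper results.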
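As a rough sketch of the corrected join, the snippet below runs the same query shape with m1.Year = m2.Year removed. It makes several assumptions that are not in the original post: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs anywhere, the rows are made-up toy data, and the column aliases cur_rank and prev_rank are only there for illustration. It also multiplies by 100.0 before dividing, because SQLite would otherwise do integer division (Spark SQL's / already returns a fractional result).

```python
import sqlite3

# Toy rows standing in for athlete_events_csv (made-up data, illustration only).
rows = [
    ("USA", 2000, "Gold"),
    ("USA", 2000, "Silver"),
    ("USA", 2004, "Gold"),
    ("USA", 2004, "Gold"),
    ("USA", 2004, "Bronze"),
    ("USA", 2004, None),  # no medal, filtered out by the WHERE clause
    ("FRA", 2000, "Bronze"),
    ("FRA", 2004, "Gold"),
    ("FRA", 2004, "Silver"),
]

# In-memory SQLite database used here instead of a Databricks temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete_events_csv (NOC TEXT, Year INTEGER, Medal TEXT)")
conn.executemany("INSERT INTO athlete_events_csv VALUES (?, ?, ?)", rows)

# Same CTE as in the question; the join no longer requires m1.Year = m2.Year,
# so each row is matched to the previous Games of the same NOC.
query = """
WITH medal_count_by_country AS (
    SELECT NOC, Year, COUNT(*) AS medal_count,
           ROW_NUMBER() OVER (PARTITION BY NOC ORDER BY Year) AS year_count
    FROM athlete_events_csv
    WHERE Medal IN ('Gold', 'Silver', 'Bronze')
    GROUP BY NOC, Year
)
SELECT m1.NOC, m1.Year, m1.medal_count,
       m1.year_count AS cur_rank, m2.year_count AS prev_rank,
       (m1.medal_count - m2.medal_count) * 100.0 / m1.medal_count
           AS percentage_increase
FROM medal_count_by_country m1
LEFT JOIN medal_count_by_country m2
  ON m1.NOC = m2.NOC AND m1.year_count - 1 = m2.year_count
ORDER BY m1.NOC, m1.Year
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With this join, only the first Games of each NOC should still show NULL for the previous year_count, which is expected because there is no earlier row to compare against.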