Finding largest number of location IDs per hour from each zone
I am using Scala with Spark and am having a hard time understanding how to calculate, for each hour, the maximum count of pickups per zone. Currently I have a df with three columns (Location, hour, Zone), where Location is an integer, hour is an integer 0-23 signifying the hour of the day, and Zone is a string. Something like this below:
Location  hour  Zone
97        0     A
49        5     B
97        0     A
10        6     D
25        5     B
97        0     A
97        3     A
What I need to do is find out, for each hour of the day 0-23, which zone has the largest number of pickups from a particular location. So the answer should look something like this:
hour  Zone  max_count
0     A     3
1     B     4
2     A     6
3     D     1
...
23    D     8
What I first tried was an intermediate step to figure out the counts per zone and hour:
val df_temp = df.select("Location", "hour", "Zone")
  .groupBy("hour", "Zone")
  .agg(count($"Location").alias("count"))
This gives me a dataframe that looks like this:
hour  Zone  count
3     A     5
8     B     9
3     B     2
23    F     8
23    A     1
23    C     4
3     D     12
...
I then tried doing the following:
val df_final = df_temp.select("hour", "Zone", "count")
  .groupBy("hour", "Zone")
  .agg(max($"count").alias("max_count"))
  .orderBy($"hour")
This doesn't do anything except group by hour and Zone again, so I still have thousands of rows (df_temp already has one row per hour/Zone pair, so grouping by the same keys leaves the row count unchanged). I also tried:
val df_final = df_temp.select("hour", "Zone", "count")
  .groupBy("hour")
  .agg(max($"count").alias("max_count"))
  .orderBy($"hour")
The above gives me the max count and 24 rows from 0-23, but there is no Zone column. So the answer looks like this:
hour  max_count
0     12
1     15
...
23    8
I would like the Zone column included so I know which zone had the max count for each of those hours. I was also looking into window functions to do a rank, but I wasn't sure how to use them.
After generating the dataframe with per-hour/zone "count", you could generate another dataframe with per-hour "max_count" and join the two dataframes on "hour" and "max_count":
val df = Seq(
  (97, 0, "A"),
  (49, 5, "B"),
  (97, 0, "A"),
  (10, 6, "D"),
  (25, 5, "B"),
  (97, 0, "A"),
  (97, 3, "A"),
  (10, 0, "C"),
  (20, 5, "C")
).toDF("location", "hour", "zone")

// Pickup count per (hour, zone)
val dfC = df.groupBy($"hour", $"zone").agg(count($"location").as("count"))

// Max count per hour (hour renamed to avoid an ambiguous join column)
val dfM = dfC.groupBy($"hour".as("m_hour")).agg(max($"count").as("max_count"))

dfC.
  join(dfM, dfC("hour") === dfM("m_hour") && dfC("count") === dfM("max_count")).
  drop("m_hour", "count").
  orderBy("hour").
  show
// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// | 0| A| 3|
// | 3| A| 1|
// | 5| B| 2|
// | 6| D| 1|
// +----+----+---------+
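Note that the equality join on "count" === "max_count" keeps every zone that ties for an hour's maximum, so an hour can appear more than once in the result.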
Alternatively, you could perform the per-hour/zone groupBy followed by a Window partitioned by "hour" to compute "max_count" for the where condition, as shown below:
import org.apache.spark.sql.expressions.Window

df.
  groupBy($"hour", $"zone").agg(count($"location").as("count")).
  withColumn("max_count", max($"count").over(Window.partitionBy("hour"))).
  where($"count" === $"max_count").
  drop("count").
  orderBy("hour")
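Calling show on this variant should produce the same rows as the join approach above (a sketch of the expected output on the same sample data):

// +----+----+---------+
// |hour|zone|max_count|
// +----+----+---------+
// |   0|   A|        3|
// |   3|   A|        1|
// |   5|   B|        2|
// |   6|   D|        1|
// +----+----+---------+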
You can use Spark window functions for this task.
First, group the data to get the pickup count per hour and zone.
val df = read_df.groupBy("hour", "zone").agg(count("*").as("count_order"))
Then create a window that partitions the data by hour and orders it by the count, descending. You can then calculate the rank over this window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val byZoneName = Window.partitionBy($"hour").orderBy($"count_order".desc)
val rankZone = rank().over(byZoneName)
This will compute the rank of every zone within each hour:
val result_df = df.select($"*", rankZone as "rank")
The output will be something like this:
+----+----+-----------+----+
|hour|zone|count_order|rank|
+----+----+-----------+----+
| 0| A| 3| 1|
| 0| C| 2| 2|
| 0| B| 1| 3|
| 3| A| 1| 1|
| 5| B| 2| 1|
| 6| D| 1| 1|
+----+----+-----------+----+
You can then keep only the rows with rank 1.
result_df.filter($"rank" === 1).orderBy("hour").show()
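Note that rank() keeps every zone that ties for the top count within an hour. If you want exactly one row per hour even when counts tie, a minimal variation (not part of the original answer; the val name single_df is just illustrative) is to use row_number(), which assigns a unique sequence number within each partition:

import org.apache.spark.sql.functions.row_number

// row_number() breaks ties arbitrarily, so each hour keeps exactly one zone
val single_df = df
  .withColumn("rn", row_number().over(byZoneName))
  .filter($"rn" === 1)
  .drop("rn")
  .orderBy("hour")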
You can check my code here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5114666914683617/1792645088721850/4927717998130263/latest.html