
Iterate the records in a Spark DataFrame in Python

The code snippet looks like this:

initial_load = hc.sql('select * from products_main')
grouped_load = initial_load.groupBy("product_name", "date", "hour").count()

grouped_load gives this output:

product_name  hour  date        count
abc           12    2016-06-13  4
cde           13    2016-07-17  5
dfg           12    2016-10-13  7

Now my aim is to iterate over each product name in grouped_load and, from the initial load, retrieve the max and min values of price for each group.

How do I iterate over the records?

You can do something like the following, assuming that your initial load has a price field:

from pyspark.sql import functions as F

min_max_df = initial_load.groupBy("product_name", "date", "hour").agg(F.min("price"), F.max("price"))
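
If you also need to loop over the aggregated rows on the driver, which is the literal "iterate" part of the question, here is a minimal sketch assuming the same products_main schema with a price column (min_price and max_price are illustrative aliases, not column names from the question):

from pyspark.sql import functions as F

# Count, min and max computed in a single pass per group
stats = (initial_load
         .groupBy("product_name", "date", "hour")
         .agg(F.count("*").alias("count"),
              F.min("price").alias("min_price"),
              F.max("price").alias("max_price")))

# toLocalIterator() streams rows to the driver one partition at a time,
# instead of materializing the whole result at once like collect() does
for row in stats.toLocalIterator():
    print(row["product_name"], row["min_price"], row["max_price"])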
Try the code below (not compiled, so check the syntax):

initial_load = hc.sql('select * from products_main')
grouped_load = initial_load.groupBy("product_name", "date", "hour").count()
grouped_load2 = hc.sql(
    'select product_name, min(price) as min_price, max(price) as max_price '
    'from products_main group by product_name')

final_data = grouped_load.join(grouped_load2, on='product_name')
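
An alternative that avoids the join entirely is a window function partitioned by product_name; this is a sketch under the same schema assumptions, not something from the original answers:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach the per-product min and max price to every row of the initial load
w = Window.partitionBy("product_name")
with_stats = (initial_load
              .withColumn("min_price", F.min("price").over(w))
              .withColumn("max_price", F.max("price").over(w)))
with_stats.show(5)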
