
Iterate the records in a Spark DataFrame in Python

The code snippet looks like this:

initial_load = hc.sql('select * from products_main')
grouped_load = initial_load.groupBy("product_name", "date", "hour").count()

grouped_load gives this output:

product_name  hour  date        count
abc           12    2016-06-13  4
cde           13    2016-07-17  5
dfg           12    2016-10-13  7

Now my aim is to iterate over each product name in grouped_load and, from the initial load, retrieve the max and min values of price for each group.

How do I iterate over the records?

You can do something like the following, assuming that your initial load has a price field:

from pyspark.sql import functions as F

min_max_df = initial_load.groupBy("product_name", "date", "hour").agg(F.min("price"), F.max("price"))
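
If you also need to loop over the aggregated rows on the driver, which is the literal "iterate" part of the question, here is a minimal sketch assuming the same products_main schema with a price column (min_price and max_price are illustrative aliases, not column names from the question):

from pyspark.sql import functions as F

# Count, min and max computed in a single pass per group
stats = (initial_load
         .groupBy("product_name", "date", "hour")
         .agg(F.count("*").alias("count"),
              F.min("price").alias("min_price"),
              F.max("price").alias("max_price")))

# toLocalIterator() streams rows to the driver one partition at a time,
# instead of materializing the whole result at once like collect() does
for row in stats.toLocalIterator():
    print(row["product_name"], row["min_price"], row["max_price"])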
Try the code below (not compiled, so check the syntax):

initial_load = hc.sql('select * from products_main')
grouped_load = initial_load.groupBy("product_name", "date", "hour").count()
grouped_load2 = hc.sql(
    'select product_name, min(price) as min_price, max(price) as max_price '
    'from products_main group by product_name')

final_data = grouped_load.join(grouped_load2, on='product_name')
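
An alternative that avoids the join entirely is a window function partitioned by product_name; this is a sketch under the same schema assumptions, not something from the original answers:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Attach the per-product min and max price to every row of the initial load
w = Window.partitionBy("product_name")
with_stats = (initial_load
              .withColumn("min_price", F.min("price").over(w))
              .withColumn("max_price", F.max("price").over(w)))
with_stats.show(5)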
