
Insert values of Row type into a Hive table in PySpark

I'm new to working with PySpark. I have a function that calculates the max of a query and inserts that max value, which is of type Row, along with two other values: the date and the product name.

def findCount(query, prod_date, prod_name):
    # collect()[0] returns a Row, e.g. Row(max(count)=Decimal('1.0000000000'))
    count = query.agg({"count": "max"}).collect()[0]
    # hc is my HiveContext; I want to insert the actual values here
    reopen = hc.sql("insert into details values('{}', '{}', {})".format(prod_date, prod_name, count))
    print(count)

This is the code which calls the function:

for row in aggs_list:
    prod_date = row.date
    prod_name = row.product_name
    query = prod_load.filter((col("date") == prod_date) & (col("prod_name") == prod_name))
    findCount(query, prod_date, prod_name)

This is what I've tried, and it's not working. Is there a more efficient way to do this?

You should probably stay away from Row types; working with them usually means you have collected all the data to the driver. If so, there is no reason to use Spark, because you are not taking advantage of the parallelized computing environment.

You might be able to accomplish this with Spark SQL:

    max_data = spark.sql("SELECT product_name, MAX(count) AS max_count, product_date "
                         "FROM table GROUP BY product_name, product_date")
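If you prefer the DataFrame API over raw SQL, here is a minimal equivalent sketch, assuming the source table really has product_name, product_date, and count columns (the name "table" is just the placeholder from the line above):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import max as spark_max

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Aggregate once over the whole table instead of filtering and
    # collecting one product/date at a time on the driver.
    max_data = (
        spark.table("table")
             .groupBy("product_name", "product_date")
             .agg(spark_max("count").alias("max_count"))
    )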

As far as inserting into the database goes (I'm guessing you're using Hive, given the hc context), most people would run the job daily and write the result into a date-partitioned table, like this:

First, register a temporary Hive table:

    max_data.registerTempTable("md")

Then overwrite the partition:

    spark.sql("INSERT OVERWRITE TABLE new_table PARTITION(dt=product_date) SELECT * FROM md")
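Note that a static PARTITION clause like PARTITION(dt=...) expects a literal value; to take the partition value from the product_date column you need a dynamic partition insert. Putting the whole job together, here is a hedged end-to-end sketch (table and column names are carried over from the snippets above; new_table is assumed to already exist and be partitioned by dt):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Allow partition values to come from the SELECT itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # One aggregation over the whole source table.
    max_data = spark.sql("""
        SELECT product_name, MAX(count) AS max_count, product_date
        FROM table
        GROUP BY product_name, product_date
    """)

    # createOrReplaceTempView is the newer name for registerTempTable.
    max_data.createOrReplaceTempView("md")

    # For a dynamic insert the partition column must come last in the SELECT.
    spark.sql("""
        INSERT OVERWRITE TABLE new_table PARTITION (dt)
        SELECT product_name, max_count, product_date AS dt
        FROM md
    """)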
