
Save unmaterialized dataframes across Spark SQL sessions

I am using Spark to analyse data stored on a Cassandra cluster. Within a session this works fine, but in future I would like to be able to connect Tableau to it using their Spark SQL Connector. Because we rely on wide rows / dynamic columns, the data is not stored in Cassandra in a format suitable for direct use as tables for analysis, so I have a series of Spark SQL operations that pivot selected data into a more usable structure.

I would like to be able to store the definition of this pivoted table across Spark sessions, so that it can be picked up by new Spark applications without requiring additional setup, and ideally also used in Tableau. There's lots of documentation on using Hive to save materialised RDDs across sessions, but the dataset is large and changes often. I don't want to cache the calculated dataset; I'd just like to be able to easily re-use its definition.

It's possible Hive doesn't work the way I think it does, but it feels like I'm missing some obvious solution here.

I have a very similar use case with HBase and Qlik Sense, and the same approach will work with Tableau as well. If you really want to solve this using the Spark SQL connector, as far as I know you will need a Spark server, and I'm not sure you want to go that way (though it's possible). In my case I use Hive. As you said, Hive doesn't deal well with updates, but in general you shouldn't refresh these BI tools too often. In our case we recreate the Hive table weekly and update the BI tool weekly; it's also possible to do this daily. I doubt you will ever be able to do it much faster than that, because even with a Spark server you still need to upload the data to the BI tool, which shouldn't be done more than once a day for large data sets.

Anyway, regarding Hive: saving the data in Hive should be simple, and even if the data set is large it should still be much smaller than your Cassandra table, so it should be fine. I would therefore still recommend using Hive as the data holder and using the Tableau Hive connector to load the data.
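
As a rough illustration of the "recreate the Hive table on a schedule" approach, here is a minimal PySpark sketch. It is not taken from my actual job: the keyspace, table and column names (`row_key`, `column_name`, `value`) are placeholders, and it assumes the DataStax Spark Cassandra connector is on the classpath and that Spark is configured with a Hive metastore.

```python
def build_session():
    """Create a Hive-enabled SparkSession (assumes pyspark is installed)."""
    # Imported lazily so the rest of this module loads without Spark.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("pivot-to-hive")   # hypothetical application name
            .enableHiveSupport()        # persist tables in the Hive metastore
            .getOrCreate())


def refresh_pivot_table(spark, keyspace, source_table, target_table):
    """Re-read the wide rows, pivot them, and overwrite the Hive table.

    The column names below are assumptions for illustration only.
    """
    raw = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace=keyspace, table=source_table)
           .load())

    # One output row per row_key, one output column per distinct column_name.
    pivoted = (raw.groupBy("row_key")
               .pivot("column_name")
               .agg({"value": "first"}))

    # Replace the previous snapshot; the table outlives this Spark session.
    pivoted.write.mode("overwrite").saveAsTable(target_table)
    return pivoted
```

You would run something like `refresh_pivot_table(build_session(), "my_keyspace", "wide_rows", "analytics.pivoted")` from a weekly or daily scheduled job, and point the Tableau Hive connector at the resulting table.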


 