
Get data from DB for each row of a DataFrame in PySpark

I'm using the PySpark DataFrame API in a streaming context. In my Spark Streaming application (using a Kafka receiver), I transform the RDD of each DStream batch into a DataFrame. This is what my process-RDD function does:

from pyspark.sql import Row

rowRdd = data_lined_parameters.map(
    lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1], OCCURENCE=x[2],
                  THRESHOLD_HIGH=x[3], OSH=x[4], OSM=x[5], OEH=x[6], OEM=x[7],
                  OSD=x[8], OED=x[9], REMOVE_HOLIDAYS=x[10], TS=x[11],
                  VALUE=x[12], DAY=x[13], WEEKDAY=x[14], HOLIDAY=x[15]))
rawDataDF = sqlContext.createDataFrame(rowRdd)

rawDataRequirementsCheckedDF = rawDataDF.filter(
    "WEEKDAY <= OED AND WEEKDAY >= OSD AND HOLIDAY = false AND VALUE > THRESHOLD_HIGH")
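
For context, here is a minimal sketch of how such a process-RDD function is typically wired into a Spark Streaming job with a receiver-based Kafka stream. The names used here (process_rdd, kafkaStream, the topic, consumer group and ZooKeeper quorum) are assumptions for illustration, not taken from the original application:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="temp-anomaly-stream")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches
sqlContext = SQLContext(sc)

def process_rdd(time, rdd):
    # skip empty micro-batches
    if rdd.isEmpty():
        return
    rowRdd = rdd.map(lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1]))  # ... remaining fields as above
    rawDataDF = sqlContext.createDataFrame(rowRdd)
    # filtering and HBase/Phoenix enrichment go here

# hypothetical Kafka receiver; adjust the ZooKeeper quorum, group and topic to your setup
kafkaStream = KafkaUtils.createStream(ssc, "clustdev1:2181", "anomaly-consumer", {"sensor-data": 1})
kafkaStream.map(lambda kv: kv[1].split(',')).foreachRDD(process_rdd)

ssc.start()
ssc.awaitTermination()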

My next step is to enrich each row of rawDataRequirementsCheckedDF with new columns from an HBase table. My question is: what is the most efficient way to get the data from HBase (via Phoenix) and join it to my original DataFrame?

+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|                 DAY|HOLIDAY|METRIC|OCCURENCE|OED|OEH|OEM|OSD|OSH|OSM|REMOVE_HOLIDAYS|SEN|             SYS|THRESHOLD_HIGH|                  TS|  VALUE|WEEKDAY|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|28.4375|      3|
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|29.4375|      3|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+

The HBase table's primary keys are DAY, SYS and SEN, so the lookup will result in a DataFrame with the same format.

EDIT:

This is what I have tried so far:

from pyspark.sql.functions import col

# collect the SYS values of this micro-batch and push them into the
# Phoenix query as an IN (...) predicate
sysList = rawDataRequirementsCheckedDF.rdd.map(lambda x: "'" + x['SYS'] + "'").collect()
df_sensor = sqlContext.read \
    .format("jdbc") \
    .option("dbtable", "(select DATE,SYSTEMUID,SENSORUID,OCCURENCE from ANOMALY "
                       "where SYSTEMUID in (" + ','.join(sysList) + "))") \
    .option("url", "jdbc:phoenix:clustdev1:2181:/hbase-unsecure") \
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
    .load()
df_anomaly = rawDataRequirementsCheckedDF.join(df_sensor, col("SYS") == col("SYSTEMUID"), 'outer')
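
A possible refinement of that attempt (a sketch of a suggestion, not something from the original post): joining only on SYS multiplies rows whenever a system has several sensors or days in ANOMALY, and an outer join adds null-padded rows where there is no match. Joining on all three key columns with an inner join keeps the one-row-per-reading shape shown above:

from pyspark.sql.functions import col

# join on all three key columns; the names are unique to each side, so col()
# resolves unambiguously (a cast may be needed if DAY and DATE differ in type)
df_anomaly = rawDataRequirementsCheckedDF.join(
    df_sensor,
    (col("SYS") == col("SYSTEMUID")) &
    (col("SEN") == col("SENSORUID")) &
    (col("DAY") == col("DATE")),
    'inner')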

A simple way to bring data from HBase into Spark is to create the table in Phoenix and then load it with the Phoenix Spark plugin, as described in the Apache Spark Plugin section of the Apache Phoenix documentation:

df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "localhost:2181") \
    .load()

Link to Apache Spark Plugin: https://phoenix.apache.org/phoenix_spark.html
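
For completeness, here is a sketch of loading the same ANOMALY table through the plugin instead of plain JDBC. The table and column names (ANOMALY, SYSTEMUID) are taken from the question's EDIT, and the zkUrl is an assumption based on the JDBC URL used there; the result can then be joined exactly as in the refinement above:

from pyspark.sql.functions import col

# load the Phoenix table through the Spark plugin rather than plain JDBC
df_sensor = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "ANOMALY") \
    .option("zkUrl", "clustdev1:2181:/hbase-unsecure") \
    .load()

# restrict the lookup to the systems present in the current micro-batch;
# the Phoenix connector can push this predicate down as a scan filter
sysList = [r['SYS'] for r in rawDataRequirementsCheckedDF.select("SYS").distinct().collect()]
df_sensor_batch = df_sensor.filter(col("SYSTEMUID").isin(sysList))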
