
Get data from DB for each row of a DataFrame in PySpark

I'm using the PySpark DataFrame API in a streaming context. In my Spark Streaming application (using a Kafka receiver), I transform the RDD of each DStream batch into a DataFrame. This is what my process-RDD function does:

from pyspark.sql import Row

rowRdd = data_lined_parameters.map(
    lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1], OCCURENCE=x[2],
                  THRESHOLD_HIGH=x[3], OSH=x[4], OSM=x[5], OEH=x[6], OEM=x[7],
                  OSD=x[8], OED=x[9], REMOVE_HOLIDAYS=x[10], TS=x[11],
                  VALUE=x[12], DAY=x[13], WEEKDAY=x[14], HOLIDAY=x[15]))
rawDataDF = sqlContext.createDataFrame(rowRdd)

rawDataRequirementsCheckedDF = rawDataDF.filter(
    "WEEKDAY <= OED AND WEEKDAY >= OSD AND HOLIDAY = false AND VALUE > THRESHOLD_HIGH")
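
For context, here is a minimal sketch of how such a process-RDD function is typically wired into a Spark Streaming job with a receiver-based Kafka stream. The names used here (process_rdd, kafkaStream, the topic, consumer group and ZooKeeper quorum) are assumptions for illustration, not taken from the original application:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="temp-anomaly-stream")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches
sqlContext = SQLContext(sc)

def process_rdd(time, rdd):
    # skip empty micro-batches
    if rdd.isEmpty():
        return
    rowRdd = rdd.map(lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1]))  # ... remaining fields as above
    rawDataDF = sqlContext.createDataFrame(rowRdd)
    # filtering and HBase/Phoenix enrichment go here

# hypothetical Kafka receiver; adjust the ZooKeeper quorum, group and topic to your setup
kafkaStream = KafkaUtils.createStream(ssc, "clustdev1:2181", "anomaly-consumer", {"sensor-data": 1})
kafkaStream.map(lambda kv: kv[1].split(',')).foreachRDD(process_rdd)

ssc.start()
ssc.awaitTermination()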

My next step is to enrich each row of rawDataRequirementsCheckedDF with new columns from an HBase table. My question is: what is the most efficient way to get the data from HBase (via Phoenix) and join it to my original DataFrame?

+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|                 DAY|HOLIDAY|METRIC|OCCURENCE|OED|OEH|OEM|OSD|OSH|OSM|REMOVE_HOLIDAYS|SEN|             SYS|THRESHOLD_HIGH|                  TS|  VALUE|WEEKDAY|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|28.4375|      3|
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|29.4375|      3|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+

The HBase table's primary keys are DAY, SYS and SEN, so the lookup will result in a DataFrame with the same format.

EDIT:

This is what I have tried so far:

from pyspark.sql.functions import col

# collect the SYS values of this micro-batch and push them into the
# Phoenix query as an IN (...) predicate
sysList = rawDataRequirementsCheckedDF.rdd.map(lambda x: "'" + x['SYS'] + "'").collect()
df_sensor = sqlContext.read \
    .format("jdbc") \
    .option("dbtable", "(select DATE,SYSTEMUID,SENSORUID,OCCURENCE from ANOMALY "
                       "where SYSTEMUID in (" + ','.join(sysList) + "))") \
    .option("url", "jdbc:phoenix:clustdev1:2181:/hbase-unsecure") \
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
    .load()
df_anomaly = rawDataRequirementsCheckedDF.join(df_sensor, col("SYS") == col("SYSTEMUID"), 'outer')
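
A possible refinement of that attempt (a sketch of a suggestion, not something from the original post): joining only on SYS multiplies rows whenever a system has several sensors or days in ANOMALY, and an outer join adds null-padded rows where there is no match. Joining on all three key columns with an inner join keeps the one-row-per-reading shape shown above:

from pyspark.sql.functions import col

# join on all three key columns; the names are unique to each side, so col()
# resolves unambiguously (a cast may be needed if DAY and DATE differ in type)
df_anomaly = rawDataRequirementsCheckedDF.join(
    df_sensor,
    (col("SYS") == col("SYSTEMUID")) &
    (col("SEN") == col("SENSORUID")) &
    (col("DAY") == col("DATE")),
    'inner')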

A simple way to bring data from HBase into Spark is to create the table in Phoenix and then load it with the Phoenix Spark plugin, as described in the Apache Spark Plugin section of the Apache Phoenix documentation:

df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "localhost:2181") \
    .load()

Link to Apache Spark Plugin: https://phoenix.apache.org/phoenix_spark.html
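
For completeness, here is a sketch of loading the same ANOMALY table through the plugin instead of plain JDBC. The table and column names (ANOMALY, SYSTEMUID) are taken from the question's EDIT, and the zkUrl is an assumption based on the JDBC URL used there; the result can then be joined exactly as in the refinement above:

from pyspark.sql.functions import col

# load the Phoenix table through the Spark plugin rather than plain JDBC
df_sensor = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "ANOMALY") \
    .option("zkUrl", "clustdev1:2181:/hbase-unsecure") \
    .load()

# restrict the lookup to the systems present in the current micro-batch;
# the Phoenix connector can push this predicate down as a scan filter
sysList = [r['SYS'] for r in rawDataRequirementsCheckedDF.select("SYS").distinct().collect()]
df_sensor_batch = df_sensor.filter(col("SYSTEMUID").isin(sysList))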
