
Get data from DB for each row of a DataFrame (PySpark)

I'm using the PySpark DataFrame API in a streaming context. In my Spark Streaming application (which uses a Kafka receiver) I transform the RDD of each DStream into a DataFrame; this is what I do in my process-RDD function:

# Build Row objects from the parsed records, then turn them into a DataFrame.
rowRdd = data_lined_parameters.map(
    lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1], OCCURENCE=x[2],
                  THRESHOLD_HIGH=x[3], OSH=x[4], OSM=x[5], OEH=x[6], OEM=x[7],
                  OSD=x[8], OED=x[9], REMOVE_HOLIDAYS=x[10], TS=x[11],
                  VALUE=x[12], DAY=x[13], WEEKDAY=x[14], HOLIDAY=x[15]))
rawDataDF = sqlContext.createDataFrame(rowRdd)

rawDataRequirementsCheckedDF = rawDataDF.filter(
    "WEEKDAY <= OED AND WEEKDAY >= OSD AND HOLIDAY = false AND VALUE > THRESHOLD_HIGH")
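For context, the surrounding streaming wiring looks roughly like the sketch below. It is not taken from the original application: the Zookeeper quorum, consumer group, topic name, batch interval, and the parse/process placeholders are assumptions.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="metric-enrichment")
sqlContext = SQLContext(sc)          # used by the per-batch code above
ssc = StreamingContext(sc, 10)       # 10-second micro-batches (placeholder)

# Receiver-based Kafka stream (placeholder ZK quorum, group id and topic).
kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "spark-consumer", {"metrics": 1})

def process_rdd(rdd):
    # Placeholder body: this is where the rowRdd / createDataFrame / filter
    # code shown above runs, once per micro-batch.
    if not rdd.isEmpty():
        print("rows in batch: %d" % rdd.count())

# createStream yields (key, value) pairs; the value is assumed to be a CSV line.
kafkaStream.map(lambda kv: kv[1].split(',')).foreachRDD(process_rdd)

ssc.start()
ssc.awaitTermination()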

My next step is to enrich each row of rawDataRequirementsCheckedDF with new columns from an HBase table. My question is: what is the most efficient way to fetch the data from HBase (Phoenix) and join it to my original DataFrame?

+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|                 DAY|HOLIDAY|METRIC|OCCURENCE|OED|OEH|OEM|OSD|OSH|OSM|REMOVE_HOLIDAYS|SEN|             SYS|THRESHOLD_HIGH|                  TS|  VALUE|WEEKDAY|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|28.4375|      3|
|2017-08-03 00:00:...|  false|  temp|        3|  4| 19| 59|  0|  8|  0|           TRUE|  1|            0201|            26|2017-08-03 16:22:...|29.4375|      3|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+

The HBase table's primary keys are DAY, SYS and SEN, so the lookup will result in a DataFrame with the same format.
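In other words, the enrichment is an equi-join on that composite key. A minimal sketch, assuming the HBase data is already available as a DataFrame (hbaseDF is a hypothetical name) whose key columns are named DAY, SYS and SEN:

# hbaseDF is an assumed DataFrame holding the HBase/Phoenix rows, with its
# key columns named DAY, SYS, SEN to match rawDataRequirementsCheckedDF.
enrichedDF = rawDataRequirementsCheckedDF.join(
    hbaseDF,
    ["DAY", "SYS", "SEN"],  # composite primary key of the HBase table
    "left")                 # keep every streaming row; add HBase columns where they match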

EDIT:

This is what I have tried so far:

# Collect the SYS keys present in this batch and push them into the Phoenix query,
# so only the matching rows are read over JDBC.
sysList = rawDataRequirementsCheckedDF.rdd.map(lambda x: "'" + x['SYS'] + "'").collect()

df_sensor = (sqlContext.read.format("jdbc")
             .option("dbtable",
                     "(select DATE, SYSTEMUID, SENSORUID, OCCURENCE from ANOMALY "
                     "where SYSTEMUID in (" + ','.join(sysList) + "))")
             .option("url", "jdbc:phoenix:clustdev1:2181:/hbase-unsecure")
             .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
             .load())

df_anomaly = rawDataRequirementsCheckedDF.join(
    df_sensor, rawDataRequirementsCheckedDF["SYS"] == df_sensor["SYSTEMUID"], "outer")

A simple way to bring data from HBase is to create the table in Phoenix and then load it into Spark. This is described in the Apache Spark Plugin section of the Apache Phoenix page:

df = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "TABLE1") \
.option("zkUrl", "localhost:2181") \
.load()

Link to the Apache Spark Plugin: https://phoenix.apache.org/phoenix_spark.html
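Connecting this back to the question, a hedged usage sketch: load the anomaly table through the plugin and join it on the composite key. The table name ANOMALY, the zkUrl and the Phoenix column names DATE/SYSTEMUID/SENSORUID come from the EDIT above; the renames are only there to line the columns up with the streaming DataFrame, and the key columns are assumed to have compatible types on both sides.

# Load the Phoenix table via the Spark plugin (zkUrl taken from the EDIT's JDBC URL).
df_sensor = (sqlContext.read
             .format("org.apache.phoenix.spark")
             .option("table", "ANOMALY")
             .option("zkUrl", "clustdev1:2181:/hbase-unsecure")
             .load()
             .withColumnRenamed("DATE", "DAY")
             .withColumnRenamed("SYSTEMUID", "SYS")
             .withColumnRenamed("SENSORUID", "SEN")
             # Rename the payload column so it does not collide with the
             # OCCURENCE column already present in the streaming DataFrame.
             .withColumnRenamed("OCCURENCE", "ANOMALY_OCCURENCE"))

# Left join on the composite primary key: every streaming row is kept and
# enriched with the HBase columns where a match exists.
df_anomaly = rawDataRequirementsCheckedDF.join(df_sensor, ["DAY", "SYS", "SEN"], "left")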
