Get data from DB for each row of a DataFrame in PySpark
I am using the PySpark DataFrame API in a streaming context. In my Spark Streaming application (using a Kafka receiver), I convert the RDD of each DStream into a DataFrame via foreachRDD. This is what I do in my process-RDD function:
rowRdd = data_lined_parameters.map(
    lambda x: Row(SYS=x[0], METRIC='temp', SEN=x[1], OCCURENCE=x[2],
                  THRESHOLD_HIGH=x[3], OSH=x[4], OSM=x[5], OEH=x[6], OEM=x[7],
                  OSD=x[8], OED=x[9], REMOVE_HOLIDAYS=x[10], TS=x[11],
                  VALUE=x[12], DAY=x[13], WEEKDAY=x[14], HOLIDAY=x[15]))
rawDataDF = sqlContext.createDataFrame(rowRdd)
rawDataRequirementsCheckedDF = rawDataDF.filter(
    "WEEKDAY <= OED AND WEEKDAY >= OSD AND HOLIDAY = false AND VALUE > THRESHOLD_HIGH")
My next step is to enrich every row of rawDataRequirementsCheckedDF with new columns from an HBase table. My question is: what is the most efficient way to fetch the data from HBase (via Phoenix) and join it to my original DataFrame?
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
| DAY|HOLIDAY|METRIC|OCCURENCE|OED|OEH|OEM|OSD|OSH|OSM|REMOVE_HOLIDAYS|SEN| SYS|THRESHOLD_HIGH| TS| VALUE|WEEKDAY|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
|2017-08-03 00:00:...| false| temp| 3| 4| 19| 59| 0| 8| 0| TRUE| 1|0201| 26|2017-08-03 16:22:...|28.4375| 3|
|2017-08-03 00:00:...| false| temp| 3| 4| 19| 59| 0| 8| 0| TRUE| 1|0201| 26|2017-08-03 16:22:...|29.4375| 3|
+--------------------+-------+------+---------+---+---+---+---+---+---+---------------+---+----------------+--------------+--------------------+-------+-------+
The HBase table's primary key is (DAY, SYS, SEN), so it will yield a DataFrame with the same format.
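Conceptually, the enrichment is an equi-join on the composite key (DAY, SYS, SEN). A minimal pure-Python sketch of that lookup (plain dicts rather than Spark; the HBase-side column name OCCURENCE_DB and the sample values are made up for illustration):

```python
# Raw rows from the streaming DataFrame side (sample values, illustrative only).
raw_rows = [
    {"DAY": "2017-08-03", "SYS": "0201", "SEN": "1", "VALUE": 28.4375},
    {"DAY": "2017-08-03", "SYS": "0201", "SEN": "2", "VALUE": 29.4375},
]

# Rows fetched from HBase/Phoenix; OCCURENCE_DB is a hypothetical column name.
hbase_rows = [
    {"DAY": "2017-08-03", "SYS": "0201", "SEN": "1", "OCCURENCE_DB": 3},
]

# Index the HBase rows by the composite primary key (DAY, SYS, SEN).
by_key = {(r["DAY"], r["SYS"], r["SEN"]): r for r in hbase_rows}

# Left join: keep every raw row, attach the matching HBase column if present.
enriched = []
for row in raw_rows:
    match = by_key.get((row["DAY"], row["SYS"], row["SEN"]))
    merged = dict(row)
    merged["OCCURENCE_DB"] = match["OCCURENCE_DB"] if match else None
    enriched.append(merged)

print(enriched[0]["OCCURENCE_DB"])  # 3
print(enriched[1]["OCCURENCE_DB"])  # None
```

In Spark this corresponds to a left join on the three key columns; the sketch just makes the per-row lookup semantics explicit.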
EDIT:
Here is what I have tried so far:
sysList = rawDataRequirementsCheckedDF.map(lambda x: "'" + x['SYS'] + "'").collect()
df_sensor = sqlContext.read.format("jdbc") \
    .option("dbtable", "(select DATE,SYSTEMUID,SENSORUID,OCCURENCE from ANOMALY "
                       "where SYSTEMUID in (" + ','.join(sysList) + "))") \
    .option("url", "jdbc:phoenix:clustdev1:2181:/hbase-unsecure") \
    .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
    .load()
df_anomaly = rawDataRequirementsCheckedDF.join(df_sensor, col("SYS") == col("SYSTEMUID"), 'outer')
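The collected SYS values are spliced into the SQL string by hand. A small helper for building that quoted, de-duplicated IN list (plain Python; the helper name build_in_clause is made up, and this simple quoting is only safe for trusted key values, since SQL built by string concatenation is otherwise open to injection):

```python
def build_in_clause(values):
    """Build a quoted, de-duplicated SQL IN list like "'0201','0305'".

    Single quotes inside values are doubled, the standard SQL escape.
    Only use with trusted key values; prefer bind parameters where possible.
    """
    unique = sorted(set(values))
    return ",".join("'" + v.replace("'", "''") + "'" for v in unique)

print(build_in_clause(["0201", "0201", "0305"]))  # '0201','0305'
```

De-duplicating before the join keeps the generated subquery short when many rows share the same SYS value.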
One simple way I pull data from HBase is to create the table in Phoenix and then load it into Spark. This is from the Apache Spark plugin section of the Apache Phoenix page:
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "localhost:2181") \
    .load()
Link to the Apache Spark plugin: https://phoenix.apache.org/phoenix_spark.html