How to convert Spark Streaming data into a Spark DataFrame
Convert Spark Structured Streaming DataFrames to a Pandas DataFrame
I have a Spark Streaming app set up to consume from a Kafka topic, and I need to use some APIs that take a Pandas DataFrame, but when I try to convert it I get:
: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.completeString(QueryExecution.scala:219)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:202)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:62)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2832)
at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2809)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Here is my Python code:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("sparkDf to pandasDf")\
    .getOrCreate()

sparkDf = spark.readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "kafkahost:9092")\
    .option("subscribe", "mytopic")\
    .option("startingOffsets", "earliest")\
    .load()

pandas_df = sparkDf.toPandas()  # this is the call that raises the AnalysisException

query = sparkDf.writeStream\
    .outputMode("append")\
    .format("console")\
    .option("truncate", "false")\
    .trigger(processingTime="5 seconds")\
    .start()
query.awaitTermination()
Now, I know I'm creating another instance of a streaming DataFrame, but no matter where I put start() and awaitTermination(), I get the same error.

Any ideas?
TL;DR: An operation like this simply cannot work.
Now, I know I'm creating another instance of a streaming DataFrame
Well, the problem is that you really don't. toPandas, called on a DataFrame, creates a simple, local, non-distributed Pandas DataFrame in the memory of the driver node.

Not only does it have nothing to do with Spark, but as an abstraction it is inherently incompatible with Structured Streaming: a Pandas DataFrame represents a fixed set of tuples, while Structured Streaming represents an infinite stream of tuples.
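If the goal is simply to run Pandas code over the stream's contents, one workaround (not mentioned in the answer, offered here as a sketch) is foreachBatch, available from Spark 2.4: each micro-batch is handed to a callback as a static DataFrame, on which toPandas() is legal. The host, topic, and the body of the callback below are illustrative assumptions; the Spark wiring is kept inside a function since it needs a live Spark installation.

```python
def process_batch(batch_df, batch_id):
    # batch_df is a static (non-streaming) DataFrame here, so toPandas() works
    pandas_df = batch_df.selectExpr("CAST(value AS STRING)").toPandas()
    # placeholder for whatever Pandas-based processing is actually needed
    print("batch %d: %d rows" % (batch_id, len(pandas_df)))

def run_stream():
    # Requires pyspark and a reachable Kafka broker; imported lazily so the
    # callback above can be exercised without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("sparkDf to pandasDf") \
        .getOrCreate()

    source = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafkahost:9092") \
        .option("subscribe", "mytopic") \
        .option("startingOffsets", "earliest") \
        .load()

    source.writeStream \
        .foreachBatch(process_batch) \
        .trigger(processingTime="5 seconds") \
        .start() \
        .awaitTermination()
```

The key design point is that foreachBatch moves the streaming/batch boundary into the callback, so the "queries with streaming sources" restriction no longer applies inside it.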
It's not clear what you are trying to achieve here, and it may be an XY problem, but if you really need to use Pandas with Structured Streaming, you can try pandas_udf: the SCALAR and GROUPED_MAP variants are compatible at least with basic time-based triggers (other variants may be supported as well, though some combinations obviously make no sense, and I am not aware of any official compatibility matrix).
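A rough sketch of the GROUPED_MAP route: the per-group logic is an ordinary Pandas-to-Pandas function, shown standalone below so it can be tested without Spark. The column names (key, value, length) and the derived-column computation are illustrative assumptions, not from the question.

```python
import pandas as pd

def add_value_length(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one group's rows as a pandas DataFrame, returns a pandas
    # DataFrame; here it just adds a derived column as an example.
    return pdf.assign(length=pdf["value"].str.len())

# Wiring it into a streaming query (Spark 2.3/2.4-era pandas_udf API) would
# look roughly like this; it needs a live SparkSession, so it is commented out:
#
# from pyspark.sql.functions import pandas_udf, PandasUDFType
# schema = "key string, value string, length int"
# grouped = pandas_udf(add_value_length, schema, PandasUDFType.GROUPED_MAP)
# result = streaming_df.groupBy("key").apply(grouped)
```

Unlike toPandas(), this keeps the computation distributed: Spark converts each group to Pandas on the executors, so no unbounded stream ever has to be collected to the driver.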
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to republish, please credit this site or the original source. For any questions, contact yoyou2525@163.com.