Word Count with timestamp in Python
This example is taken from Spark's Structured Streaming Programming Guide:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999.
# includeTimestamp adds a "timestamp" column recording each line's arrival time;
# without it, the reference to lines.timestamp below fails for the socket source.
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", True) \
    .load()
# Split the lines into words, keeping the arrival timestamp with each word
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word"),
    lines.timestamp.alias("time")
)
# Generate running word count
wordCounts = words.groupBy("word").count() #line to modify
# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
I need to produce a table containing each word together with the time it was entered. The output table should look like this:
+-------+--------------------+
|word | time |
+-------+--------------------+
| car |2021-12-16 12:21:..|
+-------+--------------------+
How can I do this? I believe the line marked "#line to modify" is the only one that needs to change.
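One possible modification (a sketch, reusing the question's column names) is to group on both columns, e.g. changing the marked line to `wordCounts = words.groupBy("word", "time").count()`, which keeps an aggregation so "complete" output mode still works. What that aggregation computes can be illustrated in plain Python, with a hypothetical micro-batch standing in for the socket source:

```python
from collections import Counter
from datetime import datetime

# Hypothetical micro-batch: (line, arrival_time) pairs, as the socket
# source with includeTimestamp would deliver them.
batch = [
    ("car bus", datetime(2021, 12, 16, 12, 21, 3)),
    ("car", datetime(2021, 12, 16, 12, 21, 9)),
]

# Equivalent of explode(split(value, " ")), keeping the timestamp per word
words = [(w, t) for line, t in batch for w in line.split(" ")]

# Equivalent of words.groupBy("word", "time").count()
counts = Counter(words)
for (word, time), n in sorted(counts.items()):
    print(word, time, n)
```

Each distinct (word, time) pair gets its own row, so words arriving at different times are counted separately.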
Try something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
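Note that the snippet above is Scala (it comes from the Spark docs); in PySpark the same hook is `writeStream.foreachBatch(func)`, where `func` receives a batch DataFrame and a batch id. The micro-batch callback pattern it relies on can be sketched in plain Python, with hypothetical in-memory lists standing in for the two save locations:

```python
# Two hypothetical sinks standing in for the two save(...) locations.
location_1, location_2 = [], []

def process_batch(batch_rows, batch_id):
    # In Spark, batchDF.persist() caches the batch so both writes reuse
    # the same data; here the list is already materialized in memory.
    location_1.extend(batch_rows)  # write to location 1
    location_2.extend(batch_rows)  # write to location 2

# Simulate the streaming engine invoking the callback once per micro-batch.
for batch_id, batch in enumerate([["car", "bus"], ["car"]]):
    process_batch(batch, batch_id)

print(location_1)  # ['car', 'bus', 'car']
print(location_2)  # ['car', 'bus', 'car']
```

The point of the pattern is that arbitrary batch-writer logic (multiple sinks, caching, custom formats) runs once per micro-batch with the batch's full contents.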
You can also do it like this:
writeStream
    .format("parquet")  // can be "orc", "json", "csv", etc.
    .option("path", "path/to/destination/dir")
    .start()
Then create an external table pointing at that location, managing the path yourself as needed.
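If exact arrival timestamps are too fine-grained for the output table, Spark's `window()` function (from `pyspark.sql.functions`) groups events into fixed intervals, e.g. `words.groupBy(window(words.time, "10 minutes"), words.word).count()`. The bucketing it performs can be illustrated in plain Python (a sketch of a non-overlapping 10-minute window, not the actual Spark implementation):

```python
from collections import Counter
from datetime import datetime

def window_start(ts, minutes=10):
    # Floor the timestamp to the start of its fixed-size window,
    # mirroring window(time, "10 minutes") with no slide/overlap.
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

events = [
    ("car", datetime(2021, 12, 16, 12, 21, 3)),
    ("car", datetime(2021, 12, 16, 12, 27, 40)),
    ("bus", datetime(2021, 12, 16, 12, 33, 5)),
]

# Equivalent of groupBy(window(time, "10 minutes"), word).count()
counts = Counter((window_start(t), w) for w, t in events)
for (start, word), n in sorted(counts.items()):
    print(start, word, n)
```

Both "car" events fall into the 12:20 window and are counted together, while "bus" lands in the 12:30 window.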
Delta also writes to a file location:
df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/df/_checkpoints/etl-from-json")
  .start("/delta/df")
You may also want to consider the "complete" output mode.
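For context, "complete" mode re-emits the entire aggregated result table on every trigger, while "update" mode emits only the rows whose values changed in that trigger. The difference can be sketched in plain Python over two hypothetical micro-batches:

```python
from collections import Counter

# Two hypothetical micro-batches of words.
batches = [["car", "bus"], ["car"]]

running = Counter()
for batch in batches:
    before = dict(running)
    running.update(batch)
    # "complete" mode: emit the entire result table every trigger.
    complete = dict(running)
    # "update" mode: emit only rows whose count changed this trigger.
    update = {w: n for w, n in running.items() if before.get(w) != n}
    print("complete:", complete, "update:", update)
```

After the second batch, "complete" emits both rows (`car: 2, bus: 1`) while "update" emits only `car: 2`, since the count for "bus" did not change.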