
Word Count with timestamp in Python

This example is taken from Spark's Structured Streaming Programming Guide:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
        .builder \
        .appName("StructuredNetworkWordCount") \
        .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
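# includeTimestamp adds the `timestamp` column that is selected below
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", "true") \
    .load()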

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word"),
    lines.timestamp.alias("time")
)

# Generate running word count
wordCounts = words.groupBy("word").count()  #line to modify

# Start running the query that prints the running counts to the console
query = wordCounts \
      .writeStream \
      .outputMode("complete") \
      .format("console") \
      .start()

query.awaitTermination()

I need to create a table that contains each word together with the time it was received. The output table should look like this:

+-------+--------------------+
|word   |              time  |
+-------+--------------------+
|   car |2021-12-16  12:21:..|
+-------+--------------------+

How can I do this? I think the line marked "#line to modify" is the only one that needs to change.

Try something like this:

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
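The snippet above is Scala. A rough PySpark equivalent might look like the sketch below. The handler name, the two formats, and the `/tmp/...` paths are hypothetical placeholders (the original leaves them elided), and `streaming_df` is assumed to be a streaming DataFrame such as `words` from the question.

```python
def write_batch_to_two_locations(batch_df, batch_id):
    """foreachBatch handler: write one micro-batch to two locations."""
    # Cache the micro-batch so it is not recomputed for the second write.
    batch_df.persist()
    try:
        # Hypothetical formats and paths, chosen only for illustration.
        batch_df.write.format("parquet").mode("append").save("/tmp/word_times/parquet")  # location 1
        batch_df.write.format("json").mode("append").save("/tmp/word_times/json")        # location 2
    finally:
        # Release the cache even if one of the writes fails.
        batch_df.unpersist()


def start_two_location_query(streaming_df):
    """Start a streaming query that hands every micro-batch to the handler."""
    return streaming_df.writeStream.foreachBatch(write_batch_to_two_locations).start()
```

Inside the handler `batch_df` is an ordinary (non-streaming) DataFrame, so the regular batch `write` API applies.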

You can do this:

writeStream
    .format("parquet")        // can be "orc", "json", "csv", etc.
    .option("path", "path/to/destination/dir")
    .start()

Then create an external table that points at that location, setting the path yourself as needed.
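As a sketch of that suggestion in PySpark: the function names, table name, and paths below are hypothetical, and `words` is assumed to be the streaming DataFrame with `word` and `time` columns from the question. The file sink needs a checkpoint location, so one is passed explicitly here.

```python
def start_parquet_sink(words, path, checkpoint):
    """Append each word/time row to Parquet files under `path`."""
    return (
        words.writeStream
        .format("parquet")                         # could also be "orc", "json", "csv"
        .outputMode("append")                      # the file sink only supports append
        .option("path", path)
        .option("checkpointLocation", checkpoint)  # required by the file sink
        .start()
    )


def external_table_ddl(table, path):
    """Build Spark SQL DDL for an external table over the sink directory."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (word STRING, time TIMESTAMP) "
        f"USING parquet LOCATION '{path}'"
    )
```

With a live session, something like `spark.sql(external_table_ddl("word_times", "path/to/destination/dir"))` would register the table, after which it can be queried with ordinary SQL.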

See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch

Delta also writes to a file location:

df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/df/_checkpoints/etl-from-json")
  .start("/delta/df")

You may want to consider "complete" as the output mode.
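One caveat on output modes: Spark accepts "complete" only for queries that contain a streaming aggregation. The table asked for in the question (one row per word with its arrival time) has no aggregation once the `groupBy("word").count()` line is dropped, so "append" is the mode that applies there. A minimal sketch, assuming `words` is the streaming DataFrame built in the question; the function name is hypothetical:

```python
def start_word_time_console(words):
    """Print every (word, time) row to the console as it arrives."""
    return (
        words.writeStream
        .outputMode("append")         # no aggregation, so "complete" would be rejected
        .format("console")
        .option("truncate", "false")  # show the full timestamp instead of "12:21:.."
        .start()
    )
```

In the question's script this would replace the `wordCounts` query, for example `query = start_word_time_console(words)` followed by `query.awaitTermination()`.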

