
Word Count using Spark Structured Streaming with Python

I'm very new to Spark. This example is extracted from the Structured Streaming Programming Guide of Spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
            .builder \
            .appName("StructuredNetworkWordCount") \
            .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
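To see what the pipeline above computes, here is a minimal plain-Python sketch of the same transformation on a static list of lines (no Spark required): splitting each line on spaces corresponds to `split` + `explode`, and counting occurrences corresponds to `groupBy("word").count()`. The `lines` sample data is hypothetical.

```python
from collections import Counter

def word_counts(lines):
    """Split each line on spaces and tally every word,
    mirroring split + explode + groupBy("word").count()."""
    counter = Counter()
    for line in lines:
        counter.update(line.split(" "))
    return counter

counts = word_counts(["apache spark", "spark streaming"])
# counts maps each word to its total occurrences, e.g. counts["spark"] == 2
```

The Spark version differs in that it maintains this counter incrementally as new lines arrive on the socket, emitting the full updated table on each trigger because of `outputMode("complete")`.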

I need to modify this code to count only the words that start with the letter "B" and have a count greater than 6. How can I do it?

The solution is:

from pyspark.sql.functions import col

wordCounts = words.groupBy("word").count() \
    .where(col("word").startswith("B") & (col("count") > 6))

Note that the one-liner `'word.startsWith("B")' and 'count > 6'` does not work: in Python, `'a' and 'b'` on two non-empty strings simply evaluates to the second string, so the prefix condition is silently dropped. Use Column expressions as above, or an equivalent SQL string such as `.where("word LIKE 'B%' AND count > 6")`.
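The filtering predicate itself can be illustrated in plain Python on a hypothetical counts table (the word/count values below are made up for the example):

```python
# Hypothetical running counts, as the streaming query would accumulate them
counts = {"Berlin": 9, "Boston": 7, "Bern": 3, "Paris": 12}

# Keep only words starting with "B" whose count exceeds 6 --
# the same predicate the Spark .where(...) applies to the counts table
filtered = {w, c in None} if False else {
    w: c for w, c in counts.items() if w.startswith("B") and c > 6
}
# "Bern" fails the count test, "Paris" fails the prefix test
```

In the streaming job this filter is re-applied on every trigger, so a word such as "Bern" will start appearing in the console output once its running count passes 6.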
