Word Count using Spark Structured Streaming with Python
I'm very new to Spark. This example is extracted from the Structured Streaming Programming Guide of Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
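For intuition: the pipeline above splits each input line on spaces, explodes the result into one row per word, and maintains a running count per word. Ignoring the streaming machinery, the same computation can be sketched in plain Python (the function name and sample data here are illustrative, not part of the Spark API):

```python
from collections import Counter

def running_word_count(lines):
    """Mimic split + explode + groupBy("word").count() on a batch of lines."""
    counts = Counter()
    for line in lines:
        # split(lines.value, " ") then explode(...): one row per word
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)

batch = ["Big data is Big", "Big Spark"]
print(running_word_count(batch))  # {'Big': 3, 'data': 1, 'is': 1, 'Spark': 1}
```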
I need to modify this code to count only the words that start with the letter "B" and have a count greater than 6. How can I do it?
The solution is:
Note that in the original one-liner, Python's `and` between two string literals simply evaluates to the second string, so `.where('word.startsWith("B")' and 'count > 6')` only applies the count filter. Combine both conditions in a single SQL expression instead, reusing the `words` DataFrame from the question:

wordCounts = words.groupBy("word").count() \
    .where("word LIKE 'B%' AND count > 6")

Equivalently, with Column expressions:

from pyspark.sql.functions import col

wordCounts = words.groupBy("word").count() \
    .where(col("word").startswith("B") & (col("count") > 6))
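To sanity-check the intended predicate (word starts with "B" and its count exceeds 6), here is the same filter expressed in plain Python over a dict of counts (the helper name and sample data are illustrative only):

```python
def filter_counts(word_counts, prefix="B", min_count=6):
    """Keep words starting with `prefix` whose count is strictly greater than `min_count`."""
    return {w: c for w, c in word_counts.items()
            if w.startswith(prefix) and c > min_count}

counts = {"Berlin": 9, "Boston": 6, "Bern": 7, "Paris": 12}
print(filter_counts(counts))  # {'Berlin': 9, 'Bern': 7}
```

Note that "more than 6" is a strict inequality, so "Boston" with exactly 6 occurrences is excluded.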