
Kafka "partition.assignment.strategy" in PySpark

I am trying to read data from Kafka and convert it into a DataFrame. The current versions of my software are:

  1. spark-2.4.7-bin-hadoop2.7
  2. kafka_2.12-2.7.0

Kafka is running, and I have stored the following data, which I am trying to read:

~/development/kafka_home/kafka_2.13-2.6.0$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testtopic --from-beginning
{"transaction_id": "1", "transaction_card_type": "Visa", "transaction_amount": 181.76, "transaction_datetime": "2021-01-25 15:44:44"}
{"transaction_id": "2", "transaction_card_type": "MasterCard", "transaction_amount": 228.62, "transaction_datetime": "2021-01-25 15:44:45"}
{"transaction_id": "3", "transaction_card_type": "Visa", "transaction_amount": 483.48, "transaction_datetime": "2021-01-25 15:44:46"}
{"transaction_id": "4", "transaction_card_type": "MasterCard", "transaction_amount": 477.87, "transaction_datetime": "2021-01-25 15:44:47"}
{"transaction_id": "5", "transaction_card_type": "MasterCard", "transaction_amount": 304.52, "transaction_datetime": "2021-01-25 15:44:48"}
{"transaction_id": "1", "transaction_card_type": "MasterCard", "transaction_amount": 346.99, "transaction_datetime": "2021-01-25 16:38:44"}
{"transaction_id": "2", "transaction_card_type": "Maestro", "transaction_amount": 384.33, "transaction_datetime": "2021-01-25 16:38:45"}
{"transaction_id": "3", "transaction_card_type": "MasterCard", "transaction_amount": 394.95, "transaction_datetime": "2021-01-25 16:38:46"}
{"transaction_id": "4", "transaction_card_type": "Visa", "transaction_amount": 22.75, "transaction_datetime": "2021-01-25 16:38:47"}
{"transaction_id": "5", "transaction_card_type": "MasterCard", "transaction_amount": 492.01, "transaction_datetime": "2021-01-25 16:38:48"}

I run the following code in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

KAFKA_TOPIC_NAME_CONS = "testtopic"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost:9092'

spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka Demo") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.jars", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.executor.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.driver.extraClassPath", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/kafka-clients-1.1.0.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar") \
    .config("spark.executor.extraLibrary", "/home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-sql-kafka-0-10_2.11-2.4.7.jar") \
    .getOrCreate()

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic").load()
ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query = ds \
  .writeStream \
  .queryName("tableName") \
  .option("partition.assignment.strategy", "range")
  .format("console") \
  .start()

The error I get is:

21/01/25 18:53:41 WARN kafka010.KafkaOffsetReader: Error in attempt 1 getting Kafka offsets: org.apache.kafka.common.config.ConfigException: Missing required configuration "partition.assignment.strategy" which has no default value.

I did some research, and it was suggested that the jar file "kafka-clients-1.1.0.jar" is the problem, but I have tried both the 2.6.0 and the 1.1.0 version with the same result.

**EDIT:**

I added the following to "spark-defaults":

spark.jars /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.executor.extraClassPath /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.driver.extraClassPath /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar
spark.executor.extraLibrary /home/bupry_dev/development/spark_home/spark-2.4.7-bin-hadoop2.7/jars/spark-streaming-kafka-0-10_2.12-2.4.7.jar

and create my session like this:

spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka Demo") \
    .getOrCreate()

I still get the following error:

java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

for this line of code:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "testtopic").load()

As described in the Spark docs, you only need to include the following dependency:

groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.4.7 <-- replace this with your Spark version

Spark warns against adding kafka-clients*.jar directly, because the integration artifact already pulls it in as a transitive dependency, and having multiple jars for the same library on the classpath makes problems harder to diagnose:

Do not manually add dependencies on org.apache.kafka artifacts (e.g. kafka-clients). The spark-streaming-kafka-0-10 artifact has the appropriate transitive dependencies already, and different versions may be incompatible in hard to diagnose ways.
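For reference, here is a minimal sketch of how the session and query could look once the integration jar is resolved through spark.jars.packages instead of hand-copied jar files. The Maven coordinates, the startingOffsets option, and the JSON schema below are assumptions based on the Spark 2.4.7 / Scala 2.11 setup and the sample messages shown in the question (note that the Structured Streaming source behind readStream.format("kafka") is packaged as spark-sql-kafka-0-10), so adjust them to your environment.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed coordinates: let Spark pull the Kafka integration (and the matching
# kafka-clients) from Maven instead of pointing at individual jars on disk.
spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka Demo") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7") \
    .getOrCreate()

# Schema mirroring the sample JSON messages above; the datetime is kept as a
# string here to avoid depending on a particular timestamp format.
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("transaction_card_type", StringType()),
    StructField("transaction_amount", DoubleType()),
    StructField("transaction_datetime", StringType()),
])

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "testtopic") \
    .option("startingOffsets", "earliest") \
    .load()

# The Kafka value column is binary; cast it to a string and parse the JSON.
transactions = df.selectExpr("CAST(value AS STRING) AS value") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

query = transactions \
    .writeStream \
    .queryName("tableName") \
    .format("console") \
    .start()

query.awaitTermination()

Equivalently, the same coordinates can be passed on the command line (spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7 your_script.py), which keeps jar paths out of spark-defaults entirely; with a single consistent set of Kafka jars on the classpath, the partition.assignment.strategy option should no longer be needed.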
