I'm attempting to write a fairly simple Glue job that processes a couple thousand Parquet records stored in S3 and publishes the wrangled data to a Confluent Cloud Kafka topic. I'm new to working with Spark/Glue, and I'm getting an error I don't quite understand:

AttributeError: type object 'Producer' has no attribute '__len__'

I have no idea why anything would be calling len() on my Kafka producer. I'm sure that creating a Kafka Producer for every record is very bad practice, but I'm not sure what my alternative is: when I left the Producer in the global scope, I got a PicklingError. I've looked for a tutorial, but everything I've found covers how to consume from Kafka in Glue, not how to produce. Here's my code; I'd really appreciate any help identifying the anti-patterns that are undermining the job.
import sys
import json
import time
import certifi
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from confluent_kafka import Producer
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
producer_conf = {
    "ssl.ca.location": certifi.where(),
    "bootstrap.servers": "<my-bootstrap-servers>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<my-username>",
    "sasl.password": "<my-password>",
    "session.timeout.ms": 45000,
}
def send_to_kafka(row, topic="my-topic", partition_key="recordID"):
    row = row.asDict()
    timestamp_millis = int(time.time() * 1000)
    row["timestampInEpoch"] = timestamp_millis
    key = gen_key(row, partition_key)
    value = gen_value(row)
    producer = Producer(producer_conf)
    producer.produce(topic, key=key, value=value)

def gen_key(row, partition_key):
    return str(row[partition_key]).encode()

def gen_value(row):
    return json.dumps(row).encode()
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://path/to/parquet/files"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
S3bucket_node1.toDF().foreach(send_to_kafka)
job.commit()
"I'm not sure what my alternative is"

The alternative is to use PySpark's built-in Kafka sink to write to Kafka, instead of constructing a confluent_kafka Producer inside foreach.
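A minimal sketch of that approach, assuming the same DynamicFrame (S3bucket_node1), topic name, and Confluent Cloud credentials as in the question, and assuming the spark-sql-kafka connector is available on the Glue job's classpath (if it isn't, it can be supplied via the job's dependent JARs / --extra-jars setting). Spark's Kafka sink expects a DataFrame with "key" and "value" columns (string or binary), so each row is serialized to JSON first:

```python
from pyspark.sql.functions import col, struct, to_json

df = S3bucket_node1.toDF()

# Build the key/value shape the Kafka data source requires:
# the partition key as a string, and the whole row as a JSON document.
out = df.select(
    col("recordID").cast("string").alias("key"),
    to_json(struct([df[c] for c in df.columns])).alias("value"),
)

# Batch write to Kafka; Spark manages producer connections on the
# executors, so nothing needs to be pickled per row.
(out.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "<my-bootstrap-servers>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<my-username>" password="<my-password>";',
    )
    .option("topic", "my-topic")
    .save())
```

Because the write happens through Spark's own data source, there is no per-record (or per-partition) Producer to create, no manual flush to remember, and no closure serialization problem. Note that the Kafka client properties here use the Java client's names (e.g. sasl.mechanism, singular), which differ slightly from confluent_kafka's (sasl.mechanisms).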