How do I write a Glue job that produces to a Kafka queue?

I'm attempting to write a fairly simple Glue job that processes a couple thousand Parquet records stored in S3 and submits the wrangled data to a Confluent Cloud Kafka queue. I'm new to working with Spark/Glue, and I'm getting an error I don't quite understand: AttributeError: type object 'Producer' has no attribute '__len__'. I have no idea why len would be called on my Kafka producer. I suspect that creating a Kafka Producer for every record is bad practice, but I'm not sure what my alternative is, since when I left the Producer in the global scope I got a PicklingError. I've looked for a tutorial, but everything I've found covers how to consume from Kafka in Glue, not how to produce to it. Here's my code; I'd really appreciate any help identifying the anti-patterns that are undermining the job.

import json
import sys
import time

import certifi

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

from confluent_kafka import Producer

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

producer_conf = {
    "ssl.ca.location": certifi.where(),
    "bootstrap.servers": "<my-bootstrap-servers>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<my-username>",
    "sasl.password": "<my-password>",
    "session.timeout.ms": 45000
}


def send_to_kafka(row, topic="my-topic", partition_key="recordID"):
    row = row.asDict()
    
    timestamp_millis = int(time.time() * 1000)
    row["timestampInEpoch"] = timestamp_millis
    
    key = gen_key(row, partition_key)
    value = gen_value(row)
    
    global producer_conf
    producer = Producer(producer_conf)
    producer.produce(topic, key=key, value=value)
    
    
def gen_key(row, partition_key):
    key = str(row[partition_key]).encode()
    return key
    
    
def gen_value(row):
    value = json.dumps(row).encode()
    return value
    
    
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={"paths": ["s3://path/to/parquet/files"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)

S3bucket_node1.toDF().foreach(send_to_kafka)

job.commit()

I'm not sure what my alternative is

The alternative is to use PySpark's built-in Kafka data source to write to Kafka, instead of constructing a confluent_kafka Producer inside the Spark job.
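Here's a minimal sketch of that approach, assuming the same topic (my-topic), key column (recordID), and Confluent Cloud SASL settings as in the question (the bootstrap servers and credentials are placeholders). The whole DataFrame is written with Spark's Kafka sink, so no Producer object ever has to be pickled and shipped to the executors. The spark-sql-kafka connector jar has to be available to the Glue job (for example via the --extra-jars job parameter).

from pyspark.sql import functions as F

df = S3bucket_node1.toDF()

# Same timestamp column the original code added per record (second precision here).
df = df.withColumn(
    "timestampInEpoch",
    (F.unix_timestamp(F.current_timestamp()) * 1000).cast("long"),
)

# The Kafka sink expects string/binary "key" and "value" columns.
kafka_df = df.select(
    F.col("recordID").cast("string").alias("key"),
    F.to_json(F.struct(*[F.col(c) for c in df.columns])).alias("value"),
)

(kafka_df.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "<my-bootstrap-servers>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<my-username>" password="<my-password>";',
    )
    .option("topic", "my-topic")
    .save())

If you do want to keep the confluent_kafka client (for example for Schema Registry serializers), the usual fix for both the PicklingError and the per-record Producer is foreachPartition: create one Producer per partition on the executor and flush it before the task finishes. A sketch under the same assumptions, with confluent-kafka installed on the workers (e.g. via --additional-python-modules):

def send_partition_to_kafka(rows, topic="my-topic", partition_key="recordID"):
    # Created on the executor, so nothing Kafka-related is pickled on the driver,
    # and the connection cost is paid once per partition rather than per record.
    producer = Producer(producer_conf)
    for row in rows:
        row = row.asDict()
        row["timestampInEpoch"] = int(time.time() * 1000)
        producer.produce(
            topic,
            key=str(row[partition_key]).encode(),
            value=json.dumps(row).encode(),
        )
    # produce() is asynchronous; block until everything is delivered.
    producer.flush()

S3bucket_node1.toDF().foreachPartition(send_partition_to_kafka)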
