
Running Custom Java Class in PySpark on EMR

I am attempting to use the Cerner Bunsen package for FHIR processing in PySpark on AWS EMR, specifically the Bundles class and its methods. I am creating the Spark session through the Apache Livy API:

import json
import logging

import requests

def create_spark_session(master_dns, kind, jars):
    # 8998 is the port on which the Livy server runs
    host = 'http://' + master_dns + ':8998'
    data = {'kind': kind, 'jars': jars}
    headers = {'Content-Type': 'application/json'}
    response = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
    logging.info(response.json())
    return response.headers

Here kind = pyspark3 and jars is an S3 path that houses the jar (bunsen-shaded-1.4.7.jar).
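
For reference, a minimal sketch of how this helper might be invoked; the master DNS below is a placeholder, and the S3 path mirrors the jar mentioned above:

# Hypothetical invocation of the helper above; values are placeholders.
session_headers = create_spark_session(
    master_dns='ip-10-0-0-1.ec2.internal',
    kind='pyspark3',
    jars='s3://<bucket>/jars/bunsen-shaded-1.4.7.jar'
)
# Livy returns the new session's URL in the Location header
# (e.g. /sessions/0), which later statement requests are posted against.
print(session_headers.get('Location'))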

The data transformation is attempting to import the jar and call the methods via:

# Setting the Spark Session and Pulling the Existing SparkContext
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen.Bundles")
func = sc._gateway.jvm.Bundles()

The error I am receiving is:

"py4j.protocol.Py4JError: An error occurred while calling None.com.cerner.bunsen.Bundles. Trace:\npy4j.Py4JException: Constructor com.cerner.bunsen.Bundles([]) does not exist"

This is the first time I have attempted to use java_import so any help would be appreciated.
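
For context on what py4j is doing here: calling an imported JavaClass from Python invokes a constructor, while static members are reached as plain attribute calls, so the error above indicates that Bundles has no public no-argument constructor. A minimal illustration using only JDK classes, assuming an existing SparkContext sc:

from py4j.java_gateway import java_import

# Calling an imported class invokes its constructor;
# java.util.ArrayList has a no-arg constructor, so this works.
java_import(sc._gateway.jvm, "java.util.ArrayList")
jlist = sc._gateway.jvm.ArrayList()
jlist.add("patient")

# Static methods are attribute calls on the class; no constructor is involved.
millis = sc._gateway.jvm.java.lang.System.currentTimeMillis()

# A class without a matching public constructor fails with the same
# "Constructor ...([]) does not exist" message seen above.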

EDIT: I changed the transformation script slightly and am now seeing a different error. I can see the jar being added in the logs, so I am certain it is there and that the jars parameter of the session request is working as intended. The new transformation is:

# Setting the Spark Session and Pulling the Existing SparkContext
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Manage logging
#sc.setLogLevel("INFO")

# Cerner Bunsen
from py4j.java_gateway import java_import, JavaGateway
java_import(sc._gateway.jvm,"com.cerner.bunsen")
func_main = sc._gateway.jvm.Bundles
func_deep = sc._gateway.jvm.Bundles.BundleContainer

fhir_data_frame = func_deep.loadFromDirectory(spark,"s3://<bucket>/source_database/Patient",1)
fhir_data_frame_fromJson = func_deep.fromJson(fhir_data_frame)
fhir_data_frame_clean = func_main.extract_entry(spark,fhir_data_frame_fromJson,'patient')
fhir_data_frame_clean.show(20, False)

and the new error is:

'JavaPackage' object is not callable

Searching for this error has been a bit futile, but again, if anyone has ideas I will gladly take them.
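
One way to narrow this down: py4j falls back to a JavaPackage object whenever it cannot resolve a dotted name to an actual class, and a JavaPackage cannot be called. A small diagnostic sketch, assuming an existing SparkContext sc, to check whether the class is visible on the driver classpath:

# Check that the Bunsen class is reachable on the driver classpath;
# Class.forName raises ClassNotFoundException if the jar was not picked up.
try:
    sc._jvm.java.lang.Class.forName("com.cerner.bunsen.Bundles")
    print("com.cerner.bunsen.Bundles is on the driver classpath")
except Exception as err:
    print("class not found on the driver classpath:", err)

# Note: java_import expects a fully qualified class name or a package
# wildcard (e.g. "com.cerner.bunsen.Bundles" or "com.cerner.bunsen.*"),
# not a bare package name.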

If you want to use a Scala/Java class from PySpark, you also have to add its jar to the classpath. You can do this in two different ways:

Option 1: Pass it to spark-submit with the --jars flag (the flag must come before the application file):

 spark-submit --jars /path/to/bunsen-shaded-1.4.7.jar example.py

Option 2: Add it to the spark-defaults.conf file via the spark.jars property.

Add the following line to path/to/spark/conf/spark-defaults.conf:

# Comma-separated list of jars to include on the driver and executor classpaths.
spark.jars /path/to/bunsen-shaded-1.4.7.jar
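
Since the question creates its session through Livy rather than spark-submit, the same idea applies to the session-creation payload. A minimal sketch, with a placeholder master DNS and the S3 path from the question; the Livy REST API documents jars as a list of strings, and spark.* properties can also be passed through the conf field:

import json
import requests

master_dns = 'ip-10-0-0-1.ec2.internal'  # placeholder EMR master DNS
jar_path = 's3://<bucket>/jars/bunsen-shaded-1.4.7.jar'  # placeholder jar location

data = {
    'kind': 'pyspark3',
    # 'jars' is documented as a list of strings in the Livy API.
    'jars': [jar_path],
    # Alternatively, spark.jars can be set through 'conf':
    # 'conf': {'spark.jars': jar_path},
}
response = requests.post('http://' + master_dns + ':8998/sessions',
                         data=json.dumps(data),
                         headers={'Content-Type': 'application/json'})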
