Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster?
I managed to connect to the EMR master node, but I don't know how to install packages on the cluster.
Here is what I tried:
sc.install_pypi_package("com.databricks.spark.xml")
On the EMR master node:
cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar
Make sure to select the jar that matches your Spark and Scala versions (the `_2.11` suffix in the file name is the Scala version), following the guidelines at https://github.com/databricks/spark-xml.
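If you need a different Scala or package version, the download URL follows the standard Maven Central layout, so you can work it out rather than edit it by hand. Below is a minimal sketch; the helper names `maven_coordinate` and `maven_central_jar_url` are made up for illustration and are not part of spark-xml or EMR, and you still have to check that the version you ask for is actually published.

```python
# Hypothetical helpers (not part of spark-xml or EMR) that build the Maven
# coordinate and the Maven Central jar URL for a Scala-versioned artifact.

def maven_coordinate(group_id, artifact, scala_version, version):
    """Return the groupId:artifactId:version coordinate."""
    return f"{group_id}:{artifact}_{scala_version}:{version}"


def maven_central_jar_url(group_id, artifact, scala_version, version):
    """Return the jar URL under https://repo1.maven.org/maven2/."""
    artifact_id = f"{artifact}_{scala_version}"
    # Maven Central stores the group id as a directory path.
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )


if __name__ == "__main__":
    # Same artifact as in the wget command above.
    print(maven_central_jar_url("com.databricks", "spark-xml", "2.11", "0.9.0"))
    print(maven_coordinate("com.databricks", "spark-xml", "2.11", "0.9.0"))
```

With the values shown, the first function reproduces the URL used in the wget command above. The coordinate form is what Spark expects if you prefer to pull the package through the `spark.jars.packages` setting instead of copying the jar manually.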
Then, launch your Jupyter notebook and you should be able to run the following:
df = spark.read.format('com.databricks.spark.xml').options(rootTag='objects', rowTag='object').load("s3://bucket-name/sample.xml")