
Install com.databricks.spark.xml on EMR cluster

Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster?

I succeeded in connecting to the EMR master node, but I don't know how to install packages on the cluster.

Code:

sc.install_pypi_package("com.databricks.spark.xml")
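Note that `sc.install_pypi_package` only installs Python packages from PyPI, while spark-xml is a JVM (Scala) package distributed through Maven coordinates, so this call cannot install it. As an alternative to copying the jar by hand, in a Livy-backed EMR Notebook you can pass the Maven coordinate through the `%%configure` sparkmagic before the Spark session starts (a sketch; adjust the `2.11`/`0.9.0` coordinate to match your Spark and Scala build):

```
%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.11:0.9.0" } }
```

The `-f` flag restarts the session with the new configuration, so run this cell first.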

On the EMR master node:

cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar

Make sure to select the correct jar for your Spark version, following the guidelines at https://github.com/databricks/spark-xml .

Then launch your Jupyter notebook and you should be able to run the following:

df = spark.read.format('com.databricks.spark.xml').options(rootTag='objects').options(rowTag='object').load("s3://bucket-name/sample.xml")
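The `rootTag`/`rowTag` options determine which XML elements become DataFrame rows: each element matching `rowTag` under the `rootTag` element becomes one row, with its child elements as columns. As a plain-Python illustration (no Spark required) of that selection logic, using a hypothetical `sample.xml` structure:

```python
# Sketch of what spark-xml's rootTag/rowTag select, using only the
# standard library: each <object> under <objects> becomes one "row".
import xml.etree.ElementTree as ET

sample = """
<objects>
  <object><id>1</id><name>alpha</name></object>
  <object><id>2</id><name>beta</name></object>
</objects>
"""

root = ET.fromstring(sample)           # rootTag: <objects>
rows = [
    {child.tag: child.text for child in obj}
    for obj in root.iter("object")     # rowTag: <object>
]
print(rows)  # → [{'id': '1', 'name': 'alpha'}, {'id': '2', 'name': 'beta'}]
```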
