
Unable to load the data from xml using pyspark

I downloaded the data and extracted it in Jupyter with the following command, then tried to load the XML file into a Spark DataFrame:

!7z x stackoverflow.com-Posts.7z -oposts
# Load the XML file into a Spark DataFrame.
posts = spark.read.format("xml").option("rowTag", "row").load("./posts/Posts.xml")

I got the following error:

Py4JJavaError: An error occurred while calling o532.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Spark does not ship with an XML data source, so you need to pass the spark-xml JAR to the SparkContext.

JAR path: https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar

pyspark --jars /home/Downloads/spark_jars/spark-xml_2.11-0.9.0.jar

df = spark.read.format("com.databricks.spark.xml").option("rowTag", "row").load("./posts/Posts.xml")
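
Since the question runs inside Jupyter rather than the pyspark shell, the same package can probably also be requested when the session is built. The following is a minimal sketch, not a tested recipe: it assumes the Scala version (2.11) and package version (0.9.0) from the JAR URL above match your Spark build, and the helper names `spark_xml_coordinate` and `build_session`, as well as the app name, are hypothetical. It uses the `spark.jars.packages` setting, which asks Spark to fetch the package from Maven instead of pointing at a local JAR file.

```python
# Assumed values, taken from the JAR URL above; adjust them to match your Spark build.
SCALA_VERSION = "2.11"
SPARK_XML_VERSION = "0.9.0"


def spark_xml_coordinate(scala_version=SCALA_VERSION, package_version=SPARK_XML_VERSION):
    """Return the Maven coordinate (group:artifact:version) for spark-xml."""
    return f"com.databricks:spark-xml_{scala_version}:{package_version}"


def build_session(app_name="posts-xml"):
    """Create a SparkSession that asks Spark to fetch spark-xml from Maven.

    app_name is a hypothetical name chosen for this sketch.
    """
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", spark_xml_coordinate())
        .getOrCreate()
    )


# Usage in a notebook (requires pyspark and network access to Maven):
# spark = build_session()
# posts = (
#     spark.read.format("com.databricks.spark.xml")
#     .option("rowTag", "row")
#     .load("./posts/Posts.xml")
# )
```

If a `spark` session already exists in the notebook, `getOrCreate()` returns that session, and the package setting generally only takes effect when it is supplied before the JVM starts, so restarting the kernel first may be necessary.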
