ERROR: Mapping failed Error while parsing XML input stream. Current context not Object but root
Parsing Nested XML with PySpark
I have a complex XML file that I need to parse using PySpark. I am using AWS Glue and the Spark framework for this task. I have started on the program and am running into a problem reading the XML file. Could you provide guidance or an example of how to get past this obstacle?
You will find my nested XML file below. This XML file has two top-level elements, PRVDR_INFO and ENROLMENT. Since there are two top-level elements, I am thinking of parsing them into separate DataFrames and creating a foreign key to link them together, forming a parent/child relationship. An example of this would also be helpful, please.
<PROVIDER>
<PRVDR_INFO>
<INDVDL_INFO>
<BIRTH_DT>19831222</BIRTH_DT>
<BIRTH_STATE_CD>VA</BIRTH_STATE_CD>
<BIRTH_STATE_NAME>VIRGINIA</BIRTH_STATE_NAME>
<BIRTH_CNTRY_CD>US</BIRTH_CNTRY_CD>
<BIRTH_CNTRY_NAME>UNITED STATES</BIRTH_CNTRY_NAME>
<BIRTH_FRGN>X</BIRTH_FRGN>
<NAME_LIST>
<INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>LEO</FIRST_NAME>
<LAST_NAME>MESSI</LAST_NAME>
<TRMNTN_DT>2010-12-27T09:43:18.000000000</TRMNTN_DT>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
</INDVDL_NAME>
<INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>LEO</FIRST_NAME>
<MDL_NAME>A</MDL_NAME>
<LAST_NAME>WHITE</LAST_NAME>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</INDVDL_NAME>
</NAME_LIST>
<XX_DEA>
<DEA_NUM>XX0919969</DEA_NUM>
<EFCTV_DT>20030103</EFCTV_DT>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</XX_DEA>
</INDVDL_INFO>
</PRVDR_INFO>
<ENROLMENT>
<ABC_999>
<ENRLMT_INFO>
<ENRLMT_DTLS>
<FORM_TYPE_CD>1111</FORM_TYPE_CD>
<ENRLMT_ID>I3994444141</ENRLMT_ID>
<ENRLMT_STUS_DLTS>
<STUS_CD>06</STUS_CD>
<STUS_DESC>APPROVED</STUS_DESC>
<STUS_DT>2019-09-25T14:11:08.0000000</STUS_DT>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
<ENRLMT_STUS_RSN>
<STUS_RSN_CD>048</STUS_RSN_CD>
<STUS_XXX_DESC>APPROVED AFTER 2nd CONTACT</STUS_XXX_DESC>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</ENRLMT_STUS_RSN>
</ENRLMT_STUS_DLTS>
</ENRLMT_DTLS>
</ENRLMT_INFO>
<PEC_SGNTR>
<CRTFCTN_SGNTR_DT>20101109</CRTFCTN_SGNTR_DT>
<FIRST_NAME>MIKE</FIRST_NAME>
<LAST_NAME>BLACK</LAST_NAME>
<TIN>555669999</TIN>
<TAX_IDENT_TYPE_CD>S</TAX_IDENT_TYPE_CD>
<SGNTR_EFCTV_DT>20101109</SGNTR_EFCTV_DT>
<SGNTR_STUS_CD>9</SGNTR_STUS_CD>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_SGNTR>
</ABC_999>
</ENROLMENT>
</PROVIDER>
To read the XML file, I use the following code:
import sys
import boto3
from datetime import datetime, date, timedelta
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.window import *
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import *
from dateutil.relativedelta import *
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
try:
    # Location of the xml files on S3
    xml_file = "s3://glue-bucket/xml_files/"
    rootTag = "PROVIDER"
    rowTag = "PRVDR_INFO"
    # Read xml files from S3
    df = spark.read \
        .format('xml') \
        .option("rootTag", rootTag) \
        .option("rowTag", rowTag) \
        .load(xml_file)
    df.printSchema()
except Exception as glue_exception_error:
    print("##################### -- Error: " + str(glue_exception_error) + " -- ##########################")
    raise
Error:
##################### -- Error: An error occurred while calling o92.load.
: java.lang.ClassNotFoundException:
Failed to find data source: xml. Please find packages at
https://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:574)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
Thanks in advance.
As you can see in the error, you are probably missing a dependency:
: java.lang.ClassNotFoundException:
Failed to find data source: xml. Please find packages at
https://spark.apache.org/third-party-projects.html
Did you add any jar for parsing XML to your $SPARK_HOME/jars folder? If not, try spark-xml from the Maven Repository. To pick the correct version, click one of the available versions and check which Spark dependencies are listed. Make sure to match your Spark version with one of the spark-xml versions.
Once you have chosen a version, download the .jar file and move it to $SPARK_HOME/jars. Then try running it again.
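Since the question runs on AWS Glue rather than a self-managed cluster, there is no $SPARK_HOME directory to copy the jar into; Glue instead accepts extra jars as a job parameter. A sketch of that route, where the S3 path and the spark-xml artifact version are assumptions (pick the artifact matching your Spark/Scala version):

```
--extra-jars s3://glue-bucket/jars/spark-xml_2.12-0.15.0.jar
```

The jar is uploaded to S3 first, and the `--extra-jars` job parameter points Glue at it when the job starts.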
Hope this helps!