
How can I use the AWS Glue XML classifier?

I am trying to use an AWS Glue classifier to discover the schema for a set of XML files. Each file is stored in an S3 bucket like so:

s3://bucket/name_of_dataset/dataset.xml

There is only one XML file per dataset, so there is no partitioning. I routinely pull these into Spark using spark-xml by simply specifying the row tag. However, when I try to do something similar in AWS Glue with an XML classifier, the dataset ends up in the Glue Catalog with an "unknown" classification. One dataset does show up (each XML dataset has a different schema), but its schema appears to have been "discovered" from a nested row tag rather than the row tag I specified.

To be more concrete, if I store this file at s3://mybucket/experiment/experiment.xml, what should I specify as the row tag (which appears to be the only argument)? Is there a better place to go for support?

<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET>
  <EXPERIMENT xmlns="" alias="GSM1627835" accession="SRX913316" center_name="GEO">
    <IDENTIFIERS>
      <PRIMARY_ID>SRX913316</PRIMARY_ID>
      <SUBMITTER_ID namespace="GEO">GSM1627835</SUBMITTER_ID>
    </IDENTIFIERS>
    <TITLE>GSM1627835: Human_normal_blsatoyst_MethylC-seq_1; Homo sapiens; Bisulfite-Seq</TITLE>
    <STUDY_REF accession="SRP064113">
      <IDENTIFIERS>
        <PRIMARY_ID>SRP064113</PRIMARY_ID>
        <EXTERNAL_ID namespace="BioProject">PRJNA296788</EXTERNAL_ID>
      </IDENTIFIERS>
    </STUDY_REF>
    <DESIGN>
      <DESIGN_DESCRIPTION/>
      <SAMPLE_DESCRIPTOR accession="SRS868521">
        <IDENTIFIERS>
...

Thanks in advance.

We had a similar issue with our XML source and worked through it with AWS technical support. There appears to be a bug in the XML crawler: if an XML attribute has an empty value (in the example you have given, the value of xmlns is ""), the crawler seems to skip the classifier you have defined and defaults to a row tag that most likely comes from a nested element in the XML.
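To check whether a file contains the kind of empty attribute value described above, a minimal sketch along these lines may help. It uses only the Python standard library; the function name find_empty_attributes and the shortened sample document are illustrative assumptions, not part of Glue or of the original file. Namespace processing is deliberately left off so that xmlns="" is reported like any other attribute.

```python
import xml.parsers.expat


def find_empty_attributes(xml_text):
    """Return (tag, attribute) pairs whose attribute value is an empty string."""
    hits = []

    def on_start(tag, attrs):
        # attrs maps attribute names to values for the element being opened
        for name, value in attrs.items():
            if value == "":
                hits.append((tag, name))

    # No namespace_separator is passed, so expat does not interpret
    # xmlns declarations and hands them over as plain attributes.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return hits


# Shortened stand-in for the file in the question (hypothetical content).
sample = """<EXPERIMENT_SET>
  <EXPERIMENT xmlns="" alias="GSM1627835" accession="SRX913316" center_name="GEO">
    <TITLE>example</TITLE>
  </EXPERIMENT>
</EXPERIMENT_SET>"""

print(find_empty_attributes(sample))
```

If the list is non-empty, the file may be affected by the behaviour described here; whether removing the empty attribute before crawling is an acceptable workaround depends on your data.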

They are working on a fix, which is likely to be released this week or next.
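For reference, here is a rough sketch of how the classifier itself could be defined through boto3 once the crawler honours it. It assumes that EXPERIMENT (the repeating element under EXPERIMENT_SET in the sample from the question) is the row you want, which is our guess rather than something AWS confirmed. The classifier name "experiment-xml", the classification label "experiment", and both helper functions are hypothetical.

```python
def build_xml_classifier_request(name, classification, row_tag):
    """Build keyword arguments for the Glue create_classifier call."""
    return {
        "XMLClassifier": {
            "Name": name,                      # hypothetical classifier name
            "Classification": classification,  # label written to the catalog
            "RowTag": row_tag,                 # element name, without angle brackets
        }
    }


def register_classifier(request):
    """Send the request to Glue; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    glue = boto3.client("glue")
    glue.create_classifier(**request)


# Assumption: EXPERIMENT is the row element for the sample in the question.
request = build_xml_classifier_request("experiment-xml", "experiment", "EXPERIMENT")
print(request)
```

Calling register_classifier(request) would create the classifier; it still has to be attached to the crawler before the crawler will use it.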

Hope this helps.
