
Extracting “data” from Amazon Ion file

Has anyone worked with Amazon Ion files from the Amazon Quantum Ledger Database (QLDB)? If so, do you know how to extract the "data" part to formulate tables? Maybe use Python to scrape the data? I am trying to get the "data" information from these files, which are stored in S3 (I don't have access to QLDB, so I cannot query it directly), and then upload the results to Glue.

I am trying to perform an ETL job using Glue, but Glue doesn't like Amazon Ion files, so I need to either query the data from these files or scrape the files for the relevant information.

Thanks. PS: by "data" information I mean this:

{
    PersonId:"4tPW8xtKSGF5b6JyTihI1U",
    LicenseNumber:"LEWISR261LL",
    LicenseType:"Learner",
    ValidFromDate:2016-12-20,
    ValidToDate:2020-11-15
}

ref : https://docs.aws.amazon.com/qldb/latest/developerguide/working.userdata.html

Have you tried working with the Amazon Ion library?

Assuming the data mentioned in the question is in a file called "myIonFile.ion", and that the file contains only Ion values, you can read the data from the file as follows:

from amazon.ion import simpleion

with open("myIonFile.ion", "rb") as file:             # open the file in binary mode
    data = file.read()                                # read the raw Ion bytes
iondata = simpleion.loads(data, single_value=False)   # parse every top-level Ion value into a list
print(iondata[0]['PersonId'])                         # prints "4tPW8xtKSGF5b6JyTihI1U"

Further guidance on using the Ion library is provided in the Ion Cookbook.

Besides, I'm unsure about your use case, but interacting with QLDB can also be done via the QLDB driver, which has a direct dependency on the Ion library.
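
For reference, a minimal sketch of querying a ledger with the Python QLDB driver (pyqldb) might look like the following. This only applies if you do have access to the ledger, and the ledger name ("vehicle-registration") and table name ("DriversLicense") are placeholders of my own, not anything from the question:

from pyqldb.driver.qldb_driver import QldbDriver

# Connect to the ledger (placeholder name).
qldb_driver = QldbDriver(ledger_name="vehicle-registration")

def read_documents(transaction_executor):
    # execute_statement returns a cursor of Ion values (the same amazon.ion types as above).
    cursor = transaction_executor.execute_statement("SELECT * FROM DriversLicense")
    for doc in cursor:
        print(doc["LicenseNumber"])

# Run the read inside a transaction.
qldb_driver.execute_lambda(read_documents)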

Nosiphiwe,

AWS Glue is able to read Amazon Ion input. Many other services and applications can't, though, so it's a good idea to use Glue to convert the Ion data to JSON. Note that Ion is a superset of JSON, adding some data types (timestamps, arbitrary-precision decimals, binary data, and so on), so converting Ion to JSON may cause some down-conversion.
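
To see what that down-conversion looks like, here is a rough sketch using the same ion-python library as in the answer above. Using json.dumps(..., default=str) is just one simple way to handle the Ion-only types; it is an illustration of the idea, not what Glue itself does:

import json
from amazon.ion import simpleion

# Two Ion values with no exact JSON equivalent: a timestamp and an arbitrary-precision decimal.
valid_from = simpleion.loads('2016-12-20')   # parsed as an Ion timestamp (datetime-like)
amount = simpleion.loads('1299.99')          # parsed as an Ion decimal (Decimal-like)

# default=str down-converts the Ion-only types to plain JSON strings.
print(json.dumps({'ValidFromDate': valid_from, 'Amount': amount}, default=str))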

One good way to get access to your QLDB documents from the QLDB S3 export is to use Glue to extract the document data, store it in S3 as JSON, and query it with Amazon Athena. The process would go as follows:

  1. Export your ledger data to S3.
  2. Create a Glue crawler to crawl and catalog the exported data.
  3. Run a Glue ETL job to extract the revision data from the export files, convert it to JSON, and write it out to S3.
  4. Create a Glue crawler to crawl and catalog the extracted data.
  5. Query the extracted document revision data using Amazon Athena.
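
For step 5, a minimal boto3 sketch for kicking off an Athena query over the extracted data could look like the following. The database name ("vehicle-registration-json"), table name ("documents"), and output location are placeholders I've assumed, not names the export or crawler will produce:

import boto3

athena = boto3.client("athena")

# Start an Athena query against the Glue catalog table created by the second crawler.
response = athena.start_query_execution(
    QueryString="SELECT data.licensenumber, data.licensetype FROM documents",
    QueryExecutionContext={"Database": "vehicle-registration-json"},
    ResultConfiguration={"OutputLocation": "s3://YOUR_BUCKET_NAME_HERE/athena-results/"},
)
print(response["QueryExecutionId"])   # poll get_query_execution() until the query finishes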

Take a look at the PySpark script below. It extracts just the revision metadata and data payload from the QLDB export files.

The QLDB export records the table mapping for each document, but separately from the revision data. You'll have to do some extra coding to include the table name in your revision data in the output. The code below doesn't do this, so you'll end up with all of your revisions in one table in the output.

Also note that you'll get whatever revisions happen to be in the exported data. That is, you might get multiple document revisions for a given document ID. Depending on your intended use of the data, you may need to figure out how to grab just the latest revision of each document ID; see the sketch after the script for one way to do that.

from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import col
from awsglue.dynamicframe import DynamicFrame

# Initializations
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load data.  'vehicle-registration-ion' is the name of your database in the Glue catalog
# for the export data.  '2020' is the name of your table in the Glue catalog.
dyn0 = glueContext.create_dynamic_frame.from_catalog(database = "vehicle-registration-ion", table_name = "2020", transformation_ctx = "datasource0")

# Only give me exported records with revisions
dyn1 = dyn0.filter(lambda line: "revisions" in line)

# Now give me just the revisions element and convert to a Spark DataFrame.
df0 = dyn1.select_fields("revisions").toDF()

# Revisions is an array, so give me all of the array items as top-level "rows" instead of being a nested array field.
df1 = df0.select(explode(df0.revisions))

# Now I have a list of elements with "col" as their root node and the revision 
# fields ("data", "metadata", etc.) as sub-elements.  Explode() gave me the "col"
# root node and some rows with null "data" fields, so filter out the nulls.
df2 = df1.where(col("col.data").isNotNull())

# Now convert back to a DynamicFrame
dyn2 = DynamicFrame.fromDF(df2, glueContext, "dyn2")

# Prep and send the output to S3
applymapping1 = ApplyMapping.apply(frame = dyn2, mappings = [("col.data", "struct", "data", "struct"), ("col.metadata", "struct", "metadata", "struct")], transformation_ctx = "applymapping1")
datasink0 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://YOUR_BUCKET_NAME_HERE/YOUR_DESIRED_OUTPUT_PATH_HERE/"}, format = "json", transformation_ctx = "datasink0")
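
As a follow-up to the note above about multiple revisions per document, here is a sketch of one way to keep only the latest revision per document ID, continuing from the df2 DataFrame in the script. It assumes the revision metadata exposes "id" and "version" fields, as QLDB revision metadata normally does:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Rank revisions within each document ID by version, newest first, and keep only the top one.
latest_window = Window.partitionBy(col("col.metadata.id")).orderBy(col("col.metadata.version").desc())
df_latest = df2.withColumn("rownum", row_number().over(latest_window)) \
               .where(col("rownum") == 1) \
               .drop("rownum")

# df_latest can then be converted back to a DynamicFrame in place of df2:
# dyn2 = DynamicFrame.fromDF(df_latest, glueContext, "dyn2")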

I hope this helps!
