
Read from Glue catalog using Spark, not a DynamicFrame (GlueContext)

Since our schema is constant, we are using spark.read(), which is much faster than creating a DynamicFrame from options when the data is stored in S3.

Reading data from the Glue catalog using a DynamicFrame takes a lot of time, so we want to use the Spark read API instead: spark.read.format("").option("url", "").option("dbtable", "schema.tablename").load()

What should go in the format and url options, and is anything else required?

Short answer:

If you read/load the data directly using a SparkSession/SparkContext, you'll get a pure Spark DataFrame instead of a DynamicFrame.

Different options when reading with Spark:

  • format: the source format you are reading from, e.g. parquet, csv, json, ...
  • load: the path to the source file(s) you are reading from: it can be a local path, an S3 path, a Hadoop path, ...
  • options: plenty of different options, like inferSchema if you want Spark to do its best and guess the schema from a sample of the data, or header = true for CSV files.

An example:

df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")

No DynamicFrame is created in the previous example, so df will be a DataFrame unless you convert it into a DynamicFrame using the Glue API.


Long answer:

The Glue catalog is essentially an AWS implementation of a Hive metastore. You create a Glue catalog entry by defining a schema, a type of reader, and mappings if required, and it then becomes available to different AWS services like Glue, Athena, or Redshift Spectrum. The main benefit I see in Glue catalogs is the integration with those different AWS services.

I think you can get the most out of data catalogs by using crawlers and the integrations with Athena and Redshift Spectrum, as well as by loading tables into Glue jobs through a unified API.
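On the original question: if the Glue Data Catalog is configured as Spark's Hive metastore (in a Glue job this is the --enable-glue-datacatalog job parameter; on EMR it is a cluster setting), catalog tables can be read straight into a DataFrame without any format/url options at all. A minimal sketch, where my_db and my_table are placeholder names, not something from the question:

```python
from pyspark.sql import SparkSession

# Inside a Glue job or a Glue-catalog-enabled EMR cluster, the catalog
# acts as the Hive metastore, so Hive support must be enabled.
spark = (SparkSession.builder
         .appName("read-from-glue-catalog")
         .enableHiveSupport()
         .getOrCreate())

# "my_db" and "my_table" are placeholders for a catalog database/table.
df = spark.table("my_db.my_table")            # a plain Spark DataFrame
# or, equivalently:
df = spark.sql("SELECT * FROM my_db.my_table")
```

This only works where the catalog integration is enabled; outside of AWS the table lookup will simply fail.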

You can always read directly from different sources and formats using Glue's from_options method; you won't lose any of the great tools Glue has, and it will still be read as a DynamicFrame.
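A sketch of that from_options route (this only runs inside a Glue job, where a GlueContext is available; the S3 path and the CSV options are placeholders for illustration):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read CSV files from S3 straight into a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://path"]},
    format="csv",
    format_options={"withHeader": True},
)

df = dyf.toDF()  # convert to a plain Spark DataFrame if needed
```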

If you don't want to get that data through Glue for any reason, you can specify a DataFrame schema and read directly using a SparkSession, but keep in mind that you won't have access to bookmarks and other Glue tools, although you can still convert the resulting DataFrame into a DynamicFrame.

An example of reading from S3 using Spark directly into a DataFrame (e.g. in Parquet, JSON, or CSV format) would be:

df = spark.read.parquet("s3://path/file.parquet")
df = spark.read.csv("s3a://path/*.csv")
df = spark.read.json("s3a://path/*.json")

That won't create any DynamicFrame; you'll get a pure Spark DataFrame unless you explicitly convert it.

Another way of doing it is using the format() method:

df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("s3://path")

Keep in mind that there are several options, like "header" or "inferSchema" for CSV, and you'll need to decide whether you want to use them. It is best practice to define the schema in production environments instead of using inferSchema, but there are use cases for both.

Furthermore, you can always convert that pure DataFrame to a DynamicFrame if needed, using:

DynamicFrame.fromDF(df, glue_context, "name")  # third argument is a name for the resulting DynamicFrame
