
Referencing Databricks Tables in Notebooks

I'm curious if there's a way to reference Databricks tables without importing them to every Databricks notebook.

Here's what I normally do:

'''

# Load the required tables (a similar load + temp view pair is repeated for every table, e.g. Occ_Events)
df1 = spark.read.load("dbfs:/hive_metastore/cadors_basic_event")
# Convert the dataframe to a temporary view for SQL processing
df1.createOrReplaceTempView('Event')

# Perform join to create master table
master_df = spark.sql("""
  SELECT O.CADORSNUMBER, O.EVENT_CD, O.EVENT_SEQ_NUM,
    E.EVENT_NAME_ENM, E.EVENT_NAME_FNM, E.EVENT_DESCRIPTION_ETXT, E.EVENT_DESCRIPTION_FTXT,
    E.EVENT_GROUP_TYPE_CD, O.DATE_CREATED_DTE, O.DATE_LAST_UPDATE_DTE
  FROM Occ_Events O INNER JOIN Event E
    ON O.EVENT_CD = E.EVENT_CD
  ORDER BY O.CADORSNUMBER""")

'''

However, I also remember that in SQL Server Management Studio you could easily reference these tables and their fields directly in a query, without the "import" step I showed above. For example:

'''

SELECT occ.cadorsnumber,
       occ_evt.event_seq_num, occ_evt.event_cd,
       evt.event_name_enm, evt.event_group_type_cd,
       evt_grp.event_group_type_elbl
   FROM cadorsstg.occurrence_information occ
       JOIN cadorsstg.occurrence_events occ_evt ON (occ_evt.cadorsnumber = occ.cadorsnumber)
       JOIN cadorsstg.ta003_event evt ON (evt.event_cd = occ_evt.event_cd)
       JOIN cadorsstg.ta012_event_group_type evt_grp ON (evt_grp.event_group_type_cd = evt.event_group_type_cd)
 WHERE occ.date_deleted_dte IS NULL AND occ_evt.date_deleted_dte IS NULL
 ORDER BY occ.cadorsnumber, occ_evt.event_seq_num;

'''

The way I do it currently is not really scalable and gets very tedious when I'm working with multiple tables. If there's a better way to do this, I'd highly appreciate any tips/advice.

I've tried using SELECT/USE SCHEMA (database name), but that didn't work.
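Roughly what I tried (just a sketch; the schema name is taken from the SSMS example above):

'''

# attempt: switch the default schema, then reference tables by name
spark.sql("USE cadorsstg")
df = spark.sql("SELECT * FROM occurrence_information LIMIT 10")

'''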

I agree with David; there are several ways to do this, and you are conflating a few concepts. I am going to add some links for you to study.

1 - Data is stored in files. The storage can be either remote or local. To use remote storage, I suggest mounting it, since a mount lets older Python libraries access the storage through a file path. Only Spark and utilities such as dbutils.fs understand storage URLs directly.

https://www.mssqltips.com/sqlservertip/7081/transform-raw-file-refined-file-microsoft-azure-databricks-synapse/
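For example, here is a minimal mount sketch for an Azure storage account. The account, container and secret scope names are made up for illustration; they are not from the original post.

# mount remote Azure storage so it looks like a local path (illustrative names only)
configs = {
  "fs.azure.account.key.mystorageacct.blob.core.windows.net":
    dbutils.secrets.get(scope="my-scope", key="storage-key")
}

# after this, /mnt/raw can be used by Spark and by plain python libraries
dbutils.fs.mount(
  source = "wasbs://raw@mystorageacct.blob.core.windows.net",
  mount_point = "/mnt/raw",
  extra_configs = configs
)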

2 - Data engineering is used to join and transform input files into a new output file. The spark.read and DataFrame write APIs are key to reading and writing files using the power of the cluster.

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure

This same processing can be done with plain Python libraries, but it will not leverage the power of the worker nodes; it will run only on the driver node. Please look into the high-level design of a cluster.
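For contrast, here is a rough sketch of the same kind of read done with pandas. It runs only on the driver and assumes the sample dataset is a single csv file reachable through the /dbfs FUSE path.

# pandas runs on the driver only - no work is distributed to the workers
import pandas as pd

# assumption: the dataset is a single csv file reachable via the /dbfs mount
low_df = pd.read_csv("/dbfs/databricks-datasets/weather/low_temps")
print(low_df.head())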

3 - Data engineering can be done with dataframes. But this means you have to get very good at the methods associated with the object.

https://docs.microsoft.com/en-us/azure/databricks/getting-started/spark/dataframes

In the example below, I read in two sample data files, join them, remove a duplicate column, and save the result as a new file.

A - The read-files code is the same for both design patterns (DataFrame methods and Spark SQL).

# read in low temps
path1 = "/databricks-datasets/weather/low_temps"
df1 = (
  spark.read                    
  .option("sep", ",")        
  .option("header", "true")
  .option("inferSchema", "true")  
  .csv(path1)               
)

# read in high temps
path2 = "/databricks-datasets/weather/high_temps"
df2 = (
  spark.read                    
  .option("sep", ",")        
  .option("header", "true")
  .option("inferSchema", "true")  
  .csv(path2)               
)

B - The data engineering code uses DataFrame methods.

# rename columns - file 1
df1 = df1.withColumnRenamed("temp", "low_temp")
    
# rename columns - file 2
df2 = df2.withColumnRenamed("temp", "high_temp")
df2 = df2.withColumnRenamed("date", "date2")

# join + drop col
df3 = df1.join(df2, df1["date"] == df2["date2"]).drop("date2")

# show top 5 rows
display(df3.limit(5))

C - The write-files code is the same for both design patterns (DataFrame methods and Spark SQL).

Now that the data frame (df3) has our data, we write it to storage. The /lake/bronze directory is on local (DBFS) storage; it is a make-believe data lake.

# How many partitions?
df3.rdd.getNumPartitions()

# Write out csv file with 1 partition
dst_path = "/lake/bronze/weather/temp"
(
  df3.repartition(1).write
    .format("parquet")
    .mode("overwrite")
    .save(dst_path)
)

4 - Data engineering can be done with Spark SQL. But this means you have to expose the datasets as temporary views. Both steps A + C are the same.

B.1 - This code exposes the dataframes as temporary views.

# create temp view
df1.createOrReplaceTempView("tmp_low_temps")

# create temp view
df2.createOrReplaceTempView("tmp_high_temps")

B.2 - This code replaces the DataFrame methods with Spark SQL.

# make sql string
sql_stmt = """
  select 
    l.date as obs_date,
    h.temp as obs_high_temp,
    l.temp as obs_low_temp
  from 
    tmp_high_temps as h
  join
    tmp_low_temps as l
  on
    h.date = l.date
"""

# execute
df3 = spark.sql(sql_stmt)

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/

5 - Last but not least, who wants to always re-read the file into a data frame just to query it? We can create a Hive database and table to expose the file sitting on storage.

I have a utility function that finds the saved part file in the temporary subdirectory and renames it.

# create single file
unwanted_file_cleanup("/lake/bronze/weather/temp/", "/lake/bronze/weather/temperature-data.parquet", "parquet")
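The helper itself is not shown here; a rough sketch of what it does, assuming Spark wrote exactly one part file into the temporary directory, might look like this:

# sketch only - find the single part file Spark wrote, move it to the final
# name, then remove the temporary directory
def unwanted_file_cleanup(src_dir, dst_file, extension):
  part_files = [f.path for f in dbutils.fs.ls(src_dir)
                if f.name.startswith("part-") and f.name.endswith(extension)]
  dbutils.fs.mv(part_files[0], dst_file)
  dbutils.fs.rm(src_dir, recurse=True)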

Finally, we create a database and a table. Look into the concepts of managed and unmanaged tables, as well as a remote metastore. I usually use an unmanaged table with the default Hive metastore.

%sql
DROP DATABASE IF EXISTS talks CASCADE

%sql
CREATE DATABASE IF NOT EXISTS talks

%sql
CREATE TABLE talks.weather_observations
  USING PARQUET
  LOCATION '/lake/bronze/weather/temperature-data.parquet'

https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table.html
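Once the table exists, any notebook in the workspace can query it by name without re-loading the file, which is exactly what the original question was asking for:

%sql
SELECT * FROM talks.weather_observations LIMIT 10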

In short, I hope you now have a good understanding of data processing using either DataFrame methods or Spark SQL.

Sincerely

John Miner ~ The Crafty DBA ~ Data Platform MVP

PS: I have a couple of videos on YouTube on this topic.
