
Databricks Python Optimization

I need your help, please. I have a simple Python script that lists all the fields of every table in every database on Databricks; there are nearly 90 tables, and I would like to save the result in a txt or csv file. Here is the code. It works, but it takes 8 hours to finish, which is far too long. How can I optimize it, or is there a faster way?

# df_tables holds one row per database; the database name is in column `nombd`
df_tables = spark.sql("SELECT * FROM bd_xyh_name")
# DynoSQL is a single-column string table that collects the results for the txt export

def describe():
    for i in df_tables.collect():
        df1 = spark.sql("show tables in {}".format(i.nombd))
        for j in df1.collect():
            df2 = spark.sql("describe table {0}.{1}".format(j.database, j.tableName))
            for k in df2.rdd.toLocalIterator():
                # one INSERT statement per column: every single column
                # launches its own Spark job
                spark.sql('insert into DynoSQL select "'
                          + j.database + ";" + j.tableName + ";"
                          + k.col_name + ";" + k.data_type + '"')

describe()

You can try the logic below.

Logic:

  • Get the available databases in the workspace and build a list of their names.
  • Iterate over the database names, get the tables available in each database, and write them into a temp table. (Create the temp table as a managed table.)

Advantage: only one database is processed at a time, so if the run fails partway through, you can restart from the failing database instead of reprocessing the whole workspace. A sketch of that resume step is shown below.
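For the resume step, here is a minimal hypothetical sketch. It assumes the managed temp table is named writeintotable (as in the snippet further down) and that it keeps the database column produced by show tables; adjust both to your actual schema.

# Skip databases whose tables already landed in the managed temp table,
# so a failed run restarts at the failing database.
done = {r["database"]
        for r in spark.sql("select distinct database from writeintotable").collect()}

for row in spark.sql("show databases").collect():
    db = row["databaseName"]  # the column is named `namespace` on Spark 3+
    if db in done:
        continue  # already processed in a previous run
    spark.sql(f"show tables in {db}").write.insertInto("writeintotable")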

Code Snippet:

# Build the list of database names once.
# Note: the column is `databaseName` on Spark 2.x and `namespace` on Spark 3+.
databases = [x["databaseName"] for x in spark.sql("show databases").collect()]

for db in databases:
    # List the tables of one database and append them to the managed temp table.
    df_tables = spark.sql(f"show tables in {db}")
    df_tables.write.insertInto("writeintotable")
    display(df_tables)
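That said, the main cost in the original code is the per-column insert into DynoSQL: every column of every table triggers its own Spark job. Collecting the metadata on the driver and doing a single write at the end avoids this entirely. Here is a minimal sketch using the spark.catalog API; the output path and CSV format are examples, not part of the original answer:

# Walk databases -> tables -> columns through the catalog API and build plain
# Python tuples on the driver; ~90 tables is tiny, so this loop is fast.
rows = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        for col in spark.catalog.listColumns(tbl.name, db.name):
            rows.append((db.name, tbl.name, col.name, col.dataType))

result = spark.createDataFrame(
    rows, "database STRING, tableName STRING, colName STRING, dataType STRING"
)

# One single write instead of one INSERT per column.
# /tmp/table_columns_csv is an example DBFS path; choose any location you like.
(result.coalesce(1)
       .write.mode("overwrite")
       .option("header", True)
       .csv("/tmp/table_columns_csv"))

This also produces the csv file the question asks for directly, without the intermediate DynoSQL table.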

