I need your help please, i have a simple code in python which lists all the fields in the tables in all the databases that are on databricks, there are a little nearly 90 tables and I would like to save the result in a txt or csv file. here is the code used it works but it takes 8 hours to finish it is too long how can I optimize or have another way for it to be faster?
#df_tables = spark.sql("SELECT * FROM bd_xyh_name")
#DynoSQL is a string table for result in txt
def discribe():
try:
for i in df_tables.collect():
showTables="""show tables in {};""".format(i.nombd)
df1=spark.sql(showTables)
for j in df1.collect():
describeTable="""describe table {0}.{1};""".format(j.database,j.tableName)
df2=spark.sql(describeTable)
#df3=df2.collect()
df3 = df2.rdd.toLocalIterator()
for k in df3:
#df=df2.select(df2.col_name;k.data_type)
#spark.sql("insert into NewTable VALUES ("+j.database+";"+j.tableName+";"+k.col_name+";"+k.data_type+");")
spark.sql("insert into DynoSQL select \""+j.database+";"+j.tableName+";"+k.col_name+";"+k.data_type+"\"")
# request="insert into NewTable VALUES ({};{};{};{});""".format(j.database,j.tableName,k.col_name,k.data_type)
#spark.sql(request)
except:
raise```
You can try with below logic.
Logic :
list
Advantage : Based on this logic, at a time only one databases only will be processed and if fails during the process, we can start from failing databases
instead of whole workspace level.
Code Snippet :
from pyspark.sql.types import *
import pyspark.sql.functions as f
from pyspark.sql import functions as F
from pyspark.sql.functions import col, concat, lit
df = spark.sql("show databases")
list = [x["databaseName"] for x in df.collect()]
for x in list:
df = spark.sql(f"use {x}")
df1 = spark.sql("show tables")
df_loc.write.insertInto("writeintotable")
display(df1)
Screenshot:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.