
Trying to convert an “org.apache.spark.sql.DataFrame” object to a pandas dataframe results in error “name 'dataframe' is not defined” in Databricks


I am trying to query a SQL database via a JDBC connection in Databricks and store the query results as a pandas dataframe. All of the methods I can find for this online involve first storing it as a type of Spark object using Scala code and then converting this to pandas. For cell 1 I tried:

%scala
val df_table1 = sqlContext.read.format("jdbc").options(Map(
    ("url" -> "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"),
    ("dbtable" -> "(select top 10 * from myschema.table) as table"),
    ("user" -> "user"),
    ("password" -> "password123"),
    ("driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))
).load()

which results in:

df_table1: org.apache.spark.sql.DataFrame = [var1: int, var2: string ... 50 more fields]

Great. But when I try to convert it to a pandas df in cell 2 so I can use it:

import numpy as np
import pandas as pd 

result_pdf = df_table1.select("*").toPandas()

print(result_pdf)

It generates the error message:

NameError: name 'df_table1' is not defined

How do I successfully convert this object to a pandas dataframe? Alternatively, is there any way of querying the SQL database via a JDBC connection using Python code without needing to use Scala at all (I do not particularly like Scala syntax and would rather avoid it if at all possible)?
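As an aside on the "query SQL from Python without Scala at all" part: outside of Spark entirely, plain Python DB-API drivers (for example `pyodbc` for SQL Server) follow the same connect/execute/fetch pattern. A minimal sketch using the stdlib's `sqlite3` as a stand-in for a real SQL Server connection (the table name and data here are made up for illustration):

```python
import sqlite3

# sqlite3 stands in for a real driver such as pyodbc; against Azure SQL you
# would instead connect with the SQL Server connection string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (var1 INTEGER, var2 TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b")])

# DB-API pattern: execute a query, fetch rows as plain Python tuples
rows = conn.execute("SELECT var1, var2 FROM mytable ORDER BY var1").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
conn.close()
```

With pandas installed, `pd.read_sql(query, conn)` on such a connection returns a dataframe directly, skipping Spark altogether.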

I am assuming that your intention is to query SQL using Python, and if that's the case the query below will work. (The NameError itself happens because a `%scala` cell runs in the Scala interpreter, so a Scala `val` like `df_table1` is never visible to Python cells; the usual bridge is to call `df_table1.createOrReplaceTempView("table1")` in the Scala cell and read it back with `spark.table("table1")` in Python.)

%python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` is already defined; this setup block is
# only needed when running outside Databricks.
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
database = "YourDBName"
table = "[dbo].[YourTableName]"
user = "SqlUser"
password  = "SqlPassword"

DF1 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF1.show()

table = "[dbo].[someOthertable]"

DF2 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF2.show()

Finaldf = DF1.join(DF2, DF1.Prop_0 == DF2.prop_0, how="inner") \
    .select(DF1.Prop_0, DF1.Prop_1, DF2.Address)
Finaldf.show()
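Since the original goal was a pandas dataframe: because everything above now lives in the same Python interpreter, `Finaldf.toPandas()` works exactly as attempted in the question. The inner join also has a direct pandas equivalent; a sketch with made-up data standing in for the two tables (column names `Prop_0`, `Prop_1`, `Address` taken from the example above; assumes pandas is installed):

```python
import pandas as pd

# Stand-ins for DF1.toPandas() and DF2.toPandas()
df1 = pd.DataFrame({"Prop_0": [1, 2, 3], "Prop_1": ["a", "b", "c"]})
df2 = pd.DataFrame({"prop_0": [2, 3, 4], "Address": ["x", "y", "z"]})

# Equivalent of DF1.join(DF2, DF1.Prop_0 == DF2.prop_0, how="inner")
#                  .select(DF1.Prop_0, DF1.Prop_1, DF2.Address)
final = df1.merge(df2, left_on="Prop_0", right_on="prop_0", how="inner")
final = final[["Prop_0", "Prop_1", "Address"]]
print(final)
```

Whether to join in Spark and then convert, or convert both frames and join in pandas, mostly depends on data size: for large tables, let Spark do the join and pull only the result down with `toPandas()`.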
