
How do I import a table DIRECTLY into a Python dataframe within Databricks?

I'm currently working within a dev environment in Databricks, using a notebook to apply some Python code to analyse some dummy data (just a few thousand rows) held within a database table. I then deploy this to the main environment and run it on the real data (hundreds of millions of rows).

To start with I just need values from a single column that meet a certain criterion. To get at the data I'm currently doing this:

  1. spk_data = spark.sql("SELECT field FROM database.table WHERE field == 'value'")
  2. data = spk_data.toPandas()

The rest of the Python notebook then does its thing on that data, which works fine in the dev environment, but when I run it for real it falls over at line 2, saying it's out of memory.

I want to import the data DIRECTLY into the Pandas dataframe and so remove the need to convert from Spark, as I'm assuming that will avoid the error, but after a LOT of Googling I still can't work out how. The only thing I've tried that appears syntactically valid is:

data = pd.read_table(r'database.table')

but I just get:

'PermissionError: [Errno 13] Permission denied:'

(NB: unfortunately I have no control over the content, form or location of the database I'm querying.)

Your assumption is very likely to be untrue.

Spark is a distributed computation engine; pandas is a single-node toolset, so when you run a query over millions of rows it's likely to fail. When you call df.toPandas, Spark moves all of the data to your driver node, so if it's more than the driver memory, it's going to fail with an out-of-memory exception. In other words, if your dataset is larger than memory, pandas is not going to work well.
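As a concrete illustration, here is a minimal sketch (reusing the table and column names from the question, so the query itself is an assumption) of keeping the reduction on the cluster and only calling toPandas on a result that is already small:

    # `spark` is the SparkSession that Databricks notebooks provide automatically.
    # The filtering/aggregation runs distributed; only the tiny aggregated
    # result is moved to the driver as a pandas DataFrame.
    spk_data = spark.sql(
        "SELECT field, COUNT(*) AS n FROM database.table "
        "WHERE field = 'value' GROUP BY field"
    )
    small_pdf = spk_data.toPandas()  # safe: the result set is small by construction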

Also, when using pandas on Databricks you are missing all of the benefits of the underlying cluster. You are just using the driver.

There are two sensible options to solve this:

  • redo your solution using Spark
  • use koalas, which has an API mostly compatible with pandas (see the sketch after this list)
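A minimal sketch of the second option, assuming the table name from the question. On recent Databricks runtimes koalas ships as pyspark.pandas; on older runtimes the equivalent import would be `import databricks.koalas as ps` instead:

    import pyspark.pandas as ps  # koalas / pandas API on Spark

    # Looks like pandas, but executes distributed on the cluster, so nothing
    # is collected to the driver until you explicitly ask for it.
    psdf = ps.read_table("database.table")
    filtered = psdf[psdf["field"] == "value"]
    print(filtered.head(10))  # only a small sample is brought back to the driver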

You have to use pd.read_sql_query for this case.
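For completeness, one hedged sketch of how that could look when querying a Databricks SQL warehouse through the databricks-sql-connector package. The hostname, HTTP path and token are placeholders; pandas only officially documents DBAPI support for SQLite, though other DBAPI connections generally work (with a warning). Note the result set returned this way still has to fit in driver memory, so the WHERE clause needs to keep it small:

    import pandas as pd
    from databricks import sql  # pip install databricks-sql-connector

    # Placeholder connection details -- substitute your own workspace values.
    with sql.connect(
        server_hostname="your-workspace.cloud.databricks.com",
        http_path="/sql/1.0/warehouses/your-warehouse-id",
        access_token="your-personal-access-token",
    ) as conn:
        # Everything returned here is materialised in driver memory
        # as a pandas DataFrame.
        data = pd.read_sql_query(
            "SELECT field FROM database.table WHERE field = 'value'", conn
        )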
