
Update database table with Pyspark in Databricks

I have a table in an Azure SQL Server database which is populated from my DataFrame. I want to update this table based on multiple conditions from Databricks, using PySpark / pandas. I'm new to PySpark / Databricks / pandas; can someone please advise how to update the table? I have already inserted the data into the table. One solution I could think of is to load the data from the table into a DataFrame, merge the new file into that same DataFrame, then delete the data from the table and insert the merged DataFrame. If this is the right approach, how can we delete the data from the database table in the above scenario?

As you stated: 'load the data from the table into a dataframe and then merge the new file into the same dataframe, then delete the data from table and insert this dataframe.' That's definitely one option. I don't know if it's the absolute best practice, but it should be pretty fast, and it is likely the preferred way to do this, because the cluster runs in parallel, so the data manipulation, calculations, etc. are spread across the workers. Of course, you can also run SQL updates directly on the table. If the tables are really large (like billions of records and dozens of columns), that's probably going to be very slow: the SQL statement runs on the database server and is not parallelized across the cluster, whereas Spark does exactly that.
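The load, merge and replace flow described above could look roughly like the sketch below. It's only an illustration under assumptions: the function names, the staging table and the key columns are all hypothetical, `spark` is the session Databricks provides in a notebook, and the SQL Server JDBC driver is assumed to be available on the cluster. The merged result goes to a staging table first, because Spark reads lazily and overwriting the target while it is still being read could lose data. The "delete" step is handled by writing with `mode("overwrite")` plus the JDBC `truncate` option, which empties the existing table instead of dropping and recreating it.

```python
def jdbc_options(url, table, user, password):
    """Build the option dict shared by the JDBC reads and writes below."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def merge_frames(existing_df, updates_df, key_cols):
    """Keep existing rows whose keys are absent from updates_df, then append updates_df.

    Assumes both DataFrames have the same column names.
    """
    untouched = existing_df.join(
        updates_df.select(*key_cols), on=key_cols, how="left_anti"
    )
    return untouched.unionByName(updates_df)


def replace_table(spark, updates_df, url, user, password, target, staging, key_cols):
    """Load the target table, merge the new rows in, and write the result back."""
    target_opts = jdbc_options(url, target, user, password)
    staging_opts = jdbc_options(url, staging, user, password)

    # 1. Load the current contents of the table into a DataFrame.
    existing_df = spark.read.format("jdbc").options(**target_opts).load()

    # 2. Merge the new data into it (rows with matching keys are replaced).
    merged_df = merge_frames(existing_df, updates_df, key_cols)

    # 3. Materialize the merged result in a staging table (hypothetical name),
    #    so the target is not emptied while Spark is still reading from it.
    merged_df.write.format("jdbc").options(**staging_opts).mode("overwrite").save()

    # 4. Empty the target and insert the merged rows. With truncate=true,
    #    overwrite mode truncates the table rather than dropping it.
    staged_df = spark.read.format("jdbc").options(**staging_opts).load()
    (
        staged_df.write.format("jdbc")
        .options(**target_opts)
        .option("truncate", "true")
        .mode("overwrite")
        .save()
    )
```

In a notebook this would be called with something like `replace_table(spark, new_df, jdbc_url, user, password, "dbo.my_table", "dbo.my_table_staging", ["id"])`, where every one of those values is a placeholder for your own.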

See the link below for some additional ideas of what can be done.

https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
