
Are flat files (ORC, CSV) more efficient than Delta tables in Spark?

I am working with around 16 Delta tables in Databricks, each with roughly 1 to 3 million rows.
When I try to perform an operation such as a join followed by a delete or insert on these Delta tables, it takes a long time.
I mostly need to perform insert and delete operations. So should I use flat files instead of Delta tables, or should I try MERGE on the Delta tables instead?

Hence my doubt: what are the advantages of Delta, and why not just use flat files?

It's a basic question, but I am still new to Databricks, so any help would be appreciated.

Andy, it totally depends on your needs and expectations, but Delta tables help with many data-engineering challenges.

Delta tables maintain a transaction log, which enables scenarios such as time travel: you can roll back, reproduce experiments (by reading an older version of the data), and analyze the differences between data versions.
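For example, a minimal PySpark sketch of time travel (the table path and version number are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the current version of the Delta table
    current = spark.read.format("delta").load("/mnt/tables/my_table")

    # Read an older version of the same table (time travel)
    old = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/tables/my_table")

    # Rows present now but not in version 5, i.e. the changes since then
    changes = current.exceptAll(old)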

Also, compared with managing plain Parquet files yourself, you don't have to rewrite the full dataset when data changes; Delta only writes the updated data.
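Since you mostly insert and delete, a MERGE on the Delta table is usually the right tool, as it rewrites only the files containing matching rows. A minimal sketch, assuming a target table keyed on an id column and an incoming DataFrame updates_df with an op column flagging deletions (all names and the path are illustrative):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/tables/my_table")

    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        # Delete matched rows that the source flags for deletion
        .whenMatchedDelete(condition="s.op = 'delete'")
        # Insert source rows that don't exist in the target yet
        .whenNotMatchedInsertAll()
        .execute())

With flat ORC/CSV files there is no equivalent: you would have to read, filter, and rewrite whole files yourself.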

If you don't need any of this, then maybe you can forget about Delta tables and focus on pure performance.
