
Overwrite mode in a loop when partitioning using PySpark

# start and end define the date range to export.
from datetime import date

start = date(2019, 1, 20)
end = date(2019, 1, 22)

# daterange is a user-defined generator that yields one date per day in the range.
for single_date in daterange(start, end):
  query = "(SELECT ID, firstname, lastname, date FROM dbo.emp WHERE date = '%s') emp_alias" % single_date.strftime("%Y-%m-%d %H:%M:%S")
  df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
  df.write.format("parquet").mode("ignore").partitionBy("Date").save("/mnt/data/empData.parquet")

I have data for a number of days in a table and I need it as Parquet files partitioned by date. I have to save day by day in a loop, because the data is huge and I can't put all the days (for example a whole year of data) into one DataFrame. I have tried all the save modes. In 'Ignore' mode it saves only the first day. In 'Overwrite' mode it saves only the last day. In 'append' mode it adds the data. What I need is: if data is already there for a day, that day should be ignored and the existing data left as it is, but if data is not there yet, it should be written to the Parquet file partitioned by date. Please help.

There is currently no PySpark SaveMode that lets you preserve the existing partitions while inserting the new ones, if you also want to use Hive-style partitioning (which is what you are asking for when you call the partitionBy method). Note that there is an option to do the opposite, i.e. overwrite the data in some partitions while preserving the partitions for which there is no data in the DataFrame: set the configuration setting "spark.sql.sources.partitionOverwriteMode" to "dynamic" and use SaveMode.Overwrite when writing the dataset.
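
For reference, here is a minimal sketch of that opposite behaviour (dynamic partition overwrite), assuming Spark 2.3 or later; the path and column name are taken from the question and are illustrative only:

# Sketch: dynamic partition overwrite replaces only the partitions present in
# the DataFrame being written and leaves all other existing partitions intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
   .mode("overwrite")             # with "dynamic", only the partitions present in df are replaced
   .partitionBy("Date")
   .parquet("/mnt/data/empData.parquet"))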

You can still achieve what you want though, by first building a set of all the already existing partitions. You could do that with PySpark, or with any library that lets you perform listing operations on the file system (like Azure Data Lake Storage Gen2) or key-value store (like AWS S3) that holds the data. Once you have that list, you use it to filter the new dataset down to the data you still want to write. Here's an example using only PySpark:

In [1]: from pyspark.sql.functions import lit
   ...: df = spark.range(3).withColumn("foo", lit("bar"))
   ...: dir = "/tmp/foo"
   ...: df.write.mode("overwrite").partitionBy("id").parquet(dir)  # initial seeding
   ...: ! tree /tmp/foo
   ...: 
   ...: 
/tmp/foo                                                                        
├── id=0
│   └── part-00001-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=1
│   └── part-00002-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=2
│   └── part-00003-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
└── _SUCCESS

3 directories, 4 files

In [2]: df2 = spark.range(5).withColumn("foo", lit("baz"))
   ...: existing_partitions = spark.read.parquet(dir).select("id").distinct()
   ...: df3 = df2.join(existing_partitions, "id", how="left_anti")
   ...: df3.write.mode("append").partitionBy("id").parquet(dir)
   ...: spark.read.parquet(dir).orderBy("id").show()
   ...: 
   ...: 
+---+---+                                                                       
|foo| id|
+---+---+
|bar|  0|
|bar|  1|
|bar|  2|
|baz|  3|
|baz|  4|
+---+---+

As you can see, only 2 partitions were added. The ones that already existed have been preserved.

Now, getting the existing_partitions DataFrame required a read of the data. Spark won't actually read all of the data though, just the partitioning column and the metadata. As mentioned earlier, you could also get this information using whatever API is relevant to where your data is stored. In my particular case, as well as in yours (seeing as you're writing to the /mnt folder on Databricks), I could simply have used the built-in Python function os.walk: dirnames = next(os.walk(dir))[1], and created a DataFrame from that.
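
For completeness, a hedged sketch of that os.walk approach; it assumes the Hive-style directory layout Date=<value> produced by partitionBy("Date"), and a path that is reachable from local Python code (on Databricks that typically means the /dbfs prefix, e.g. /dbfs/mnt/...):

import os

# List the immediate sub-directories of the dataset path; with Hive-style
# partitioning they are named like "Date=2019-01-20".
dir = "/dbfs/mnt/data/empData.parquet"      # assumed path; adjust to your mount
dirnames = next(os.walk(dir))[1]

# Parse the partition values out of the directory names and build a one-column
# DataFrame that can drive the left_anti join shown above.
# (Assumes at least one partition already exists.)
existing_dates = [name.split("=", 1)[1] for name in dirnames if name.startswith("Date=")]
existing_partitions = spark.createDataFrame([(d,) for d in existing_dates], ["Date"])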

By the way, the reason you get the behaviours you've seen is:

  1. ignore mode

    In 'Ignore' mode it saves only the first day.

    Because you're using a for-loop and the output directory probably did not exist initially, the first date's partition gets written. In all subsequent iterations of the for-loop, the DataFrameWriter object does not write anymore, because it sees there is already some data (one partition, for the first date) there.

  2. overwrite mode

    In 'Overwrite' mode it saves only the last day.

    Actually, it saves a partition in each iteration of the for-loop, but because you're instructing the DataFrameWriter to overwrite, it removes all previously existing partitions in the directory. So it looks like only the last one was written.

  3. append mode

    In 'append' mode it adds the data.

    This doesn't need further explanation.

One suggestion: there is probably no need to read from the database multiple times (using the for-loop to create multiple different queries and JDBC connections). You could update the query to use WHERE date BETWEEN %(start) AND %(end), remove the for-loop altogether and enjoy a single, efficient write.
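
A hedged sketch of that single-query alternative, reusing the names from the question (the exact predicate formatting and the choice of append mode are assumptions):

# One JDBC read covering the whole date range, then one partitioned write.
query = ("(SELECT ID, firstname, lastname, date FROM dbo.emp "
         "WHERE date BETWEEN '{0}' AND '{1}') emp_alias").format(
             start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.write.mode("append").partitionBy("Date").parquet("/mnt/data/empData.parquet")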
