
How to change hdfs block size in pyspark?

I use pySpark to write Parquet files. I would like to change the HDFS block size of those files. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the pySpark job? If so, how do I do it?
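If it does need to be in place before the job starts, one option is a minimal sketch like the following, which relies on Spark's spark.hadoop.* passthrough (such properties are copied into the Hadoop configuration when the session starts); the app name and size here are just examples:

from pyspark.sql import SparkSession

# Any "spark.hadoop.*" property is copied into the Hadoop Configuration
# at startup, so the block size is in place before any job runs.
spark = (
    SparkSession.builder
    .appName("block-size-demo")  # hypothetical app name
    .config("spark.hadoop.dfs.block.size", str(128 * 1024 * 1024))
    .getOrCreate()
)
sc = spark.sparkContext

The same property can also be passed on the command line, e.g. spark-submit --conf spark.hadoop.dfs.block.size=134217728.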

Try setting it through sc._jsc.hadoopConfiguration() with SparkContext:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
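Since the question is specifically about Parquet output, here is a minimal sketch of the same approach applied to a DataFrame write (the SparkSession, sample data, and output path are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Set the Hadoop property before the write action is triggered.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", str(128 * 1024 * 1024))
df = spark.createDataFrame([("Hello",), ("world",)], ["word"])  # hypothetical sample data
df.write.parquet("hdfs/output/path")  # Parquet files written with a 128 MB HDFS block size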

I had a similar issue, but I figured out what was wrong. It needs a number, not "128m". Therefore this should work (it worked for me, at least!):

block_size = str(1024 * 1024 * 128)  # 128 MB, expressed in bytes and passed as a string
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
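To verify that the new block size was actually applied, one way is a sketch like this, which reads the block size back through Hadoop's FileSystem API via the py4j gateway (the file path is hypothetical):

# Read the block size of one of the written files back from HDFS.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
status = fs.getFileStatus(hadoop.fs.Path("hdfs/output/path/part-00000"))  # hypothetical file
print(status.getBlockSize())  # expected: 134217728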

You can set the block size of the files that Spark writes:

myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)
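For context, a self-contained sketch of that write (the DataFrame, session, and destinationPath are hypothetical; the parquet.block.size option applies to this particular write):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
myDataFrame = spark.range(1000)          # hypothetical sample DataFrame
destinationPath = "hdfs/output/parquet"  # hypothetical destination
myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)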
