
How to change hdfs block size in pyspark?

I use pySpark to write Parquet files. I would like to change the HDFS block size of those files. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the pySpark job? If so, how do I do it?
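If it does need to be in place before the job starts, one option is a minimal sketch like the following, which relies on Spark's spark.hadoop.* passthrough (such properties are copied into the Hadoop configuration when the session starts); the app name and size here are just examples:

from pyspark.sql import SparkSession

# Any "spark.hadoop.*" property is copied into the Hadoop Configuration
# at startup, so the block size is in place before any job runs.
spark = (
    SparkSession.builder
    .appName("block-size-demo")  # hypothetical app name
    .config("spark.hadoop.dfs.block.size", str(128 * 1024 * 1024))
    .getOrCreate()
)
sc = spark.sparkContext

The same property can also be passed on the command line, e.g. spark-submit --conf spark.hadoop.dfs.block.size=134217728.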

Try setting it through sc._jsc.hadoopConfiguration() with SparkContext:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with a 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
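Since the question is specifically about Parquet output, here is a minimal sketch of the same approach applied to a DataFrame write (the SparkSession, sample data, and output path are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Set the Hadoop property before the write action is triggered.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", str(128 * 1024 * 1024))
df = spark.createDataFrame([("Hello",), ("world",)], ["word"])  # hypothetical sample data
df.write.parquet("hdfs/output/path")  # Parquet files written with a 128 MB HDFS block size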

I had a similar issue, but I figured out what was wrong. It needs a number, not "128m". Therefore this should work (it worked for me, at least!):

block_size = str(1024 * 1024 * 128)  # 128 MB, expressed in bytes and passed as a string
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
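To verify that the new block size was actually applied, one way is a sketch like this, which reads the block size back through Hadoop's FileSystem API via the py4j gateway (the file path is hypothetical):

# Read the block size of one of the written files back from HDFS.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
status = fs.getFileStatus(hadoop.fs.Path("hdfs/output/path/part-00000"))  # hypothetical file
print(status.getBlockSize())  # expected: 134217728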

You can set the block size of the files that Spark writes:

myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)
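For context, a self-contained sketch of that write (the DataFrame, session, and destinationPath are hypothetical; the parquet.block.size option applies to this particular write):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
myDataFrame = spark.range(1000)          # hypothetical sample DataFrame
destinationPath = "hdfs/output/parquet"  # hypothetical destination
myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)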
