How to change hdfs block size in pyspark?
I use pySpark to write parquet files. I would like to change the HDFS block size of those files. I set the block size like this, but it doesn't work:
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
Does this have to be set before starting the pySpark job? If so, how?
Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

txt = sc.parallelize(["Hello", "world", "!"])
txt.saveAsTextFile("hdfs/output/path")  # saving output with 128MB block size
In Scala:
sc.hadoopConfiguration.set("dfs.block.size", "128m")
I had a similar issue, but I figured it out: the setting needs a number, not "128m". Therefore this should work (it worked for me at least!):
block_size = str(1024 * 1024 * 128)
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
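If you prefer to keep human-readable sizes in your own configuration, a small helper can convert suffixed strings such as "128m" into the byte-count string that dfs.block.size expects. This parse_size function is a hypothetical convenience, not part of Spark or Hadoop:

```python
def parse_size(size):
    """Convert a size string like '128m' or '1g' into a byte-count string.

    Hypothetical helper: dfs.block.size wants a plain number of bytes
    when set through sc._jsc.hadoopConfiguration() from PySpark.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    size = size.strip().lower()
    if size and size[-1] in units:
        return str(int(size[:-1]) * units[size[-1]])
    return str(int(size))  # already a plain number of bytes

# Example: parse_size("128m") -> "134217728", which can then be passed to
# sc._jsc.hadoopConfiguration().set("dfs.block.size", parse_size("128m"))
```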
You can set the blockSize of files that Spark writes:
myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)