
How to change hdfs block size of DataFrame in pyspark

This seems related to:

How to change hdfs block size in pyspark?

I can successfully change the HDFS block size with rdd.saveAsTextFile, but not with the corresponding DataFrame.write.parquet: the block size is not applied when saving in Parquet format.

I am unsure whether this is a bug in the pyspark DataFrame API or whether I did not set the configuration correctly.

The following is my testing code:

##########
# init
##########
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import hdfs
from hdfs import InsecureClient
import os

import numpy as np
import pandas as pd
import logging

os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'

block_size = 512 * 1024

conf = (
    SparkConf()
    .setAppName("myapp")
    .setMaster("spark://spark1:7077")
    .set('spark.cores.max', 20)
    .set("spark.executor.cores", 10)
    .set("spark.executor.memory", "10g")
    .set("spark.hadoop.dfs.blocksize", str(block_size))
    .set("spark.hadoop.dfs.block.size", str(block_size))
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)

##########
# main
##########

# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])

# save using DataFrameWriter, resulting in the default 128MB block size

df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')

# save using rdd, resulting in a 512KB block size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')

Hadoop and Spark are two independent tools, each with its own strategy for handling data. Spark and Parquet work with data partitions, and the HDFS block size is not meaningful to them. Do what Spark says, and then do whatever you want with the output files inside HDFS.

You can change the number of Parquet partitions with:

df_txt.repartition(6).write.format("parquet").save("hdfs://...")
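
As a quick illustration of the point above, here is a minimal sketch (the output path is hypothetical): repartition controls how many part files the write produces, not the HDFS block size of each file.

# sketch: 6 partitions in memory become 6 part files on write
df_small = df_txt.repartition(6)
print(df_small.rdd.getNumPartitions())  # 6

df_small.write.mode('overwrite').format("parquet").save("hdfs://spark1/tmp/temp_repartitioned")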

Found the answer from the following link:

https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html

I can successfully set the Parquet block size with spark.hadoop.parquet.block.size.

The following is the sample code:

# init
block_size = 512 * 1024 

conf = (
    SparkConf()
    .setAppName("myapp")
    .setMaster("spark://spark1:7077")
    .set('spark.cores.max', 20)
    .set("spark.executor.cores", 10)
    .set("spark.executor.memory", "10g")
    .set('spark.hadoop.parquet.block.size', str(block_size))
    .set("spark.hadoop.dfs.blocksize", str(block_size))
    .set("spark.hadoop.dfs.block.size", str(block_size))
    .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072))
)

sc = SparkContext(conf=conf) 
spark = SparkSession(sc) 

# create DataFrame 
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}]) 

# save using DataFrameWriter, resulting in a 512KB block size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')

# save using DataFrameWriter.csv, resulting in a 512KB block size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv')

# save using DataFrameWriter.text, resulting in a 512KB block size
df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')

# save using rdd, resulting in a 512KB block size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
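
To double-check that the configured block size was actually applied, the file status can be read back through the same WebHDFS client. This is only a sketch, assuming the writes above succeeded; WebHDFS reports the per-file block size in the blockSize field of the file status, so each data file should show 524288 bytes here.

# verify the HDFS block size of the written part files (sketch, reusing the
# InsecureClient from above)
for name in client.list('/tmp/temp_with_df'):
    if name.startswith('part-'):
        info = client.status('/tmp/temp_with_df/' + name)
        print(name, info['blockSize'])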
