I am trying to move data in S3 that is partitioned on a date string at rest (source) to another location where it is partitioned at rest (destination) as year=yyyy/month=mm/day=dd/.
While I am able to read the entire source location in Spark and repartition it into the destination layout in temporary HDFS, s3DistCp fails to copy the result from HDFS to S3. It fails with an OutOfMemory error.
I am trying to write close to 2 million small files (20 KB each).
My s3DistCp is running with the following args:

```
sudo -H -u hadoop nice -10 bash -c "if hdfs dfs -test -d hdfs:///<source_path>; then /usr/lib/hadoop/bin/hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar -libjars /usr/share/aws/emr/s3-dist-cp/lib/ -Dmapreduce.job.reduces=30 -Dmapreduce.child.java.opts=Xmx2048m --src hdfs:///<source_path> --dest s3a://<destination_path> --s3ServerSideEncryption; fi"
```
It fails with:

```
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # java.lang.OutOfMemoryError: Java heap space
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # -XX:OnOutOfMemoryError="kill -9 %p"
```
The EMR cluster I am running this on is:

```
"master_instance_type": "r5d.8xlarge",
"core_instance_type": "r5.2xlarge",
"core_instance_count": "8",
"task_instance_types": [ "r5.2xlarge", "m5.4xlarge" ],
"task_instance_count": "1000"
```
Any suggestions on which s3DistCp configurations I could increase so it can copy this without running out of memory?
I ended up running this iteratively; for the AWS stack above, each iteration was able to handle about 300K files without an OOM.
This is a classic case where you can use the multithreaded scheduling capabilities of Spark by setting spark.scheduler.mode=FAIR and assigning pools.
What you need to do is build the list of partitions beforehand. An example is shown below (run this before doing spark-submit):
```
# Create a list of all *possible* partitions like this.
# Example S3 prefixes:
#   s3://my_bucket/my_table/year=2019/month=02/day=20
#   ...
#   s3://my_bucket/my_table/year=2020/month=03/day=15
#   ...
#   s3://my_bucket/my_table/year=2020/month=09/day=01

# We set `TARGET_PREFIX` as:
TARGET_PREFIX="s3://my_bucket/my_table"

# And create the list (down to the day=nn part) by looping twice.
# Add one more loop level if the partitioning goes down to the hour.
aws s3 ls "${TARGET_PREFIX}/" | grep PRE | awk '{print $2}' | while read year_part; do
  full_year_part="${TARGET_PREFIX}/${year_part}"
  aws s3 ls "${full_year_part}" | grep PRE | awk '{print $2}' | while read month_part; do
    full_month_part="${full_year_part}${month_part}"
    aws s3 ls "${full_month_part}" | grep PRE | awk -v pref="${full_month_part}" '{print pref $2}'
  done
done
```
Once done, we run this script and save the result in a file like this: `bash build_year_month_day.sh > s3_<my_table_day_partition>_file.dat`
Now we are ready to run Spark in multithreaded mode. The Spark code needs two things (besides scheduler.mode=FAIR):

1. creating an iterator from the file created above, s3_<my_table_day_partition>_file.dat
2. calling sc.setLocalProperty to assign each thread to a pool

Here is how it is done.
A. We read the file in our Spark app (Python):
```python
year_month_date_index_file = "s3_<my_table_day_partition>_file.dat"
with open(year_month_date_index_file, 'r') as f:
    content = f.read()
content_iter = [(idx, c) for idx, c in enumerate(content.split("\n")) if c]
```
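To illustrate what this step produces, here is a minimal self-contained sketch; the prefixes are made-up sample data standing in for the real file contents:

```python
# Hypothetical sample of the file produced by the bash listing script.
content = (
    "s3://my_bucket/my_table/year=2019/month=02/day=20/\n"
    "s3://my_bucket/my_table/year=2020/month=03/day=15/\n"
    "s3://my_bucket/my_table/year=2020/month=09/day=01/\n"
)

# Same construction as above: enumerate the non-empty lines, pairing each
# prefix with an index that will later be used to pick a scheduler pool.
content_iter = [(idx, c) for idx, c in enumerate(content.split("\n")) if c]

print(content_iter[0])
# -> (0, 's3://my_bucket/my_table/year=2019/month=02/day=20/')
```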
B. And use a slice of 100 days to fire 100 threads:
```python
from itertools import islice
import threading

# The number of threads can be increased or decreased.
strt = 0
stp = 100
while strt < len(content_iter):
    threads_lst = []
    path_slices = islice(content_iter, strt, stp)
    for s3path in path_slices:
        print("PROCESSING FOR PATH {}".format(s3path))
        pool_index = int(s3path[0])  # Spark needs a pool id
        my_addr = s3path[1]
        # Calling `process_in_pool` in each thread; pool_index is a mandatory argument.
        agg_by_day_thread = threading.Thread(target=process_in_pool, args=(pool_index, <additional_args>))
        agg_by_day_thread.start()  # start of thread
        threads_lst.append(agg_by_day_thread)
    for process in threads_lst:
        process.join()  # wait for all threads to finish
    strt = stp
    stp += 100
```
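The batching arithmetic is easy to get off by one: islice(content_iter, strt, stp) yields stp - strt items, so starting with stp = 99 would make the first batch 99 threads rather than 100. A standalone sanity check with dummy items in place of the partition tuples:

```python
from itertools import islice

items = list(range(250))  # dummy stand-ins for (index, path) tuples
batch_sizes = []

strt, stp = 0, 100
while strt < len(items):
    # islice(items, strt, stp) yields exactly stp - strt items
    # (fewer on the final, partial batch).
    batch = list(islice(items, strt, stp))
    batch_sizes.append(len(batch))
    strt = stp
    stp += 100

print(batch_sizes)  # -> [100, 100, 50]
```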
Two things to notice: path_slices = islice(content_iter, strt, stp) returns slices of size (stp - strt), and pool_index = int(s3path[0]) is the index within content_iter, which we use to assign a pool id.
Now the meat of the code:

```python
def process_in_pool(pool_id, <other_arguments>):
    sc.setLocalProperty("spark.scheduler.pool", "pool_id_{}".format(str(int(pool_id) % 100)))
```

As you can see, we want to restrict the threads to 100 pools, so we set spark.scheduler.pool to pool_index % 100. Write your actual transformation/action in this process_in_pool() function, and once done, exit the function by freeing that pool:

```python
    ...
    sc.setLocalProperty("spark.scheduler.pool", None)
    return
```
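Putting the pieces together, here is a self-contained sketch of the same thread-per-partition pattern with the Spark call stubbed out. The stub process_in_pool only records which pool id each path would land in; in the real job it would call sc.setLocalProperty and run the transformation. All names and paths here are illustrative:

```python
import threading
from itertools import islice

results = {}
results_lock = threading.Lock()

def process_in_pool(pool_id, path):
    # Stand-in for the real function: the actual job would call
    # sc.setLocalProperty("spark.scheduler.pool", ...) here, run the
    # Spark action, then reset the pool to None before returning.
    with results_lock:
        results[path] = pool_id % 100

# Fake partition list: 250 (index, path) tuples.
content_iter = [(i, "s3://my_bucket/my_table/part={}/".format(i)) for i in range(250)]

strt, stp = 0, 100
while strt < len(content_iter):
    threads = []
    for pool_index, path in islice(content_iter, strt, stp):
        t = threading.Thread(target=process_in_pool, args=(pool_index, path))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait for the current batch before starting the next
    strt = stp
    stp += 100

print(len(results))  # -> 250
```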
Finally, run your spark-submit like:

```
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --<other options> \
  my_spark_app.py
```
If tuned with correct executor/core/memory, you would see a huge performance gain.
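Pools referenced via sc.setLocalProperty are created on the fly with default settings; if you want explicit weights or minimum shares per pool, you can optionally point spark.scheduler.allocation.file at an XML allocation file. A minimal sketch, where the pool name pool_id_0 is assumed to match the naming scheme used above:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- One entry per named pool; pools not listed here get default settings. -->
  <pool name="pool_id_0">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```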
The same can be done in Scala with its concurrent Futures, but that's for another day.