OutOfMemory error when writing to s3a through EMR

I'm getting an OutOfMemory error from the following PySpark code. The job fails after a certain number of rows have been written. This does not happen if I write to HDFS instead of s3a, so I've narrowed the problem down to s3a. The end goal is to write to s3a, so I'm wondering whether there is an optimal s3a configuration that avoids running out of memory for extremely large tables.

df = spark.sql("SELECT * FROM my_big_table")
df.repartition(1).write.option("header", "true").csv("s3a://mycsvlocation/folder/")

My s3a configuration (EMR defaults):

('fs.s3a.attempts.maximum', '10')
('fs.s3a.buffer.dir', '${hadoop.tmp.dir}/s3a')
('fs.s3a.connection.establish.timeout', '5000')
('fs.s3a.connection.maximum', '15')
('fs.s3a.connection.ssl.enabled', 'true')
('fs.s3a.connection.timeout', '50000')
('fs.s3a.fast.buffer.size', '1048576')
('fs.s3a.fast.upload', 'true')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
('fs.s3a.max.total.tasks', '1000')
('fs.s3a.multipart.purge', 'false')
('fs.s3a.multipart.purge.age', '86400')
('fs.s3a.multipart.size', '104857600')
('fs.s3a.multipart.threshold', '2147483647')
('fs.s3a.paging.maximum', '5000')
('fs.s3a.threads.core', '15')
('fs.s3a.threads.keepalivetime', '60')
('fs.s3a.threads.max', '256')
('mapreduce.fileoutputcommitter.algorithm.version', '2')
('spark.authenticate', 'true')
('spark.network.crypto.enabled', 'true')
('spark.network.crypto.saslFallback', 'true')
('spark.speculation', 'false')
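
For reference, these properties can be overridden per job. Below is a minimal PySpark sketch of how fs.s3a.* settings are typically passed through when building the SparkSession; the values shown are placeholders taken from the list above, not a recommendation.

# Minimal sketch: overriding fs.s3a.* settings when building the SparkSession.
# Property names come from the list above; the values are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-write-example")
    # Hadoop properties are forwarded with the "spark.hadoop." prefix.
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .config("spark.hadoop.fs.s3a.multipart.size", "104857600")        # 100 MB parts
    .config("spark.hadoop.fs.s3a.multipart.threshold", "2147483647")  # int-capped on older Hadoop
    .getOrCreate()
)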

Base of the stack trace:

Caused by: java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.hadoop.fs.s3a.S3AFastOutputStream.write(S3AFastOutputStream.java:194)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:60)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
        at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
        at com.univocity.parsers.common.input.WriterCharAppender.writeCharsAndReset(WriterCharAppender.java:152)
        at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:808)
        ... 16 more

The problem here is that the default s3a upload does not support a single file larger than 2 GB (2147483647 bytes).

('fs.s3a.multipart.threshold', '2147483647')

My EMR release ships an older Hadoop in which the multipart.threshold parameter is read as an int, so the limit is 2147483647 bytes for a single "part" or file. More recent versions read it as a long and can support a larger single-file size limit.
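
To put the cap in concrete terms, here is a quick illustrative check of what the int-valued threshold allows (not part of the original post):

# Illustrative only: size allowed by the int-valued threshold.
int_max = 2147483647            # fs.s3a.multipart.threshold on older Hadoop (Java int max)
print(int_max / (1024 ** 3))    # ~2.0 GiB per single file/part
# Writing my_big_table with repartition(1) produces one CSV file; once that file
# passes this size, the buffered write fails with the OutOfMemoryError shown above.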

As a workaround, I'll write the file to HDFS first and then move it to S3 via a separate Java program.
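
The original workaround uses a separate Java program; a rough Python equivalent of the same idea (pull the file out of HDFS, then let the S3 client do a multipart upload) might look like the sketch below. It assumes boto3 and the hdfs CLI are available on the node; the bucket, key, and file paths are placeholders.

# Rough sketch of the workaround in Python (the original uses a separate Java program):
# copy the CSV out of HDFS, then upload it to S3 as a multipart upload.
# Assumptions: boto3 and the `hdfs dfs` CLI are available; all names are placeholders.
import subprocess
import boto3
from boto3.s3.transfer import TransferConfig

# 1. Pull the single CSV part written by Spark from HDFS to local disk.
subprocess.run(
    ["hdfs", "dfs", "-get", "/tmp/mycsvlocation/part-00000.csv", "/tmp/output.csv"],
    check=True,
)

# 2. Upload with multipart enabled so the file size is not capped at 2 GB.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,   # start multipart at 100 MB
                        multipart_chunksize=100 * 1024 * 1024)   # 100 MB parts
boto3.client("s3").upload_file(
    "/tmp/output.csv", "mycsvlocation", "folder/output.csv", Config=config
)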
