
PySpark — UnicodeEncodeError: 'ascii' codec can't encode character

I am loading a dataframe with foreign characters (åäö) into Spark using spark.read.csv with encoding='utf-8', and trying to do a simple show().

>>> df.show()

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
print(self._jdf.showString(n, truncate))
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 579: ordinal not in range(128)
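
For reference, a minimal sketch of the kind of code that triggers this (the file name is a placeholder, not my actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# 'people_utf8.csv' stands in for the real input file containing åäö characters
df = spark.read.csv('people_utf8.csv', header=True, encoding='utf-8')
df.show()  # this is the call that blows up with UnicodeEncodeError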

I figure this is probably related to Python itself, but I cannot understand how any of the tricks mentioned here, for example, can be applied in the context of PySpark and the show() function.

https://issues.apache.org/jira/browse/SPARK-11772 discusses this issue and gives a solution, which is to run:

export PYTHONIOENCODING=utf8

before running pyspark. I wonder why the above works, because sys.getdefaultencoding() returned utf-8 for me even without it.
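
Checking both values in the PySpark shell looks roughly like this (the comments are my guess at what differs, not verified on the cluster):

import sys
print(sys.getdefaultencoding())  # 'utf-8' for me, even without PYTHONIOENCODING
print(sys.stdout.encoding)       # presumably the value PYTHONIOENCODING overrides,
                                 # since show() ultimately prints through sys.stdout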

How to set sys.stdout encoding in Python 3? also talks about this and gives the following solution for Python 3:

import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)

I am setting the encoding upfront and it is valid throughout the script; this works for me (note that reload() and sys.setdefaultencoding() exist only in Python 2):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
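
A minimal sketch of how either variant can go at the top of a script (the codecs.getwriter wrapper is an alternative to the reload() hack that avoids touching the default encoding; treat it as a sketch, not my exact setup):

# -*- coding: utf-8 -*-
# Put this at the very top of the script, before the first show()/print of unicode data.
import sys

if sys.version_info[0] == 2:
    import codecs
    # Python 2: wrap stdout so unicode objects are UTF-8-encoded when printed
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
else:
    # Python 3: reopen stdout with an explicit UTF-8 encoding
    sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf-8', buffering=1)

print(u'åäö')  # prints correctly even when stdout would otherwise default to ASCII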

I faced the same issue with the following versions of Spark and Python:

SPARK - 2.4.0

Python - 2.7.5

None of the above solutions worked for me.

For me, the issue was happening while trying to save the result RDD to an HDFS location. I was reading the input from an HDFS location and saving the output back to HDFS. The following is the code used for the read and write operations when this issue came up:

Reading input data:

# read the raw text and split each line on the \x01 delimiter
monthly_input = sc.textFile(monthly_input_location).map(lambda i: i.split("\x01"))
monthly_input_df = sqlContext.createDataFrame(monthly_input, monthly_input_schema)

Writing to HDFS:

# joining with str(i) is the likely source of the UnicodeEncodeError in Python 2:
# str() on a unicode value containing non-ASCII characters encodes it as ASCII and fails
result = output_df.rdd.map(tuple).map(lambda line: "\x01".join([str(i) for i in line]))
result.saveAsTextFile(output_location)

I changed the reading and writing code to the following:

Reading code:

monthly_input = sqlContext.read.format("csv") \
    .option("encoding", "UTF-8") \
    .option("header", "true") \
    .option("delimiter", "\x01") \
    .schema(monthly_input_schema) \
    .load(monthly_input_location)

Writing Code:

output_df.write.format("csv") \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .save(output_location)

Not only did this solve the issue, it also improved the I/O performance considerably (almost 3x).

But there is one known issue with the write logic above, for which I have yet to find a proper solution: if a field in the output is blank, the CSV writer encloses the blank value in double quotes ("").
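
A possible workaround I have not yet verified against this pipeline, assuming a Spark version (2.4+) whose CSV writer supports the emptyValue option:

# Hypothetical tweak to the write above: ask the CSV writer to emit empty strings
# as truly empty fields instead of the quoted "" (emptyValue option, Spark 2.4+)
output_df.write.format("csv") \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .option("emptyValue", "") \
    .save(output_location)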

For me that issue is currently not a big deal, since I am loading the output into Hive anyway, and the double quotes can be removed during the import.

PS: I am still using SQLContext and have yet to upgrade to SparkSession, but from what I have tried so far, similar read and write operations in SparkSession-based code work the same way.
