I have a problem: I write Spark code in Python 2.7. It is a UDF, but when I pass in the column that I want to handle, this error appears:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'
This is my Spark UDF:
def validate_rule(string):
    search_list = [" ", '!', '%', '$', "<", ">", "^", '¡', "+", "N/A", u'¿', '~', '#', 'Ñ', "Ã", "Ń", "Ë", "ó", 'Ë', '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    str_temp = string
    if str_temp.upper() == "BORRADO":
        return 1
    elif len(str_temp) < 6:
        return 1
    elif any(ext in str_temp.upper() for ext in search_list):
        return 1
    else:
        return 0
df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))
The error appears in df_.show() or df_.toPandas(), and also when I use the pandas apply function with this lambda:
pdDF["data_to_procces"].apply(lambda x: validate_rule(x))
the error appears again.
I have already tried string.encode("utf-8") and unicode(string, 'utf-8'), and neither has worked.
Complete error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
search_list inside validate_rule mixes Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points). When the third condition, any(ext in str_temp.upper() for ext in search_list), is evaluated, str_temp is a Unicode object (because Spark adheres to the Unicode sandwich rule, whereby it immediately decodes raw bytes into Unicode at the input (so, when loading data) and encodes the Unicode strings back to bytes at the output).
Now, Python 2 will try to convert each ext to a Unicode object in order to evaluate the expression ext in str_temp.upper(). This works fine for the first few entries of search_list, because they're all in the ASCII range, but it fails as soon as it encounters the bytestring '¡'. At that point, it tries to call '¡'.decode(locale.getpreferredencoding()), and that will fail if your locale's preferred encoding is ASCII.
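This implicit decode is exactly what produces the error from the question. A minimal reproduction of the failing step: '¡' encoded as UTF-8 is the two bytes 0xC2 0xA1, and decoding those bytes with the ASCII codec fails on the very first byte, just like the traceback shows.

```python
# UTF-8 encoding of '¡' as raw bytes, as a Python 2 source file saved in
# UTF-8 would store the bytestring '¡':
raw = b'\xc2\xa1'

try:
    # This is the decode Python 2 attempts implicitly when the locale's
    # preferred encoding is ASCII:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    # "'ascii' codec can't decode byte 0xc2 in position 0: ..."
    print(exc)

# Decoding with the correct codec works:
print(raw.decode('utf-8'))  # ¡
```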
You might think, "sure, I'll explicitly decode each of those bytestrings using utf-8", like this:
any(ext.decode("utf-8") in str_temp.upper() for ext in search_list)
but that won't work either, because a little further down search_list there is a Unicode object: u'¿'. Calling decode on a Unicode object makes no sense. What happens here is that Python 2 implicitly converts (so, encodes) your Unicode object back to bytes, because it recognizes you want to call decode on an object and that's only intended for bytestrings. However, u'¿'.encode("ascii") won't work, because the ASCII codec simply has no representation for the inverted question mark codepoint.
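Both halves of that failure can be seen directly (the byte values below are just the UTF-8 encoding of '¿', codepoint U+00BF):

```python
try:
    # The implicit encode Python 2 performs before it can call decode:
    u'\xbf'.encode('ascii')  # u'¿'
except UnicodeEncodeError as exc:
    # "'ascii' codec can't encode character ..."
    print(exc)

# Encoding with UTF-8, by contrast, works fine:
print(u'\xbf'.encode('utf-8'))  # the two bytes 0xC2 0xBF
```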
The solution is to prepend u to the non-ASCII literals, so that every entry of search_list is a proper Unicode object. This would be the bare minimum you would need to have:
def validate_rule(str):
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if str.upper() == "BORRADO":
        return 1
    elif len(str) < 6:
        return 1
    elif any(ext in str.upper() for ext in search_list):
        return 1
    else:
        return 0
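As a quick sanity check of the fixed rule (the function is repeated here so the snippet runs standalone; the sample inputs are made up):

```python
# -*- coding: utf-8 -*-
def validate_rule(s):
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if s.upper() == "BORRADO":
        return 1
    elif len(s) < 6:
        return 1
    elif any(ext in s.upper() for ext in search_list):
        return 1
    else:
        return 0

print(validate_rule(u'borrado'))   # 1: equals "BORRADO" after upper()
print(validate_rule(u'abc'))       # 1: shorter than 6 characters
print(validate_rule(u'CAMION¿'))   # 1: contains u'¿'
print(validate_rule(u'ABCDEF'))    # 0: passes every check
```

Because all comparisons now happen between Unicode objects, no implicit decode is triggered, and the UDF no longer raises inside df_.show().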
Here, I got away with not prepending u to every remaining bytestring, because they are all in the ASCII table, which is a subset of UTF-8. Still, it's recommended to be explicit and use Unicode strings everywhere in your search_list.