
How to cast to string in Python 2.7 in a Spark UDF and pandas

I have this problem: I am writing Spark code in Python 2.7. It is a UDF, but when I pass the column that I want to handle, this error appears:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'

This is my Spark UDF:

def validate_rule(string):
  search_list=[" ",'!','%','$',"<",">","^",'¡',"+","N/A",u'¿','~','#','Ñ',"Ã","Ń","Ë","ó",'Ë','*','?',"ILEGIBLE", "VICIBLE","VISIBLE","INCOMPLETO"]
  str_temp = string
  if str_temp.upper() == "BORRADO":
    return 1
  elif len(str_temp) < 6:
    return 1
  elif any(ext in str_temp.upper() for ext in search_list):
    return 1
  else:
    return 0

df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))

The error appears when I call:

df_.show() or df_.toPandas()

It also happens when I use the pandas apply function with this lambda:

pdDF["data_to_procces"].apply(lambda x:validate_rule(x) )

The error appears again.

I have already tried the following, and it has not worked:

string.encode("utf-8") and unicode(string, 'utf-8')

Complete error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  1. You wouldn't be experiencing this problem if you were using Python 3.
  2. The cause of this issue is that your search_list inside validate_rule mixes Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points).

When the 3rd condition, any(ext in str_temp.upper() for ext in search_list), is evaluated, str_temp is a Unicode object, because Spark follows the "Unicode sandwich" rule: it decodes raw bytes into Unicode immediately at the input (when loading data) and encodes the Unicode strings back to bytes at the output.

Now, Python 2 will try to convert each ext to a Unicode object so it can evaluate ext in str_temp.upper(). This works fine for the first few entries of search_list, because they are all in the ASCII range, but it fails as soon as it encounters the bytestring '¡'. At that point Python 2 implicitly calls '¡'.decode() with its default encoding (ASCII, see sys.getdefaultencoding()), and that fails because '¡' is stored as the UTF-8 bytes 0xc2 0xa1, which are outside the ASCII range. That is exactly the 0xc2 byte reported in the complete error above.
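
A minimal Python 2.7 sketch (outside Spark, assuming a UTF-8 encoded source file) reproduces exactly that failure:

# -*- coding: utf-8 -*-
# Python 2.7 only: '¡' below is a UTF-8 bytestring ('\xc2\xa1'), while the
# right-hand side is a unicode object. The containment test forces an
# implicit '¡'.decode() with the default ASCII codec, which fails on 0xc2.
try:
    '¡' in u'TEXTO DE PRUEBA'
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)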

You might think, “sure, I'll explicitly decode each of those bytestrings using utf-8”, like this:

any(ext.decode("utf-8") in str_temp.upper() for ext in search_list)

but that won't work either, because a little further along in search_list there is a Unicode object: u'¿'. Calling decode on a Unicode object makes no sense. What happens here is that Python 2 implicitly converts (i.e. encodes) your Unicode object back to bytes first, because decode is only meant for bytestrings. However, u'¿'.encode("ascii") fails because the ASCII codec has no representation for the inverted question mark code point.
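
Again, a short Python 2.7 illustration of that implicit round trip; note that the exception is a UnicodeEncodeError, the same type as the one in the original question:

# -*- coding: utf-8 -*-
# Python 2.7 only: decode on a unicode object first encodes it back to bytes
# with the default ASCII codec, which cannot represent u'\xbf' ('¿').
try:
    u'¿'.decode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character u'\xbf' in position 0: ordinal not in range(128)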

Solutions

  • Work in Python 3. You really have no good reason to start developing in Python 2 anymore, as Python 2 is no longer maintained as of January 1st, 2020. Your code works perfectly under Python 3, where you no longer need the u prefix in front of Unicode strings.
  • Have all of the non-ASCII symbols in search_list defined as Unicode objects (with the u prefix). This is the bare minimum you would need:
def validate_rule(string):
    # All non-ASCII entries are explicit Unicode literals; the remaining
    # plain bytestrings are pure ASCII and therefore compare safely.
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if string.upper() == "BORRADO":
        return 1
    elif len(string) < 6:
        return 1
    elif any(ext in string.upper() for ext in search_list):
        return 1
    else:
        return 0

Here, I got away with not having to prepend every bytestring with u to create Unicode strings, because the remaining bytestrings are in the ASCII table, which is a subset of UTF-8. Still, it's recommended to be explicit and use Unicode strings everywhere in your search_list.
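
For completeness, here is a minimal sketch of how the fixed function could be registered and applied, assuming df and pdDF are the Spark and pandas DataFrames from the question and the column is still named "data_to_procces"; udf and IntegerType are the standard PySpark API:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Wrap the fixed validate_rule as a Spark UDF that returns an integer flag.
validate_rule_udf = udf(validate_rule, IntegerType())

df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))
df_.show()

# The same plain Python function also works with a pandas Series:
pdDF["data"] = pdDF["data_to_procces"].apply(validate_rule)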
