How to cast to string in Python 2.7 in a Spark UDF and pandas
I have this problem: I write Spark code in Python 2.7. It is a UDF, but when I pass the column that I want to handle, this error appears:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'
and this is my Spark UDF:
def validate_rule(string):
    search_list = [" ", '!', '%', '$', "<", ">", "^", '¡', "+", "N/A", u'¿', '~', '#', 'Ñ', "Ã", "Ń", "Ë", "ó", 'Ë', '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    str_temp = string
    if str_temp.upper() == "BORRADO":
        return 1
    elif len(str_temp) < 6:
        return 1
    elif any(ext in str_temp.upper() for ext in search_list):
        return 1
    else:
        return 0
df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))
the error appears in:
df_.show() or df_.toPandas()
and also when I use the pandas apply function with this lambda:
pdDF["data_to_procces"].apply(lambda x: validate_rule(x))
the error appears again.
I have already tried the following, and they have not worked:
string.encode("utf-8") and unicode(string, 'utf-8')
Complete error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
The search_list inside validate_rule mixes Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points). When the third condition, any(ext in str_temp.upper() for ext in search_list), is evaluated, str_temp is a Unicode object (because Spark adheres to the Unicode sandwich rule, whereby it immediately decodes raw bytes into Unicode at the input (so, when loading data) and encodes the Unicode strings back to bytes at the output).
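As an aside (my sketch, not part of the original answer): Python 3 removed this implicit bytes-to-Unicode conversion entirely, which makes the mixing immediately visible as an error instead of a latent codec failure:

```python
# Python 3 refuses to compare bytes with str, so the mixing that
# Python 2 papered over with an implicit ASCII decode fails loudly.
text = u"NIÑO"              # a Unicode string (sequence of code points)
raw = u"Ñ".encode("utf-8")  # a bytestring (raw bytes), like Python 2's 'Ñ'

try:
    raw in text  # Python 2 would implicitly decode `raw` with ASCII here
except TypeError as exc:
    print("mixing bytes and str fails:", exc)
```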
Now, Python 2 will try to convert each ext to a Unicode object to be able to evaluate the statement ext in str_temp.upper(). This works fine for the first few elements of search_list, because they're all in the ASCII range, but it fails as soon as it encounters the bytestring '¡'. At that point, Python 2 effectively calls '¡'.decode(sys.getdefaultencoding()), and since the default encoding is ASCII, that fails.
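You can reproduce that failing decode outside Spark (a sketch of mine, written for Python 3, where the decode has to be explicit):

```python
# '¡' encoded as UTF-8 is the two bytes b'\xc2\xa1'; decoding them with
# the ASCII codec reproduces the exact error from the question.
raw = u"¡".encode("utf-8")
print(repr(raw))  # b'\xc2\xa1'
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xc2 in position 0: ...
```

Note that 0xc2 in position 0 matches the byte reported in the question's traceback.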
You might think, "sure, I'll explicitly decode each of those bytestrings using utf-8", like this:

any(ext.decode("utf-8") in str_temp.upper() for ext in search_list)

but that won't work either, because a little later in the search_list there is already a Unicode object: u'¿'. Calling decode on Unicode objects makes no sense. What will happen here is that Python 2 will implicitly convert (so, encode) your Unicode object back to bytes, because it recognizes that you want to call decode on an object and that's only intended for bytestrings. However, u'¿'.encode("ascii") won't work, because the ASCII codec does not have a representation for the inverted question mark code point.
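That encode step fails the same way in any Python version (a sketch of mine, runnable on Python 3):

```python
# U+00BF (inverted question mark) has no ASCII representation, so the
# ASCII codec raises UnicodeEncodeError when asked to encode it.
try:
    u"¿".encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xbf' ...
```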
The fix is to put u in front of every non-ASCII string literal, so that all symbols in search_list are proper Unicode objects. This would be the bare minimum you would need to have:

def validate_rule(str):
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if str.upper() == "BORRADO":
        return 1
    elif len(str) < 6:
        return 1
    elif any(ext in str.upper() for ext in search_list):
        return 1
    else:
        return 0
Here, I got away with not having to prepend every bytestring with u to create Unicode strings, because the remaining bytestrings are all in the ASCII range, which is a subset of UTF-8. Still, it's recommended to be explicit and use Unicode strings everywhere in your search_list.
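To sanity-check the corrected function without a Spark cluster, you can call it directly on a few sample values (a self-contained copy of the fixed function from above; the sample inputs are mine, not from the question's data):

```python
# Self-contained copy of the corrected validate_rule for a quick check.
def validate_rule(s):
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if s.upper() == "BORRADO":
        return 1
    elif len(s) < 6:
        return 1
    elif any(ext in s.upper() for ext in search_list):
        return 1
    else:
        return 0

print(validate_rule(u"BORRADO"))   # 1: matches the "BORRADO" rule
print(validate_rule(u"abc"))       # 1: shorter than 6 characters
print(validate_rule(u"ABC123"))    # 0: passes every rule
print(validate_rule(u"NIÑO¿OK"))   # 1: contains u'Ñ' and u'¿'
```

Because every element of search_list is now a consistent string type, the same comparisons work whether the function is wrapped as a Spark UDF or applied with pandas.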