How to cast to string in Python 2.7 in a Spark UDF and pandas

I have this problem: I am writing Spark code in Python 2.7. It is a UDF, but when I pass the column that I want to handle, this error appears:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'

This is my Spark UDF:

def validate_rule(string):
  search_list=[" ",'!','%','$',"<",">","^",'¡',"+","N/A",u'¿','~','#','Ñ',"Ã","Ń","Ë","ó",'Ë','*','?',"ILEGIBLE", "VICIBLE","VISIBLE","INCOMPLETO"]
  str_temp = string
  if str_temp.upper() == "BORRADO":
    return 1
  elif len(str_temp) < 6:
    return 1
  elif any(ext in str_temp.upper() for ext in search_list):
    return 1
  else:
    return 0

df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))
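
(For reference, the registration of validate_rule_udf is not shown above; a minimal sketch of how it would typically be created, assuming an integer return type, is:)

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# hypothetical registration of validate_rule as a Spark UDF returning an integer flag
validate_rule_udf = udf(validate_rule, IntegerType())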

The error appears in:

df_.show() or df_.toPandas()

and also when I use the pandas apply function with this lambda:

pdDF["data_to_procces"].apply(lambda x:validate_rule(x) )

The error appears again.

I have already tried the following, and it has not worked:

string.encode("utf-8") and unicode(string, 'utf-8')

Complete error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  1. You wouldn't be experiencing this problem if you were using Python 3.
  2. The cause of this issue is that your search_list inside validate_rule mixes Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points).

When the 3rd condition, any(ext in str_temp.upper() for ext in search_list), is evaluated, str_temp is a Unicode object, because Spark adheres to the Unicode sandwich rule: it immediately decodes raw bytes into Unicode at the input (so, when loading data) and encodes the Unicode strings back to bytes at the output.

Now, Python 2 will try to convert each ext to a Unicode object so that it can evaluate the expression ext in str_temp.upper(). This works fine for the first few entries of search_list, because they are all in the ASCII range, but it fails as soon as it encounters the bytestring '¡'. At that point, Python 2 implicitly decodes '¡' with its default encoding, which is ASCII, and that fails because '¡' is not an ASCII byte sequence.
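
A quick way to see that implicit decode in isolation (a sketch from a Python 2 interactive session, assuming the source file is saved as UTF-8, so '¡' is the bytestring '\xc2\xa1', matching the 0xc2 in the traceback):

>>> str_temp = u'SOME TEXT'
>>> '!' in str_temp   # ASCII bytestring: the implicit decode to Unicode succeeds
False
>>> '¡' in str_temp   # non-ASCII bytestring: the implicit ASCII decode fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)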

You might think, “sure, I'll explicitly decode each of those bytestrings using utf-8”, like this:

any(ext.decode("utf-8") in str_temp.upper() for ext in search_list)

but that won't work either, because a little later in search_list there is a Unicode object: u'¿'. Calling decode on a Unicode object makes no sense. What happens is that Python 2 implicitly converts (that is, encodes) the Unicode object back to bytes first, because it sees you want to call decode, which is only meant for bytestrings. However, that implicit u'¿'.encode("ascii") fails, because the ASCII codec has no representation for the inverted question mark code point.
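
The same failure mode, again sketched from a Python 2 session (note that the error is now an encode error even though decode was called):

>>> u'¿'.decode("utf-8")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbf' in position 0: ordinal not in range(128)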

Solutions

  • Work in Python 3. You really have no good reason to start developing in Python 2 anymore, as Python 2 will no longer be maintained from January 1st, 2020 on. Your code works perfectly under Python 3, although you then no longer need the u prefix in front of Unicode strings.
  • Have all of the symbols in search_list consistently defined as Unicode objects. This is the bare minimum you would need:
def validate_rule(str):
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡', 
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Ń", u"Ë", u"ó", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if str.upper() == "BORRADO":
        return 1
    elif len(str) < 6:
        return 1
    elif any(ext in str.upper() for ext in search_list):
        return 1
    else:
        return 0

Here, I got away with not prepending every bytestring with u to create Unicode strings, because the remaining bytestrings only contain ASCII characters, and ASCII is a subset of UTF-8. Still, it's recommended to be explicit and use Unicode strings everywhere in your search_list.
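
For completeness, a quick way to sanity-check the fixed function before handing it back to Spark or pandas (the sample inputs and the assignment to a new "data" column are only illustrative; pdDF and the column name are taken from the question):

print(validate_rule(u"BORRADO"))          # 1: exact match on the "BORRADO" rule
print(validate_rule(u"AB"))               # 1: shorter than 6 characters
print(validate_rule(u"DATO INCOMPLETO"))  # 1: contains a space and "INCOMPLETO"
print(validate_rule(u"ABCDEFG"))          # 0: passes all rules

pdDF["data"] = pdDF["data_to_procces"].apply(validate_rule)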
