How to convert to a string in a Spark UDF and pandas in Python 2.7

Question

I have a problem with Spark code I wrote in Python 2.7. It is a UDF, but when I pass it the column I want to process, this error appears:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'

This is my Spark UDF:

def validate_rule(string):
  search_list=[" ",'!','%','$',"<",">","^",'¡',"+","N/A",u'¿','~','#','Ñ',"Ã","Åƒ","Ã‹","Ã³",'Ë','*','?',"ILEGIBLE", "VICIBLE","VISIBLE","INCOMPLETO"]
  str_temp = string
  if str_temp.upper() == "BORRADO":
    return 1
  elif len(str_temp) < 6:
    return 1
  elif any(ext in str_temp.upper()for ext in search_list):
    return 1
  else:
    return 0

df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))

The error appears at:

df_.show() or df_.toPandas()

And when I apply the function in pandas with this lambda:

pdDF["data_to_procces"].apply(lambda x:validate_rule(x) )

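the error appears again.

I have already tried the following, but it did not work:

string.encode("utf-8") and unicode(string, 'utf-8')

Full error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)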

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)

Answer 1

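If you were using Python 3, you would not run into this problem.
It is caused by search_list in validate_rule mixing Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points).

When the third condition, any(ext in str_temp.upper()for ext in search_list), is evaluated, str_temp is a Unicode object. That is because Spark follows the Unicode sandwich rule: it decodes raw bytes to Unicode as soon as they come in (that is, when the data is loaded) and encodes Unicode strings back to bytes on output.

Python 2 will now try to convert each ext to a Unicode object so that it can evaluate ext in str_temp.upper(). This works for the first few entries of search_list, because they are all in the ASCII range, but it fails as soon as it reaches the byte string '¡'. For that entry it effectively tries '¡'.decode(sys.getdefaultencoding()), and the default encoding in Python 2 is ASCII unless it has been changed, so the decode fails.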
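To make the failing step concrete, here is a minimal sketch that performs the same decode by hand. It assumes the source file was saved as UTF-8, so that the byte string '¡' is the two bytes 0xC2 0xA1; the helper name try_decode is made up for this illustration. The first byte, 0xC2, is the one the traceback in the question complains about.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3; the bytes are spelled out explicitly.
inverted_exclamation = u"\u00a1"                  # the character '¡'
raw_bytes = inverted_exclamation.encode("utf-8")  # assumed UTF-8 source file


def try_decode(data, encoding):
    """Return the decoded text, or the exception name if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Decoding as ASCII is what the implicit conversion attempts.
ascii_result = try_decode(raw_bytes, "ascii")
# Decoding with the encoding the bytes were written in succeeds.
utf8_result = try_decode(raw_bytes, "utf-8")

print(ascii_result)
print(repr(utf8_result))
```

The ASCII attempt fails on the very first byte, while the UTF-8 attempt returns the single code point U+00A1.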

You might think, "Fine, I will explicitly decode every byte string as utf-8", like this:

any(ext.decode("utf-8") in str_temp.upper()for ext in search_list)

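But that does not work either, because one entry in search_list is already a Unicode object: u'¿'. Calling decode on a Unicode object makes no sense. What happens is that Python 2 first implicitly converts (that is, encodes) the Unicode object back to bytes, because decode is only meaningful for byte strings. But u'¿'.encode("ascii") fails, because the ASCII codec has no mapping for the inverted question mark code point.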
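If you do want to normalize a mixed list rather than fix the literals, one way to avoid calling decode on a Unicode object is to decode only the entries that are byte strings. This is a rough sketch, not part of the original answer: the helper to_text is a hypothetical name, and UTF-8 is assumed as the encoding of the byte strings.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding only byte strings."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value  # already Unicode, leave it untouched


# A mixed list like the one in the question: bytes and Unicode side by side.
mixed = [b"\xc2\xa1", u"\u00bf", b"N/A"]
normalized = [to_text(item) for item in mixed]
print([repr(item) for item in normalized])
```

After this step every entry is a Unicode string, so the membership test against str_temp.upper() no longer triggers an implicit conversion.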

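Solution

Work in Python 3. There is really no good reason to start developing in Python 2 any more, because Python 2 is no longer maintained as of January 1, 2020. Your code runs fine under Python 3, although you no longer need the u prefix on Unicode strings there.
Alternatively, write every non-ASCII symbol in search_list as a Unicode object. This is the minimum you need: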

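# -*- coding: utf-8 -*-
# The line above is required in Python 2 when the file contains non-ASCII literals.
def validate_rule(text):
    # Non-ASCII entries carry the u prefix so they are Unicode objects in Python 2.
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Åƒ", u"Ã‹", u"Ã³", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if text.upper() == "BORRADO":
        return 1
    elif len(text) < 6:
        return 1
    elif any(ext in text.upper() for ext in search_list):
        return 1
    else:
        return 0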

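Here I did not have to prefix every byte string with u to create a Unicode string, because the remaining byte strings are in the ASCII table, which is a subset of UTF-8. Still, it is advisable to use Unicode strings everywhere in search_list.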
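Following that recommendation, a variant with every entry written as a Unicode literal might look like the sketch below. Moving the list to a module-level constant named SEARCH_LIST and folding the first two checks into one condition are my own choices, not something the original code does, and the sample values are invented for illustration.

```python
# -*- coding: utf-8 -*-
# Every entry is a Unicode literal, so no implicit ASCII decoding can occur.
SEARCH_LIST = [
    u" ", u"!", u"%", u"$", u"<", u">", u"^", u"¡", u"+", u"N/A", u"¿",
    u"~", u"#", u"Ñ", u"Ã", u"Åƒ", u"Ã‹", u"Ã³", u"Ë", u"*", u"?",
    u"ILEGIBLE", u"VICIBLE", u"VISIBLE", u"INCOMPLETO",
]


def validate_rule(text):
    """Return 1 if the value is flagged by any rule, otherwise 0."""
    upper = text.upper()
    if upper == u"BORRADO" or len(text) < 6:
        return 1
    return 1 if any(token in upper for token in SEARCH_LIST) else 0


# Hypothetical sample values, written with escapes to stay encoding-neutral.
samples = [u"BORRADO", u"abc", u"JUANPEREZ", u"mu\u00f1oz01"]
results = [validate_rule(value) for value in samples]
print(results)
```

The last sample contains a lowercase ñ, which upper() turns into Ñ, so it is flagged without raising an encoding error.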

Answer 1
0 2019-12-08 02:37:48

解决方案

如何在 python 2.7 中的 spark udf 和 pandas 中转换为字符串

问题描述

1 个解决方案

解决方案1 0 2019-12-08 02:37:48

解决方案

解决方案1
0 2019-12-08 02:37:48