How to convert to a string in a Spark UDF and in pandas with Python 2.7

Question

I have this problem: I wrote Spark code in Python 2.7 that uses a UDF, but when I pass it the column I want to process, this error appears:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xd0'

This is my Spark UDF:

def validate_rule(string):
  search_list=[" ",'!','%','$',"<",">","^",'¡',"+","N/A",u'¿','~','#','Ñ',"Ã","Åƒ","Ã‹","Ã³",'Ë','*','?',"ILEGIBLE", "VICIBLE","VISIBLE","INCOMPLETO"]
  str_temp = string
  if str_temp.upper() == "BORRADO":
    return 1
  elif len(str_temp) < 6:
    return 1
  elif any(ext in str_temp.upper()for ext in search_list):
    return 1
  else:
    return 0

df_ = df.withColumn("data", validate_rule_udf(col("data_to_procces")))

The error occurs at:

df_.show() or df_.toPandas()

And when I apply the function through a lambda in pandas:

pdDF["data_to_procces"].apply(lambda x:validate_rule(x) )

the error appears again.

I have already tried the following, but it did not help:

string.encode("utf-8") and unicode(string, 'utf-8')

Full error

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)

Answer 1

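If you were using Python 3, you would not run into this problem.
The cause is that search_list in validate_rule mixes Python 2 strings (which are really just bytes) with Python 2 Unicode strings (sequences of Unicode code points).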

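When the third condition, any(ext in str_temp.upper()for ext in search_list), is evaluated, str_temp is a Unicode object. That is because Spark follows the Unicode sandwich rule: it decodes raw bytes to Unicode as soon as they come in (so, when the data is loaded) and encodes Unicode strings back to bytes on output.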
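As a rough illustration of the sandwich rule, independent of Spark, the sketch below decodes bytes once on input, works only on Unicode text in the middle, and encodes once on output. The sample value and the variable names are made up for the example, and UTF-8 is assumed as the encoding.

```python
# Hypothetical input: the UTF-8 bytes for "señor" as they might arrive from a file.
raw_in = b"se\xc3\xb1or"

# Bottom slice of the sandwich: decode bytes to Unicode as early as possible.
text = raw_in.decode("utf-8")

# Filling: all processing happens on Unicode text only.
shouted = text.upper()

# Top slice: encode back to bytes as late as possible, on output.
raw_out = shouted.encode("utf-8")
```

Anything a UDF receives from Spark sits in the middle of that sandwich, which is why the literals it is compared against should be Unicode as well.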

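Python 2 will now try to convert each ext to a Unicode object so that it can evaluate ext in str_temp.upper(). This works for the first few entries of search_list because they are all in the ASCII range, but it fails as soon as it reaches the byte string '¡'. For that entry it implicitly decodes the bytes with the interpreter's default encoding (sys.getdefaultencoding(), which is ASCII in Python 2 unless it has been changed), and that fails because the first byte, 0xc2, is outside the ASCII range.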
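To see that failure in isolation, the implicit step can be written out by hand. This is only a sketch: it assumes the source file is saved as UTF-8 (which matches the 0xc2 byte in the error message above), and the variable names are invented for the example.

```python
# The byte string '¡' as it is stored in a UTF-8 source file: two bytes.
raw = u"\u00a1".encode("utf-8")

# What Python 2 effectively attempts during the implicit conversion:
# decode those bytes with the ASCII codec.
try:
    raw.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(message)
```

The ASCII entries in the list never hit this path with an error, because every one of their bytes is below 128.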

You might think, "Sure, I'll just explicitly decode every byte string with utf-8", like this:

any(ext.decode("utf-8") in str_temp.upper()for ext in search_list)

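But that does not work either, because search_list also contains a Unicode object: u'¿'. Calling decode on a Unicode object makes no sense. What happens here is that Python 2 implicitly converts (so, encodes) your Unicode object back to bytes, because it sees that you want to call decode on it and decode is only meant for byte strings. But u'¿'.encode("ascii") will not work, because the ASCII codec has no mapping for the inverted question mark code point.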
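If the list really has to stay mixed, one way around this is to decode only the entries that are byte strings and leave Unicode objects alone. The helper below is a sketch of that idea, not part of any library: to_text is a hypothetical name, and UTF-8 is assumed for the byte strings.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding only if it is a byte string."""
    if isinstance(value, bytes):  # in Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value  # already Unicode, nothing to do


# Encoding the inverted question mark with ASCII fails, as described above.
try:
    u"\u00bf".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# A mixed list like search_list: one byte string, two Unicode strings.
mixed = [b"\xc2\xa1", u"\u00bf", u"N/A"]
normalized = [to_text(item) for item in mixed]
```

With that in place the membership test can run over normalized entries, although writing the literals as Unicode from the start is simpler.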

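Solution

Work in Python 3. There is really no good reason to start developing in Python 2 any more, since Python 2 will no longer be maintained as of January 1, 2020. Your code works fine under Python 3, although you no longer need the u prefix in front of Unicode strings there.
Alternatively, properly encode the symbols in search_list as Unicode objects. This is the minimum you would need: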

# -*- coding: utf-8 -*-  (Python 2 needs this in a .py file with non-ASCII literals)
def validate_rule(string):
    # Every non-ASCII entry carries the u prefix, so it is a Unicode object.
    search_list = [" ", '!', '%', '$', "<", ">", "^", u'¡',
                   "+", "N/A", u'¿', '~', '#', u'Ñ', u"Ã",
                   u"Åƒ", u"Ã‹", u"Ã³", u'Ë',
                   '*', '?', "ILEGIBLE", "VICIBLE", "VISIBLE", "INCOMPLETO"]
    if string.upper() == "BORRADO":
        return 1
    elif len(string) < 6:
        return 1
    elif any(ext in string.upper() for ext in search_list):
        return 1
    else:
        return 0

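Here I did not have to prefix every byte string with u to create a Unicode string, because the remaining byte strings are all in the ASCII table, which is a subset of UTF-8. Still, it is advisable to use Unicode strings everywhere in search_list.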
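Following that advice, a version with Unicode strings throughout might look like the sketch below. The module-level name SEARCH_LIST and the sample inputs are my own choices for illustration; the rules themselves are the ones from the question.

```python
# -*- coding: utf-8 -*-

# Every entry is a Unicode string, so no implicit ASCII decoding can occur.
SEARCH_LIST = [u" ", u"!", u"%", u"$", u"<", u">", u"^", u"¡",
               u"+", u"N/A", u"¿", u"~", u"#", u"Ñ", u"Ã",
               u"Åƒ", u"Ã‹", u"Ã³", u"Ë",
               u"*", u"?", u"ILEGIBLE", u"VICIBLE", u"VISIBLE", u"INCOMPLETO"]


def validate_rule(string):
    """Return 1 if the value breaks one of the rules, else 0."""
    upper = string.upper()  # computed once instead of once per list entry
    if upper == u"BORRADO" or len(string) < 6:
        return 1
    if any(ext in upper for ext in SEARCH_LIST):
        return 1
    return 0


# Hypothetical sample values, passed as Unicode just as Spark would pass them.
samples = [u"BORRADO", u"abc", u"ABCDEFG", u"ESPA\u00d1A1"]
results = [validate_rule(s) for s in samples]
```

The same function can then be handed to the Spark UDF or to pandas apply unchanged, as long as the column values arrive as Unicode.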
