Remove accents from characters using PySpark

Question

My data contains accented characters and I want to strip the accents. Example: Frédér8ic@ --> frederic, using PySpark code.

I tried the following code:

def simplify(text):
    import unicodedata
    try:
        text = unicode(text, 'utf-8')
    except NameError:
        pass
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
    return str(text)

But I get the error below:

text = unicode(text, 'utf-8')
TypeError: decoding str is not supported

Answer 1

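import sys
import unicodedata

from pyspark.sql.functions import regexp_replace, translate


def make_trans():
    # Build two parallel strings: accented characters and their base characters.
    matching_string = ""
    replace_string = ""

    for i in range(ord(" "), sys.maxunicode):
        name = unicodedata.name(chr(i), "")
        if "WITH" in name:
            try:
                # e.g. "LATIN SMALL LETTER E WITH ACUTE" -> "LATIN SMALL LETTER E"
                base = unicodedata.lookup(name.split(" WITH")[0])
                matching_string += chr(i)
                replace_string += base
            except KeyError:
                # No character is registered under the base name; skip it.
                pass

    return matching_string, replace_string


def clean_text(c):
    matching_string, replace_string = make_trans()
    # Strip combining marks first, then map precomposed characters to their base.
    return translate(
        regexp_replace(c, r"\p{M}", ""),
        matching_string, replace_string
    ).alias(c)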
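
For comparison, here is a minimal sketch of how the `simplify` function from the question could be made to work on Python 3, where every `str` is already Unicode and the `unicode(text, 'utf-8')` call is not needed. This is not part of the answer above. Dropping the digit and the `@` and lowercasing the result are assumptions taken from the `Frédér8ic@ --> frederic` example, and the `None` guard is an assumption about how null values should be handled.

```python
import re
import unicodedata


def simplify(text):
    # Assumption: null values are passed through unchanged.
    if text is None:
        return None
    # NFD splits an accented character into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks.
    ascii_text = (
        unicodedata.normalize("NFD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Assumption from the question's example: lowercase the result and
    # keep only the letters a-z, so digits and symbols are removed.
    return re.sub(r"[^a-z]", "", ascii_text.lower())


print(simplify("Frédér8ic@"))  # prints "frederic"
```

A plain Python function like this would have to be wrapped with `pyspark.sql.functions.udf` before it can be applied to a DataFrame column. The `translate`/`regexp_replace` version above uses only built-in column functions, which is usually the faster option in Spark.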
