[英]Remove Accent accents from characters using pyspark
我的数据中有重音,想从字符中删除。 示例:Frédér8ic@ --> frederic 使用 Pyspark 代码
我尝试了以下代码:
def simplify(text):
import unicodedata
try:
text = unicode(text, 'utf-8')
except NameError:
pass
text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
return str(text)
但低于错误
text = unicode(text, 'utf-8')
TypeError: decoding str is not supported
def make_trans():
matching_string = ""
replace_string = ""
for i in range(ord(" "), sys.maxunicode):
name = unicodedata.name(chr(i), "")
if "WITH" in name:
try:
base = unicodedata.lookup(name.split(" WITH")[0])
matching_string += chr(i)
replace_string += base
except KeyError:
pass
return matching_string, replace_string
def clean_text(c):
matching_string, replace_string = make_trans()
return translate(
regexp_replace(c, "\p{M}", ""),
matching_string, replace_string
).alias(c)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.