
How to unaccent special characters in PySpark?

I have a Spark DataFrame with a string column containing accented characters such as áãâàéêèíîìóõôòúûùç, and I want to replace them with their unaccented counterparts aaaaeeeiiioooouuuc, respectively.

As an example of what I want:

name        | unaccent          
Vitória     | Vitoria
João        | Joao
Maurício    | Mauricio

I found this example, but it doesn't work for these special characters: Pyspark removing multiple characters in a dataframe column

I've tried to create this DataFrame manually, but for some reason I couldn't reproduce the special characters; a question mark (?) shows up instead:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    data=[("Vitória",), ("João",), ("Maurício",)],
    schema=StructType([StructField("A", StringType(), True)]),
)
df.show()
df.show()

+--------+
|       A|
+--------+
| Vit?ria|
|    Jo?o|
|Maur?cio|
+--------+
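
One way to tell whether the data itself is mangled or only the console rendering is broken is to inspect the raw UTF-8 bytes of the column. A minimal diagnostic sketch (encode and hex are standard pyspark.sql.functions; the utf8_hex alias is just illustrative):

# If the hex contains C3 B3 (the UTF-8 bytes of ó), the data is intact and only
# the display is broken; if it contains 3F (?), the string was already mangled
# when the DataFrame was created.
df.select("A", F.hex(F.encode(F.col("A"), "UTF-8")).alias("utf8_hex")).show(truncate=False)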

When I use the translate function, this is the result:

df.select("A",
          F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc").alias("unaccent")).show()

+--------+--------+
|       A|unaccent|
+--------+--------+
| Vit?ria| Vitaria|
|    Jo?o|    Joao|
|Maur?cio|Mauracio|
+--------+--------+

Any thoughts on how to unaccent these special characters?

It seems like the problem is in your IDE's encoding, not in PySpark. The fact that every accented character comes out as a suggests that the accented characters in your translate source string were mangled to ? as well, so every ? in the data matched the first ? in the source string and was replaced by its counterpart a.

My environment: Jupyter notebook in VS Code (macOS):

df.withColumn(
    "unaccent", 
    F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc")
).show()

results in the correct output:

+--------+--------+
|       A|unaccent|
+--------+--------+
| Vitória| Vitoria|
|    João|    Joao|
|Maurício|Mauricio|
+--------+--------+

(spark.version = 3.2.1)
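
Separately, if you want an unaccent that does not depend on listing every accented character by hand, a common alternative is Unicode NFD normalization followed by stripping the combining marks. A minimal sketch using a plain Python UDF (the name unaccent is just illustrative, not part of the original post):

import unicodedata

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def unaccent(s):
    if s is None:
        return None
    # NFD splits "ó" into "o" plus a combining acute accent; dropping the
    # combining marks (Unicode category Mn) leaves the bare letters.
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if unicodedata.category(ch) != "Mn"
    )

df.withColumn("unaccent", unaccent(F.col("A"))).show()

For large datasets the same logic can be wrapped in a pandas UDF instead of a row-at-a-time UDF, but the transformation is identical.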
