I have a Spark DataFrame with a string column containing accented characters such as áãâàéêèíîìóõôòúûùç, and I want to replace them with their unaccented counterparts aaaaeeeiiioooouuuc.
As an example of what I want:
name | unaccent
Vitória | Vitoria
João | Joao
Maurício | Mauricio
I found this example, but it doesn't work for these special characters: Pyspark removing multiple characters in a dataframe column
I've tried to create this DataFrame manually, but for some reason I couldn't reproduce the special characters, and a question mark (?) shows up instead:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    data=[("Vitória",), ("João",), ("Maurício",)],
    schema=StructType([StructField("A", StringType(), True)]),
)
df.show()
+--------+
| A|
+--------+
| Vit?ria|
| Jo?o|
|Maur?cio|
+--------+
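Since the accented characters already display as ? in df.show(), the corruption likely happens before translate runs, in the console or source-file encoding rather than in Spark. A quick sanity check (a sketch, run on the driver) is to confirm the interpreter's stdout encoding and that the string literal itself survived intact:

```python
import sys

# If this prints something other than 'utf-8' (e.g. 'ascii' or 'cp1252'),
# accented characters can be mangled when written to the console
print(sys.stdout.encoding)

# The literal should be the 7-character string with U+00F3 (ó) in position 3
print("Vitória", len("Vitória"))
```

If the literal comes out mangled here, the fix is the environment's encoding, not the DataFrame code.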
When I use the translate function, this is the result:
df.select(
    "A",
    F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc").alias("unaccent"),
).show()
+--------+--------+
| A|unaccent|
+--------+--------+
| Vit?ria| Vitaria|
| Jo?o| Joao|
|Maur?cio|Mauracio|
+--------+--------+
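Spark's translate does a positional one-to-one character mapping, so the matching and replacement strings must line up character for character; if either string gets mis-encoded, the positions shift and you see results like "Vitaria". A pure-Python analogue (using str.maketrans, which behaves the same way) shows the intended mapping:

```python
# Positional 1:1 character mapping, analogous to Spark's F.translate:
# the i-th character of src is replaced by the i-th character of dst
src = "áãâàéêèíîìóõôòúûùç"
dst = "aaaaeeeiiioooouuuc"
table = str.maketrans(src, dst)

print("Vitória".translate(table))   # Vitoria
print("Maurício".translate(table))  # Mauricio
```

If src and dst differ in length here, str.maketrans raises a ValueError, which is a handy check that both literals survived encoding intact.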
Any thoughts on how to unaccent these special characters?
It seems like the problem is in your IDE or console encoding, not in PySpark.
My environment: Jupyter notebook in VS Code (macOS):
df.withColumn(
"unaccent",
F.translate(F.col("A"), "áãâàéêèíîìóõôòúûùç", "aaaaeeeiiioooouuuc")
).show()
produces the correct output:
+--------+--------+
| A|unaccent|
+--------+--------+
| Vitória| Vitoria|
| João| Joao|
|Maurício|Mauricio|
+--------+--------+
(spark.version = 3.2.1)
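As an alternative to enumerating every accented character by hand, you could strip accents generically with Unicode NFD normalization, which decomposes e.g. 'ó' into 'o' plus a combining mark that can then be dropped. A minimal sketch (the Spark wiring at the end is hypothetical and assumes an existing SparkSession and the DataFrame df from the question):

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFD splits accented characters into base letter + combining mark;
    # category "Mn" (Mark, nonspacing) identifies the combining marks to drop
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

print(strip_accents("Vitória"))   # Vitoria
print(strip_accents("Maurício"))  # Mauricio

# Hypothetical Spark usage, wrapping the function as a UDF:
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# strip_accents_udf = F.udf(strip_accents, StringType())
# df.withColumn("unaccent", strip_accents_udf(F.col("A"))).show()
```

This also covers characters you didn't list (ü, ñ, etc.) without growing the translate strings, at the cost of a Python UDF rather than a built-in expression.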