I have run into an issue where I want a pandas DataFrame created from a Spark DataFrame to handle umlauted characters correctly.
This is a minimal reproducible example:
from pyspark.sql.types import StructType, StructField, StringType

data = [("Citroën",)]
schema = StructType([
    StructField("car", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
The Spark df looks like this:
+-------+
|    car|
+-------+
|Citroën|
+-------+
I want to convert the Spark df into a pandas df via df.toPandas(). These are the outputs I get:
pdf = df.toPandas()
print(pdf)
print(pdf["car"].unique())
which prints:
0 Citro??n
[u'Citro\xc3\xabn']
Question: How do I get Pandas to understand these special characters?
I tried browsing forums and SO itself but cannot find anything that works for me. I have tried setting PYTHONIOENCODING=utf8
as suggested by this. I have also tried adding # -*- coding: UTF-8 -*-
to the .py file.
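One way to tell whether this is only a console display problem or genuine corruption of the data is to look at the repr() of the value instead of printing it directly. A minimal sketch (plain pandas, no Spark, with Python 2-style unicode strings in mind):

```python
# -*- coding: utf-8 -*-
import pandas as pd

pdf = pd.DataFrame({'car': [u'Citroën']})
value = pdf['car'][0]

# repr() shows the underlying code points regardless of console encoding.
print(repr(value))
# A single code point U+00EB (u'Citro\xebn') means the data is intact and
# only the console display is wrong; two code points (u'Citro\xc3\xabn')
# mean the UTF-8 bytes were decoded with the wrong codec.
```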
UPDATE 1
Converting the pandas df back to spark:
test_sdf = spark.createDataFrame(pdf)
test_sdf.show()
+--------+
| car|
+--------+
|Citroën|
+--------+
I think the encoding should be fine. To check, you could try a word containing only regular letters.
But I think the problem is the data structure itself. Try moving the comma so that data contains a list with one tuple in it. The parentheses by themselves won't make a tuple, but adding the comma forces it into a tuple in the list:
data = [("Citroën",)]
I don't have any issues with pandas understanding these characters - it may just be the way your system is displaying the output. You could test this by converting back to Spark and seeing if it looks the same as before.
Edit - showing pandas working... This works fine for me:
import pandas as pd
print(pd.DataFrame({'car':['Citroën']}))
You could try:
pdf["car"] = pdf["car"].str.decode('utf-8')
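If .str.decode complains that the values are already unicode, the u'Citro\xc3\xabn' output suggests the UTF-8 bytes were decoded as Latin-1 somewhere along the way. A sketch of undoing that, under the assumption that the garbling is exactly this Latin-1/UTF-8 mix-up:

```python
import pandas as pd

# u'Citro\xc3\xabn' is what you get when the UTF-8 bytes of 'Citroën'
# are mistakenly decoded as Latin-1 (mojibake).
pdf = pd.DataFrame({'car': [u'Citro\xc3\xabn']})

# Reverse the bad decode: re-encode to the original bytes, then decode
# those bytes as UTF-8.
pdf['car'] = pdf['car'].str.encode('latin-1').str.decode('utf-8')
print(pdf['car'][0])  # Citroën
```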