
PySpark - Reading a CSV file and retaining the original special characters

I am reading a CSV file whose data looks like this:

Country        State      City
MÉXICO         Neu Leon   Monterrey    
MÉXICO         Chiapas    ATLÁNTICO

I tried reading the file into a PySpark dataframe with encoding 'utf-8' and 'ISO-8859-1', but the values get changed as shown below.

With option("encoding", "utf-8"):

Country          State      City
M�XICO         Neu Leon   Monterrey    
M�XICO         Chiapas    ATL�NTICO

With option("encoding", "ISO-8859-1"):

Country        State      City
M?XICO         Neu Leon   Monterrey    
M?XICO         Chiapas    ATL?NTICO
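For reference, the � symptom usually means the bytes are not valid UTF-8 at that position. A minimal pure-Python sketch of the mechanism, assuming the file is actually Latin-1/Windows-1252 encoded (where É is the single byte 0xC9):

```python
# 'É' in Latin-1 is the single byte 0xC9, which is not a valid
# standalone sequence in UTF-8.
latin1_bytes = "MÉXICO".encode("latin-1")  # b'M\xc9XICO'

# Decoding those bytes as UTF-8 turns the invalid byte into the
# replacement character U+FFFD — the � seen in the output above.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(as_utf8)  # M�XICO
```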

Here is the Spark read statement:

spark.read.format("csv") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .option("multiLine", True) \
    .option("encoding", "ISO-8859-1") \
    .option("header", "true") \
    .load("country.csv")

option("encoding", "mbcs") and option("encoding", "ansi") give an error.

What can I do to retain the original text as it appears in the input file? Thanks in advance.
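Before trying more encoding options, it can help to look at the raw bytes of the file to see which encoding it really uses. A hypothetical sketch (writing a sample row standing in for country.csv, then inspecting it):

```python
import os
import tempfile

# Hypothetical sample row written as UTF-8, standing in for country.csv.
sample = "MÉXICO,Neu Leon,Monterrey\n"
path = os.path.join(tempfile.mkdtemp(), "country_sample.csv")
with open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

# Inspect the raw bytes around the accented character.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:8])
# The pair 0xC3 0x89 is UTF-8 for É; a lone 0xC9 instead would
# indicate the file is Latin-1/Windows-1252.
```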

Read it in without the encoding option, then create a new column that decodes the affected one:

from pyspark.sql.functions import col, decode

df = df.withColumn("some_col_name", decode(col("column_name"), "ISO-8859-1"))

# One of these charsets will give you what you need:
# 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
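As a pure-Python sketch of why re-decoding can recover the text (assuming the file is really UTF-8 that was first read with the wrong charset):

```python
# The file's UTF-8 bytes for É (0xC3 0x89) get interpreted as two
# separate Latin-1 characters when read with the wrong charset.
mojibake = "MÉXICO".encode("utf-8").decode("iso-8859-1")

# Re-encoding to ISO-8859-1 restores the original bytes, and decoding
# them as UTF-8 recovers the original text — the same round trip the
# decode() column expression performs inside Spark.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
print(repaired)  # MÉXICO
```

Note this only works when the mis-decoding was lossless (as with ISO-8859-1, which maps every byte to a character); bytes already replaced by � cannot be recovered.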


 