I am reading a CSV file that contains data like the following:
Country State City
MÉXICO Neu Leon Monterrey
MÉXICO Chiapas ATLÁNTICO
I tried reading the file into a PySpark DataFrame with the encoding set to 'utf-8' and to 'ISO-8859-1', but the values get changed as shown below.
With option("encoding", "utf-8"):
Country State City
M�XICO Neu Leon Monterrey
M�XICO Chiapas ATL�NTICO
With option("encoding", "ISO-8859-1"):
Country State City
M?XICO Neu Leon Monterrey
M?XICO Chiapas ATL?NTICO
Here is the Spark read statement:
spark.read.format("csv").option("quote", "\"").option("escape", "\"").option('multiLine', True).option("encoding", "ISO-8859-1").option("header", "true").load("country.csv")
option("encoding", "mbcs") and option("encoding", "ansi") both give an error.
What can I do to retain the original text from the input file? Thanks in advance.
Read it in without an encoding option, then create a new column:
from pyspark.sql.functions import col, decode
df.withColumn("some_col_name", decode(col("column_name"), "ISO-8859-1"))
# One of these will give you what you need: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
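Before guessing at encodings inside Spark, it can help to look at the raw bytes with plain Python. The sketch below is only an illustration: `try_encodings` is a hypothetical helper (not part of PySpark), and the sample bytes assume the file was saved as Windows-1252, which the question does not confirm. It shows how a single-byte É turns into � when the bytes are decoded as UTF-8, and which candidate encodings decode the sample cleanly.

```python
# Hypothetical helper for inspecting raw bytes; not part of PySpark.
CANDIDATES = ["utf-8", "cp1252", "ISO-8859-1"]


def try_encodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text} for each candidate that decodes without error."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; skip it.
            continue
    return results


# Stand-in for one line of the file. ASSUMPTION: the file is Windows-1252.
# For the real file you could use: open("country.csv", "rb").read(200)
sample = "MÉXICO Chiapas ATLÁNTICO".encode("cp1252")

# Decoding single-byte accented characters as UTF-8 yields the U+FFFD
# replacement character, which is displayed as �.
garbled = sample.decode("utf-8", errors="replace")

decoded = try_encodings(sample)
for enc, text in decoded.items():
    print(enc, "->", text)
```

Note that ISO-8859-1 maps every byte to some character, so it never raises an error; a clean decode is not proof that the encoding is right, and the printed text still needs to be checked by eye.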