
Unicode issue with csv and PySpark

I have a PySpark dataframe containing a Unicode control character, like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
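# \u0001 is the SOH (Start of Heading) control character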
df = spark.createDataFrame([{"a": 0, "b": 1, "c": "somestring\u0001bla"}])

resulting in: [screenshot of the PySpark dataframe output]

I want to eliminate this character, either when reading or when writing a new CSV file. I have tried different options:

option("encoding", "UTF-8")
option("nullValue", "\u0001")
option("encoding", "ISO-8859-1")

and reading back with various encoding options, but nothing works. Any advice on how to do this?
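For reference, a minimal sketch of how the options above were applied when writing and re-reading the CSV; the "/tmp/out" path is a hypothetical placeholder, and df/spark come from the snippet above:

# write the dataframe out as CSV with an explicit encoding
df.write.mode("overwrite").option("encoding", "UTF-8").csv("/tmp/out")

# read it back, trying a different encoding and mapping \u0001 to null
df2 = (
    spark.read
    .option("encoding", "ISO-8859-1")
    .option("nullValue", "\u0001")
    .csv("/tmp/out")
)
df2.show()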

Here is the code to eliminate those characters:

# the \u0001 control character did not survive pasting as text; restored here
df = spark.createDataFrame([{"a": 0, "b": 1, "c": "somestring\u0001bla"}])
df.show()

The special character can't be shown when pasted as text here; it is attached as an image: https://i.stack.imgur.com/21Nh0.png

+---+---+--------------+
|  a|  b|             c|
+---+---+--------------+
|  0|  1|somestringbla|
+---+---+--------------+
df.createOrReplaceTempView("data")
spark.sql("select regexp_replace(c,'\u0001','' ) from data").show()
+----------------------+
|regexp_replace(c, , )|
+----------------------+
|         somestringbla|
+----------------------+
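The same replacement can also be done with the DataFrame API instead of a temp view; a minimal sketch, where the output path is a hypothetical placeholder:

from pyspark.sql import functions as F

# strip the \u0001 control character from column c, then write a clean CSV
clean_df = df.withColumn("c", F.regexp_replace("c", "\u0001", ""))
clean_df.write.mode("overwrite").csv("/tmp/clean_out")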

