Unicode issue with csv and PySpark
I have a PySpark dataframe with unicode characters like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": 0, "b": 1, "c": "somestring\u0001bla"}])
I want to eliminate this character, either while reading or while writing a new csv file. I have tried different options:
option("encoding", "UTF-8")
option("nullValue", "\u0001")
option("encoding", "ISO-8859-1")
and reading with various encoding options, but nothing works. Any advice on how to do it?
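For context, this is roughly how I applied those options when writing and reading the CSV (the path below is a placeholder):

df.write.option("encoding", "UTF-8").csv("/tmp/out", header=True)  # write with an explicit encoding
df2 = (spark.read
       .option("encoding", "ISO-8859-1")
       .option("nullValue", "\u0001")
       .csv("/tmp/out", header=True))  # read back, trying to map the character to null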
Here is the code to eliminate those characters:
df = spark.createDataFrame([{"a": 0, "b": 1, "c": "somestring\u0001bla"}])
df.show()
Can't show that special character while pasting it as text here; it is attached as an image: https://i.stack.imgur.com/21Nh0.png
+---+---+--------------+
|  a|  b|             c|
+---+---+--------------+
|  0|  1|somestringbla|
+---+---+--------------+
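Since the character is invisible in the show() output, one quick way to confirm it is actually in the column is to hex-encode the string (a sketch using hex from pyspark.sql.functions):

from pyspark.sql import functions as F
# the \u0001 byte shows up as "01" between "somestring" and "bla"
df.select(F.hex("c").alias("c_hex")).show(truncate=False)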
df.createOrReplaceTempView("data")
spark.sql("select regexp_replace(c, '\u0001', '') from data").show()
+----------------------+
|regexp_replace(c, , )|
+----------------------+
|         somestringbla|
+----------------------+
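The same replacement also works through the DataFrame API instead of SQL; a minimal sketch (the output path is a placeholder):

from pyspark.sql import functions as F
# strip the \u0001 control character from column c
clean_df = df.withColumn("c", F.regexp_replace("c", "\u0001", ""))
clean_df.show()
# the cleaned dataframe can then be written out as csv
clean_df.write.csv("/tmp/clean", header=True)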