
Spark CSV data source unable to write leading or trailing control characters

import spark.implicits._  // needed for toDF outside the spark-shell (where it is pre-imported)

val value: String = "\u0001" + "V1" + "\u0002"  // ^A (SOH) and ^B (STX) control characters
val df = Seq(value).toDF("f1")
df.show

Now df holds the expected value for field f1, but when it is written with Spark's built-in CSV format using the code below, the ^A and ^B characters do not appear in the output.

df.write.format("csv").option("delimiter", "\t").option("codec", "bzip2").save("temp_out")

Here the temp_out output does not show any ^A or ^B characters for field f1.

Looking forward to some suggestions.

If Spark's save operation is dropping certain characters, you'll notice those bytes are missing when you open the CSV file(s). First, take a look at the bytes in value:

value.getBytes()  // Array[Byte] = Array(1, 86, 49, 2)

saveAsTextFile has been around for a while and is a bit more straightforward. If you can't get the CSV option to work, this is a good workaround.

df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out")
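The original write also asked for bzip2 compression; saveAsTextFile can keep that, since it accepts a Hadoop compression codec class. A minimal sketch, assuming the standard Hadoop BZip2Codec is on the classpath:

import org.apache.hadoop.io.compress.BZip2Codec

// same workaround, but with bzip2-compressed output like the original CSV write
df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out", classOf[BZip2Codec])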

You'll probably still be able to read the file back with the reader's csv method without any dropped characters, as below (but you'll want to confirm with your specific setup):

spark.read.option("delimiter", "\t").csv("temp_out/").take(1)(0).getString(0).getBytes()
// result is Array[Byte] = Array(1, 86, 49, 2)
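If you'd rather stay with the built-in CSV writer, one thing worth trying is its whitespace-trimming options: on write, ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace default to true in Spark 2.2+, and that trimming may also be what strips leading/trailing control characters. A hedged sketch, not verified here (the output path is just an example; test the behavior on your Spark version):

df.write.format("csv")
  .option("delimiter", "\t")
  .option("ignoreLeadingWhiteSpace", "false")   // don't trim the leading \u0001
  .option("ignoreTrailingWhiteSpace", "false")  // don't trim the trailing \u0002
  .option("codec", "bzip2")
  .save("temp_out_csv")  // example path, not from the original post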
