
Spark CSV data source unable to write leading or trailing control characters

import spark.implicits._  // needed for toDF outside the spark-shell (where it is pre-imported)

val value: String = "\u0001" + "V1" + "\u0002"  // ^A (SOH) and ^B (STX) control characters
val df = Seq(value).toDF("f1")
df.show

Now df holds the expected value for field f1, but when it is written with Spark's built-in CSV format using the code below, the ^A and ^B characters do not appear in the output.

df.write.format("csv").option("delimiter", "\t").option("codec", "bzip2").save("temp_out")

Here the temp_out output does not show any ^A or ^B characters for field f1.

Looking forward to some suggestions.

If Spark's save operation is dropping certain characters, you'll notice those bytes are missing when you open the CSV file(s). First, take a look at the bytes in value:

value.getBytes()  // Array[Byte] = Array(1, 86, 49, 2)

saveAsTextFile has been around for a while and is a bit more straightforward. If you can't get the CSV option to work, this is a good workaround.

df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out")
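The original write also asked for bzip2 compression; saveAsTextFile can keep that, since it accepts a Hadoop compression codec class. A minimal sketch, assuming the standard Hadoop BZip2Codec is on the classpath:

import org.apache.hadoop.io.compress.BZip2Codec

// same workaround, but with bzip2-compressed output like the original CSV write
df.rdd.map(_.mkString("\t")).saveAsTextFile("temp_out", classOf[BZip2Codec])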

You'll probably still be able to read the file back with the reader's csv method without any dropped characters, as below (but you'll want to confirm with your specific setup):

spark.read.option("delimiter", "\t").csv("temp_out/").take(1)(0).getString(0).getBytes()
// result is Array[Byte] = Array(1, 86, 49, 2)
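If you'd rather stay with the built-in CSV writer, one thing worth trying is its whitespace-trimming options: on write, ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace default to true in Spark 2.2+, and that trimming may also be what strips leading/trailing control characters. A hedged sketch, not verified here (the output path is just an example; test the behavior on your Spark version):

df.write.format("csv")
  .option("delimiter", "\t")
  .option("ignoreLeadingWhiteSpace", "false")   // don't trim the leading \u0001
  .option("ignoreTrailingWhiteSpace", "false")  // don't trim the trailing \u0002
  .option("codec", "bzip2")
  .save("temp_out_csv")  // example path, not from the original post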
