
Scala - how to pass delimiter as a variable when writing dataframe as csv

Using a variable as the delimiter for dataframe.write.csv is not working, and trying out alternatives is getting too complicated.

 val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
 val delim_char = "\u001F"

 df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")  // Does not work -- error related to too many chars
 df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test")  //works fine...

I have tried .toHexString, and many other alternatives...

Your declaration works very well. It works both when you give a direct string value and when you pass a reference variable. You will get the character-length error only if you enclose the delimiter value in single quotes ('\u001F'). It has nothing to do with Scala 2.11.8.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xx.x.xxx.xx:xxxx
Spark context available as 'sc' (master = local[*], app id = local-1535083313716).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.io.File
import java.io.File

scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

scala> val warehouseLocation = new File("spark-warehouse").getAbsolutePath
warehouseLocation: String = /usr/hdp/2.6.3.0-235/spark2/spark-warehouse

scala> val spark = SparkSession.builder().appName("app").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/08/24 00:02:25 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37d3e740

scala> import spark.implicits._
import spark.implicits._

scala> import spark.sql
import spark.sql

scala> val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
df: org.apache.spark.sql.DataFrame = [A: string, B: string ... 1 more field]

scala> val delim_char = "\u001F"
delim_char: String = ""

scala> df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")

scala>

Thank you for your help.

The code above works when tested, and I could not find a way to showcase how the problem was being generated. However, the problem was that a variable had been assigned a string (the literal text "\u001F"; println was showing the result as String: \u001F) after it was collected from a csv file.
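A minimal sketch of that failure mode (the variable names and output paths here are illustrative, not from the original post): the character-length error shows up when the variable holds the six-character escaped text rather than the single unit-separator character.

 val realChar = "\u001F"     // length 1: the actual unit-separator character
 val escapedText = "\\u001F" // length 6: backslash, 'u', '0', '0', '1', 'F' -- what was collected from the csv file

 df.coalesce(1).write.option("delimiter", realChar).csv("file:///var/tmp/test_ok")        // works
 // df.coalesce(1).write.option("delimiter", escapedText).csv("file:///var/tmp/test_bad") // fails: delimiter is longer than one character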

Several approaches were tried. Finally found the solution in another Stack Overflow question related to string unicode escapes...

1) Did not work -- delim_char.format("unicode-escape")

2) Worked --

// Replace escaped sequences like \uXXXX in a string with the actual characters they encode.
def unescapeUnicode(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

unescapeUnicode(delim_char)
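A hypothetical end-to-end usage of the fix (the output path and variable name are illustrative): unescape the value collected from the csv file, then pass the resulting single character to the writer.

 val unescaped_delim = unescapeUnicode(delim_char)  // now a single unit-separator character
 df.coalesce(1).write.option("delimiter", unescaped_delim).csv("file:///var/tmp/test")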
