Scala - how to pass delimiter as a variable when writing dataframe as csv
Using a variable as the delimiter for dataframe.write.csv does not work, and the alternatives I have tried are too complicated.
val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
val delim_char = "\u001F"
df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test") // Does not work -- error related to too many chars
df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test") //works fine...
I have already tried .toHexString and many other alternatives...
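For context, the CSV writer only accepts a single-character delimiter, so what matters is the length of the string the variable actually holds. A minimal, Spark-free check (the object name is illustrative):

```scala
object DelimCheck extends App {
  // The real unit-separator character: a single char, length 1.
  // This is what .option("delimiter", ...) accepts.
  val delim_char = "\u001F"
  println(delim_char.length) // 1

  // The six-character escape sequence (backslash, 'u', four hex digits).
  // A string like this is what triggers the "too many chars" error.
  val escaped = "\\u001F"
  println(escaped.length) // 6
}
```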
Your statement works fine. It works whether you supply the string value directly or pass it through a variable. The character-length error appears only when the delimiter value you pass is the escaped sequence (starting with a literal '\') rather than the actual character. It has nothing to do with Scala 2.11.8.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xx.x.xxx.xx:xxxx
Spark context available as 'sc' (master = local[*], app id = local-1535083313716).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0.2.6.3.0-235
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.io.File
import java.io.File
scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
scala> val warehouseLocation = new File("spark-warehouse").getAbsolutePath
warehouseLocation: String = /usr/hdp/2.6.3.0-235/spark2/spark-warehouse
scala> val spark = SparkSession.builder().appName("app").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/08/24 00:02:25 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37d3e740
scala> import spark.implicits._
import spark.implicits._
scala> import spark.sql
import spark.sql
scala> val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
df: org.apache.spark.sql.DataFrame = [A: string, B: string ... 1 more field]
scala> val delim_char = "\u001F"
delim_char: String = ""
scala> df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")
scala>
Thank you for your help.
The code above works fine when tested as written, and I could not find a way to demonstrate here how the problem arises. However, the issue was that the variable was assigned from a value collected out of a csv file, so it held the literal escape sequence as a string (the Unicode "\\u001F"; println shows the result as the string \u001F) rather than the actual character.
I tried several approaches and finally found the solution in another Stack Overflow question about Unicode escapes in strings...
1) Did not work - delim_char.format("unicode-escape")
2) Worked -
def unescapeUnicode(str: String): String =
"""\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
m => Integer.parseInt(m.group(1), 16).toChar.toString)
unescapeUnicode(delim_char)
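As a quick sanity check of the helper above (plain Scala, no Spark needed; the `fixed` name is illustrative), the six-character literal collapses to the single control character, which can then be passed to .option("delimiter", ...):

```scala
// Replace every literal \uXXXX escape sequence with the character it denotes
def unescapeUnicode(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

// "\\u001F" is six characters: backslash, 'u', '0', '0', '1', 'F'
val fixed = unescapeUnicode("\\u001F")
assert(fixed.length == 1)               // now a single character
assert(fixed.charAt(0) == 0x1F.toChar)  // the actual unit separator
```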