Scala - Writing dataframe to a file as binary
I have a Hive table stored as Parquet, with a column Content that stores various documents as base64-encoded strings.
Now I need to read that column and write it out to HDFS, so that the base64 value in each row is converted back into a document.
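// Load the Parquet data and expose it to SQL as the table "profiles"
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/")
profileDF.registerTempTable("profiles")
// unbase64 decodes the base64 string into a binary column
val contentsDF = sqlContext.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")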
Now contentsDF holds the binary form of a document in each row, which I need to write to a file. I tried different options but couldn't write the DataFrame content out to a file.
I'd appreciate any help with this.
I would suggest saving it as Parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or convert it to an RDD and save it as an object file:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)
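Note that both suggestions keep the decoded bytes inside a container format (Parquet files, or a SequenceFile of serialized objects) rather than producing one standalone document per row. If separate files are what you are after, one possible approach is to write each row's bytes through the Hadoop FileSystem API. This is only a sketch under some assumptions: the helper names writeDocument and writeDocuments and the output directory are made up for illustration, the DataFrame is assumed to also select file_name (the query in the question selects only contents), and contents is assumed to be non-null.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

// Hypothetical helper: write one byte array to outputDir/name.
def writeDocument(conf: Configuration, outputDir: String, name: String, bytes: Array[Byte]): Unit = {
  val target = new Path(outputDir, name)
  // Resolve the file system from the path (HDFS on a cluster, local disk for local paths).
  val fs = target.getFileSystem(conf)
  val out = fs.create(target, true) // true = overwrite an existing file
  try out.write(bytes) finally out.close()
}

// Hypothetical helper: one output file per row.
// Assumes df has a string column file_name and a non-null binary column contents.
def writeDocuments(df: DataFrame, outputDir: String): Unit =
  df.select("file_name", "contents").rdd.foreachPartition { rows =>
    // Built on the executor, once per partition, instead of being shipped from the driver.
    val conf = new Configuration()
    rows.foreach { row =>
      writeDocument(conf, outputDir, row.getString(0), row.getAs[Array[Byte]](1))
    }
  }

// Possible usage (the output path is an assumption, not from the question):
// val docsDF = sqlContext.sql(
//   "select file_name, unbase64(contents) as contents from profiles where file_name = 'file1'")
// writeDocuments(docsDF, "/hdfspath/output")
```

Writing inside foreachPartition keeps the documents on the executors, so nothing has to be collected to the driver. Rows that share a file_name would overwrite each other with this sketch.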