Drop column(s) in spark csv data frame

I have a dataframe and I concatenate all of its fields together.

After concatenation it becomes another dataframe, and finally I write its output to a csv file partitioned on two of its columns. One of those columns is present in the first dataframe and I do not want to include it in the final output.

Here is my code:

val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
       when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
       when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
       when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
       .filter(!$"FFAction".contains("D"))

Here I am concatenating and creating another dataframe:

val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))     

This is what I have tried:

dfMainOutputFinal
  .drop("DataPartition")
  .write
  .partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("header","true")
  .option("encoding", "\ufeff")
  .option("codec", "gzip")
  .save("path to csv")

Now I do not want the DataPartition column in my output.

I am partitioning on DataPartition, so the partition column itself should not appear in the data files; but because DataPartition is also present in the main dataframe, I am still getting it in the output.

QUESTION 1: How can I ignore a column from the dataframe?

QUESTION 2: Is there any way to add "\ufeff" in the csv output file before writing my actual data so that my encoding format will become UTF-8-BOM?

As per the suggested answer, this is what I have tried:

 val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))

But I am getting the error below:

<console>:238: error: value fieldNames is not a member of Seq[org.apache.spark.sql.types.StructField]
               val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))

Below is my attempt for when I have to remove two columns from the final output:

  val dfMainOutputFinal = dfMainOutput.select($"DataPartition","PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition","PartitionYear").map(c => col(c)): _*).as("concatenated"))

Question 1:

The columns you use in df.write.partitionBy() will not be added to the final csv file. They are automatically ignored, since the data is encoded in the file structure. However, if what you mean is to remove it from the concat_ws (and thereby from the file), that is possible with a small change:

concat_ws("|^|", 
  dfMainOutput.schema.fieldNames
    .filter(_ != "DataPartition")
    .map(c => col(c)): _*).as("concatenated"))

Here the column DataPartition is filtered away before the concatenation.
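
For the two-column case asked about above, note that filter(_ != "DataPartition","PartitionYear") is not valid Scala. A sketch of one way to do it, reusing the question's column names, is to filter against a set of excluded names:

val excluded = Set("DataPartition", "PartitionYear")

val dfMainOutputFinal = dfMainOutput.select(
  $"DataPartition", $"PartitionYear",
  concat_ws("|^|",
    dfMainOutput.schema.fieldNames
      .filterNot(excluded.contains)  // drop both columns before concatenating
      .map(c => col(c)): _*).as("concatenated"))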

Question 2:

Spark does not seem to support UTF-8 BOM, and it seems to cause problems when reading files in that format. I can't think of any easy way to add the BOM bytes to each csv file other than writing a script to add them after Spark has finished. My recommendation would be to simply use normal UTF-8 formatting:

dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("header", "true")
  .option("encoding", "UTF-8")
  .option("codec", "gzip")
  .save("path to csv")

Additionally, according to the Unicode standard, BOM is not recommended:

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
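
If a BOM is nevertheless required downstream, here is a minimal post-processing sketch; the output path is a placeholder, and with the gzip codec you would have to decompress first, since the BOM must precede the CSV bytes rather than the gzip header:

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Prepend the UTF-8 BOM (0xEF 0xBB 0xBF) to every part file after
// Spark has finished writing. "path to csv" is a placeholder.
val bom = Array(0xEF.toByte, 0xBB.toByte, 0xBF.toByte)

Files.walk(Paths.get("path to csv")).iterator().asScala
  .filter(p => p.getFileName.toString.startsWith("part-"))
  .foreach { p =>
    val original = Files.readAllBytes(p)
    Files.write(p, bom ++ original)  // rewrite the file with the BOM prepended
  }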

QUESTION 1: How can I ignore a column from the dataframe?

Ans:

case class Person(age: Int, height: Int, weight: Int)

val df = sc.parallelize(List(Person(1,2,3), Person(4,5,6))).toDF("age", "height", "weight")

df.columns
df.show()



+---+------+------+
|age|height|weight|
+---+------+------+
|  1|     2|     3|
|  4|     5|     6|
+---+------+------+


val df_new = df.select("age", "height")
df_new.columns
df_new.show()

+---+------+
|age|height|
+---+------+
|  1|     2|
|  4|     5|
+---+------+

df: org.apache.spark.sql.DataFrame = [age: int, height: int ... 1 more field]
df_new: org.apache.spark.sql.DataFrame = [age: int, height: int]
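
Equivalently, since the question is about dropping columns, drop removes a column by name on the same example dataframe:

val df_dropped = df.drop("weight")  // same result as selecting the remaining columns
df_dropped.show()

+---+------+
|age|height|
+---+------+
|  1|     2|
|  4|     5|
+---+------+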

QUESTION 2: Is there any way to add "\ufeff" in the csv output file before writing my actual data so that my encoding format will become UTF-8-BOM?

Ans:

// Java API; getSparkSession() here is the answerer's helper returning a SparkSession
String path = "/data/vaquarkhan/input/unicode.csv";
String outputPath = "file:/data/vaquarkhan/output/output.csv";

getSparkSession()
  .read()
  .option("inferSchema", "true")
  .option("header", "true")
  .option("encoding", "UTF-8")
  .csv(path)
  .write()
  .mode(SaveMode.Overwrite)
  .csv(outputPath);
