
Import multiple csv in a DataFrame with different headers in Spark

I have multiple CSV files, each holding one variable, like this:

cloudiness.csv

    +---+---+----------+-------------------+
    |_c0| ID|cloudiness|           datetime|
    +---+---+----------+-------------------+
    |  0|  3|       1.0|2013-11-08 00:00:00|
    |  1|303|       2.0|2013-11-08 00:00:00|
    |  2|306|       3.0|2013-11-08 00:00:00|

temperature.csv

    +---+---+-----------+-------------------+
    |_c0| ID|temperature|           datetime|
    +---+---+-----------+-------------------+
    |  0|  3|        3.0|2013-11-08 00:00:00|
    |  1|303|        4.0|2013-11-08 00:00:00|
    |  2|306|        5.0|2013-11-08 00:00:00|

...and so on (7 or 8 of these files).

I have to merge them into a single DataFrame using Spark (R, Python, or Scala), like this:

    +---+---+-----------+----------+-------------------+
    |_c0| ID|temperature|cloudiness|           datetime|
    +---+---+-----------+----------+-------------------+
    |  0|  3|        3.0|       1.0|2013-11-08 00:00:00|
    |  1|303|        4.0|       2.0|2013-11-08 00:00:00|
    |  2|306|        5.0|       3.0|2013-11-08 00:00:00|

I tried spark.read, but it takes too much time; the files are 3 GB each. What is the best way to do this?
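
For reference, a spark.read-based attempt at this merge would look roughly like the sketch below (the paths, the header/inferSchema options, and the join keys are assumptions, not the exact code used). Note that inferSchema forces Spark to scan each file an extra time just to guess column types, which is costly on 3 GB inputs; supplying an explicit schema avoids that second pass.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("weather-merge").getOrCreate()

    // Read each measurement file into its own DataFrame (hypothetical paths)
    val cloudiness = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/cloudiness.csv")

    val temperature = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/temperature.csv")

    // The target layout has one row per (ID, datetime) with all measurements
    // side by side, i.e. an equi-join on those two columns
    val merged = temperature.join(cloudiness.drop("_c0"), Seq("ID", "datetime"))
    merged.show(3)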

The standard way is to join DataFrames.

When you read CSV files using the snippet below:

    val read_csv1 = sc.textFile("HDFS path to read the file")

An RDD will be created, and you can join it with the other CSVs. If you are running into performance issues, let me give you another way.
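
A minimal sketch of that RDD-based join, assuming comma-separated files whose columns are (_c0, ID, measurement, datetime); the paths, the delimiter, the header filter, and the column positions are assumptions to adjust for the real files:

    // Parse each file into a pair RDD keyed by (ID, datetime)
    val cloudinessRdd = sc.textFile("hdfs:///data/cloudiness.csv")
      .filter(line => !line.contains("datetime"))   // drop the header line (assumption about its content)
      .map(_.split(","))
      .map(f => ((f(1), f(3)), f(2)))               // value = cloudiness

    val temperatureRdd = sc.textFile("hdfs:///data/temperature.csv")
      .filter(line => !line.contains("datetime"))
      .map(_.split(","))
      .map(f => ((f(1), f(3)), f(2)))               // value = temperature

    // Join on the shared key so each record carries both measurements,
    // then flatten back to (ID, temperature, cloudiness, datetime) rows
    val joined = temperatureRdd.join(cloudinessRdd)
      .map { case ((id, dt), (temp, cloud)) => (id, temp, cloud, dt) }

    joined.take(3).foreach(println)

Repeating the same pattern (or folding a join over a list of the remaining files) extends this to all 7 or 8 measurement files.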
