
Import multiple csv in a DataFrame with different headers in Spark

I have multiple CSV files, each holding one variable, like this:

cloudiness.csv

    +---+---+----------+-------------------+
    |_c0| ID|cloudiness|           datetime|
    +---+---+----------+-------------------+
    |  0|  3|       1.0|2013-11-08 00:00:00|
    |  1|303|       2.0|2013-11-08 00:00:00|
    |  2|306|       3.0|2013-11-08 00:00:00|

temperature.csv

    +---+---+-----------+-------------------+
    |_c0| ID|temperature|           datetime|
    +---+---+-----------+-------------------+
    |  0|  3|        3.0|2013-11-08 00:00:00|
    |  1|303|        4.0|2013-11-08 00:00:00|
    |  2|306|        5.0|2013-11-08 00:00:00|

...and so on (7 or 8 of these files).

I have to merge them into a single DataFrame using Spark (R, Python, or Scala), like this:

    +---+---+-----------+----------+-------------------+
    |_c0| ID|temperature|cloudiness|           datetime|
    +---+---+-----------+----------+-------------------+
    |  0|  3|        3.0|       1.0|2013-11-08 00:00:00|
    |  1|303|        4.0|       2.0|2013-11-08 00:00:00|
    |  2|306|        5.0|       3.0|2013-11-08 00:00:00|

I tried spark.read, but it takes too much time; the files are 3 GB each. What is the best way to do this?
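
For reference, a spark.read-based attempt at this merge would look roughly like the sketch below (the paths, the header/inferSchema options, and the join keys are assumptions, not the exact code used). Note that inferSchema forces Spark to scan each file an extra time just to guess column types, which is costly on 3 GB inputs; supplying an explicit schema avoids that second pass.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("weather-merge").getOrCreate()

    // Read each measurement file into its own DataFrame (hypothetical paths)
    val cloudiness = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/cloudiness.csv")

    val temperature = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/temperature.csv")

    // The target layout has one row per (ID, datetime) with all measurements
    // side by side, i.e. an equi-join on those two columns
    val merged = temperature.join(cloudiness.drop("_c0"), Seq("ID", "datetime"))
    merged.show(3)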

The standard way is to join DataFrames.

When you read CSV files using the snippet below:

    val read_csv1 = sc.textFile("HDFS path to read the file")

An RDD will be created, and you can join it with the other CSVs. If you are running into performance issues, let me give you another way.
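
A minimal sketch of that RDD-based join, assuming comma-separated files whose columns are (_c0, ID, measurement, datetime); the paths, the delimiter, the header filter, and the column positions are assumptions to adjust for the real files:

    // Parse each file into a pair RDD keyed by (ID, datetime)
    val cloudinessRdd = sc.textFile("hdfs:///data/cloudiness.csv")
      .filter(line => !line.contains("datetime"))   // drop the header line (assumption about its content)
      .map(_.split(","))
      .map(f => ((f(1), f(3)), f(2)))               // value = cloudiness

    val temperatureRdd = sc.textFile("hdfs:///data/temperature.csv")
      .filter(line => !line.contains("datetime"))
      .map(_.split(","))
      .map(f => ((f(1), f(3)), f(2)))               // value = temperature

    // Join on the shared key so each record carries both measurements,
    // then flatten back to (ID, temperature, cloudiness, datetime) rows
    val joined = temperatureRdd.join(cloudinessRdd)
      .map { case ((id, dt), (temp, cloud)) => (id, temp, cloud, dt) }

    joined.take(3).foreach(println)

Repeating the same pattern (or folding a join over a list of the remaining files) extends this to all 7 or 8 measurement files.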
