
How to create a DataSet with 29 columns in Spark

I am trying to create a Dataset from an RDD. Here is my code:

    val word = lines.map(_.value())
    word.print()
    word.foreachRDD { rdd =>
      for (item <- rdd.collect().toArray) {
        val s = item.split(",")
        if (s.length == 37) {
          val collection = sc.parallelize(Seq((s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7), s(8), s(9),
            s(10), s(11), s(12), s(29), s(30), s(31), s(32), s(33), s(34), s(35), s(36))))

          val dataset = sc.parallelize(Seq((s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7), s(8), s(9),
            s(10), s(11), s(12), s(13), s(14), s(15), s(16), s(17), s(18), s(19), s(20), s(21), s(22), s(23), s(24),
            s(25), s(26), s(27), s(28)))).toDS()
        }
      }
    }

When I compile the above, it throws the following error: too many elements for tuple: 29, allowed: 22. Scala version 2.11.11, Spark version 2.2.0.
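For context, the error comes from a language limit rather than from Spark: Scala 2.x only generates the tuple classes Tuple1 through Tuple22, so any tuple literal with more than 22 elements fails to compile. A case class has no such limit (the JVM allows up to 254 constructor parameters). A minimal sketch, independent of Spark, where the names Wide and a1..a23 are made up for illustration:

```scala
// Scala 2.x defines only Tuple1 .. Tuple22, so a 23-element tuple literal
// such as (1, 2, ..., 23) fails with "too many elements for tuple".

// A case class sidesteps the limit; Wide and a1..a23 are hypothetical names.
case class Wide(a1: Int, a2: Int, a3: Int, a4: Int, a5: Int, a6: Int,
                a7: Int, a8: Int, a9: Int, a10: Int, a11: Int, a12: Int,
                a13: Int, a14: Int, a15: Int, a16: Int, a17: Int, a18: Int,
                a19: Int, a20: Int, a21: Int, a22: Int, a23: Int)

val s = (1 to 23).toArray
val w = Wide(s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7), s(8), s(9),
             s(10), s(11), s(12), s(13), s(14), s(15), s(16), s(17), s(18),
             s(19), s(20), s(21), s(22))
println(w.a23) // 23
```

This is also why the answer below reaches for a case class: Spark's Encoders work with case classes of any width, while the tuple-based `toDS()` path is capped at 22 columns.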

You can create a case class (you can change the data types to suit your requirements):

case class dataset(col1: Int, col2: Int, col3: Int, col4: Int, col5: Int, col6: Int,
  col7: Int, col8: Int, col9: Int, col10: Int, col11: Int, col12: Int, col13: Int,
  col14: Int, col15: Int, col16: Int, col17: Int, col18: Int, col19: Int, col20: Int,
  col21: Int, col22: Int, col23: Int, col24: Int, col25: Int, col26: Int, col27: Int,
  col28: Int, col29: Int, col30: Int, col31: Int, col32: Int, col33: Int, col34: Int,
  col35: Int, col36: Int, col37: Int)

And use the case class in your code.

I am going to create a temporary string for test purposes, split the string as you do in your code, and finally use the case class:

val item = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37"
val s=item.split(",").map(_.toInt)
Seq(dataset(s(0),s(1),s(2),s(3),s(4),s(5),s(6),s(7),s(8),s(9),s(10),s(11),s(12),s(13),s(14),s(15),s(16),s(17),s(18),s(19),s(20),s(21),s(22),s(23),s(24),s(25),s(26),s(27),s(28),s(29),s(30),s(31),s(32),s(33),s(34),s(35),s(36))).toDS().show

This should give you a Dataset with 37 columns:

+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11|col12|col13|col14|col15|col16|col17|col18|col19|col20|col21|col22|col23|col24|col25|col26|col27|col28|col29|col30|col31|col32|col33|col34|col35|col36|col37|
+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|   1|   2|   3|   4|   5|   6|   7|   8|   9|   10|   11|   12|   13|   14|   15|   16|   17|   18|   19|   20|   21|   22|   23|   24|   25|   26|   27|   28|   29|   30|   31|   32|   33|   34|   35|   36|   37|
+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

You can change the case class and implementation to suit your 29 columns.
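Applied to the original problem (the 29 fields s(0) through s(28) after the split), one possible sketch follows. The class name Record29 is made up, and all fields are typed String here because split(",") yields strings; adjust the types per column as needed:

```scala
// Hypothetical 29-column case class; every field is a String because
// item.split(",") produces strings. Change types per column as required.
case class Record29(
  c1: String,  c2: String,  c3: String,  c4: String,  c5: String,
  c6: String,  c7: String,  c8: String,  c9: String,  c10: String,
  c11: String, c12: String, c13: String, c14: String, c15: String,
  c16: String, c17: String, c18: String, c19: String, c20: String,
  c21: String, c22: String, c23: String, c24: String, c25: String,
  c26: String, c27: String, c28: String, c29: String)

// Stand-in for one comma-separated record, as in the test string above.
val s = (1 to 29).map(_.toString).toArray
val r = Record29(s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7), s(8), s(9),
  s(10), s(11), s(12), s(13), s(14), s(15), s(16), s(17), s(18), s(19),
  s(20), s(21), s(22), s(23), s(24), s(25), s(26), s(27), s(28))
println(r.c29) // 29
```

Inside the original foreachRDD, after the s.length check, the same construction would replace the 29-element tuple, e.g. Seq(Record29(...)).toDS() (with spark.implicits._ in scope).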

