How to convert Seq[Row] to a DataFrame in Scala

Is there any way to convert a Seq[Row] into a DataFrame in Scala? I have a DataFrame and a list of strings holding the weight of each row in the input DataFrame. I want to build a DataFrame that includes all rows with unique weights. I was able to filter the unique rows and append them to a Seq[Row], but I want to build a DataFrame from it. This is my code. Thanks in advance.

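    import org.apache.spark.sql.{DataFrame, Dataset, Row}
    import scala.collection.mutable.HashSet

    // `val` is a reserved word in Scala, so the weight list is named `vals` here
    def dataGenerator(input: DataFrame, vals: List[String]): Dataset[Row] = {
      val valitr = vals.iterator
      var testdata = Seq[Row]()
      val valset = HashSet[String]() // weights seen so far
      if (valitr != null) {
        // keep the first row seen for each distinct weight
        input.collect().foreach((r) => {
          val valnxt = valitr.next()
          if (!valset.contains(valnxt)) {
            valset += valnxt
            testdata = testdata :+ r
          }
        })
      }
      // logic to convert testdata to a DataFrame and return it
    }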

You said that 'val is calculated using fields from inputdf itself'. If that is the case, you should be able to build a new DataFrame with an extra column for the 'val', like this:

+------+------+
|item  |weight|
+------+------+
|item 1|w1    |
|item 2|w2    |
|item 3|w2    |
|item 4|w3    |
|item 5|w4    |
+------+------+

This is the key step. Once the weight is a column, you can work on the DataFrame directly instead of doing a collect.
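To experiment with this shape, a minimal sketch like the following builds the same five-row table locally. The object name `WeightedInput`, the helper `sampleInput`, and the `local[*]` master are assumptions made for illustration. In real code the weight column would more likely be derived from existing fields (for example with `withColumn`) rather than typed in by hand.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WeightedInput {
  // Builds the five-row example table; the values mirror the table above.
  def sampleInput(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables .toDF on a Seq of tuples
    Seq(
      ("item 1", "w1"),
      ("item 2", "w2"),
      ("item 3", "w2"),
      ("item 4", "w3"),
      ("item 5", "w4")
    ).toDF("item", "weight")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("weighted-input")
      .getOrCreate()
    sampleInput(spark).show(false)
    spark.stop()
  }
}
```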

What is bad about doing collect? There is no point in taking on the trouble and overhead of a distributed big data processing framework just to pull all the data into the memory of one machine. See here: Spark dataframe: collect () vs select ()

Once the input DataFrame has the shape shown above, you can get the result. Here is one way that works: it groups the data by the weight column and picks the first item in each group.

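    import spark.implicits._ // needed for .toDF; assumes a SparkSession named `spark`

    val result = input
        .rdd // get the underlying RDD[Row]
        .groupBy(r => r.get(1)) // group by the "weight" field (column index 1)
        .map(x => x._2.head.getString(0)) // take the first "item" in each group
        .toDF("item") // back to a DataFrame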

You then get only the first item for each duplicated weight:

+------+
|item  |
+------+
|item 1|
|item 2|
|item 4|
|item 5|
+------+
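For completeness, here are two further sketches. The first addresses the literal question: a Seq[Row] can be turned back into a DataFrame with `createDataFrame`, provided a schema is supplied (here it is assumed to be the input DataFrame's own schema). The second stays in the DataFrame API and uses `dropDuplicates` on the weight column. Note that `dropDuplicates` keeps an arbitrary row per weight, not necessarily the first one. The object and function names (`SeqRowSketch`, `rowsToDataFrame`, `onePerWeight`) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

object SeqRowSketch {
  // Turns already-collected rows back into a DataFrame.
  // The schema must describe the rows; input.schema is the usual choice
  // when the rows came from `input` unchanged.
  def rowsToDataFrame(spark: SparkSession, rows: Seq[Row], schema: StructType): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  // Keeps one row per distinct value of the "weight" column without
  // collecting to the driver. Which row survives is not guaranteed.
  def onePerWeight(input: DataFrame): DataFrame =
    input.dropDuplicates("weight")
}
```

Inside the question's `dataGenerator`, the first helper could be called as `SeqRowSketch.rowsToDataFrame(input.sparkSession, testdata, input.schema)`, although the second approach avoids the collect altogether.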
