
Adding a list to a dataframe in Scala / Spark such that each element is added to a separate row

Say, for example, I have a DataFrame in the following format (in reality it has many more documents):

df.show()

//output
    +-----+-----+-----+
    |doc_0|doc_1|doc_2|
    +-----+-----+-----+
    |  0.0|  1.0|  0.0|
    +-----+-----+-----+
    |  0.0|  1.0|  0.0|
    +-----+-----+-----+
    |  2.0|  0.0|  1.0|
    +-----+-----+-----+

// ngramShingles is a list of shingles
println(ngramShingles)

//output
    List("the",  "he ", "e l")

The length of ngramShingles is equal to the length of the DataFrame's columns (that is, the number of rows).

How would I get to the following output?

// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "the"|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "he "|
+-----+-----+-----+-------+
|  2.0|  0.0|  1.0|  "e l"|
+-----+-----+-----+-------+

I have tried to add a column via the following line of code:

val finalDf = df.withColumn("shingle", typedLit(ngramShingles))

But that gives me this output:

+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2|                shingle|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  2.0|  0.0|  1.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+

I have tried a few other solutions, but nothing I have tried comes close. Basically, I just want each element of the list to be added to a separate row of the DataFrame, as a new column.

This question shows how to do this, but both answers rely on one column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.

You could make a DataFrame from your list and then join the two DataFrames together. To do the join you'd need to add an additional column to both, which is used as the join key (it can be dropped afterwards):

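import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val listDf = List("the",  "he ", "e l").toDF("shingle")

// Tag both DataFrames with a temporary id column, join on it, then drop it.
val result = df.withColumn("rn", monotonically_increasing_id())
   .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
   .drop("rn")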

Result:

+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|    the|
|  0.0|  1.0|  0.0|    he |
|  2.0|  0.0|  1.0|    e l|
+-----+-----+-----+-------+
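
One caveat with the join above: monotonically_increasing_id() encodes the partition ID in the generated value, so the ids on the two sides only line up when both DataFrames are partitioned the same way (for example, a single partition each). A possible alternative, sketched below, is to build a consecutive 0-based index with RDD.zipWithIndex and join on that. The object name AddShingleColumn and the helpers withRowIndex and addColumnFromList are hypothetical names made up for this illustration, and the sketch assumes the DataFrame's current row order is the order you want the list matched against.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object AddShingleColumn {

  // Hypothetical helper: appends a consecutive 0-based index column to df.
  // zipWithIndex numbers rows by partition order, then by position within each partition.
  def withRowIndex(spark: SparkSession, df: DataFrame, indexCol: String): DataFrame = {
    val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(df.schema.fields :+ StructField(indexCol, LongType, nullable = false))
    spark.createDataFrame(indexedRows, schema)
  }

  // Hypothetical helper: adds values as a new column, one list element per row.
  def addColumnFromList(spark: SparkSession, df: DataFrame,
                        values: Seq[String], colName: String): DataFrame = {
    import spark.implicits._
    // Index the list locally so the join key does not depend on partitioning.
    val listDf = values.zipWithIndex
      .map { case (value, idx) => (value, idx.toLong) }
      .toDF(colName, "rn")

    withRowIndex(spark, df, "rn")
      .join(listDf, "rn")
      .orderBy("rn")   // restore the original row order after the join
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("add-shingle").getOrCreate()
    import spark.implicits._

    val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0))
      .toDF("doc_0", "doc_1", "doc_2")
    val ngramShingles = List("the", "he ", "e l")

    addColumnFromList(spark, df, ngramShingles, "shingle").show()
    spark.stop()
  }
}
```

Because this is an inner join, rows without a matching list element (or list elements without a matching row) are dropped, so it is worth checking that the list length equals df.count() beforehand. The temporary column name "rn" is also assumed not to clash with an existing column.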


 